Thursday, 29 September 2016

Battling Infectious Diseases in the 20th Century: The Impact of Vaccines

The number of infected people, measured over 70-some years and across all 50 states and the District of Columbia, generally declined after vaccines were introduced.
The heat maps below show the number of cases per 100,000 people.
Tool used: Tableau
The data links are here: Link 1, Link 2
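(A quick aside: the per-100,000 normalization behind these heat maps is easy to reproduce in pandas. This is my own sketch; the file name and the columns state, year, cases, and population are assumptions, not the actual field names in the source data.)

import pandas as pd

# hypothetical input with columns: state, year, cases, population
df = pd.read_csv('disease_cases.csv')

# the rate shown in the heat maps: cases per 100,000 people
df['rate'] = df['cases'] / df['population'] * 100000

# pivot to a state-by-year grid, the same shape the heat maps use
heatmap = df.pivot_table(index='state', columns='year', values='rate')
print(heatmap.head())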

Note: The CDC did not report any weekly, state-level data for pertussis between 1955 and 1974. The vaccine was introduced in 1914.


Note: The vaccine for smallpox was introduced in 1800.

NYPD Vehicle Collisions Using Python


The data link is here

import pandas as pd

csv_path = 'https://raw.githubusercontent.com/nygeog/data/master/nyc_crashes/data/NYPD_Motor_Vehicle_Collisions.csv'  # remote copy of the data
inCSV = 'data/NYPD_Motor_Vehicle_Collisions.csv'  # local input CSV
ouCSV = 'data/nypd_mv_collisions.csv'  # cleaned output CSV

df = pd.read_csv(inCSV).rename(columns=lambda x: x.lower())

# drop rows without a valid latitude (lazy filter: just keep latitudes above 35)
df = df[df.latitude > 35]
# print(df.dtypes)
print(len(df.index))
df.head(5)

The original CSV data (first 5 records)

  | date | time | borough | zip code | latitude | longitude | location | on street name | cross street name | off street name | ... | contributing factor vehicle 2 | contributing factor vehicle 3 | contributing factor vehicle 4 | contributing factor vehicle 5 | unique key | vehicle type code 1 | vehicle type code 2 | vehicle type code 3 | vehicle type code 4 | vehicle type code 5
1 | 02/13/2015 | 21:45 | MANHATTAN | 10002 | 40.715622 | -73.994275 | (40.7156221, -73.9942752) | FORSYTH STREET | CANAL STREET | NaN | ... | NaN | NaN | NaN | NaN | 3168577 | PASSENGER VEHICLE | UNKNOWN | NaN | NaN | NaN
2 | 02/13/2015 | 21:45 | MANHATTAN | 10001 | 40.747535 | -73.988307 | (40.7475349, -73.9883068) | WEST 31 STREET | BROADWAY | NaN | ... | Fatigued/Drowsy | NaN | NaN | NaN | 3169163 | TAXI | TAXI | NaN | NaN | NaN
3 | 02/13/2015 | 21:45 | BRONX | 10462 | 40.833558 | -73.857732 | (40.8335582, -73.8577325) | WESTCHESTER AVENUE | PUGSLEY AVENUE | NaN | ... | Unspecified | NaN | NaN | NaN | 3169251 | PASSENGER VEHICLE | SPORT UTILITY / STATION WAGON | NaN | NaN | NaN
4 | 02/13/2015 | 21:44 | MANHATTAN | 10017 | 40.748800 | -73.969846 | (40.7487997, -73.969846) | EAST 42 STREET | 1 AVENUE | NaN | ... | Other Vehicular | NaN | NaN | NaN | 3169176 | PASSENGER VEHICLE | PASSENGER VEHICLE | NaN | NaN | NaN
5 | 02/13/2015 | 21:40 | STATEN ISLAND | 10304 | 40.617295 | -74.080479 | (40.6172954, -74.0804791) | PARK HILL AVENUE | OSGOOD AVENUE | NaN | ... | Unspecified | NaN | NaN | NaN | 3169614 | PASSENGER VEHICLE | PASSENGER VEHICLE | NaN | NaN | NaN
5 rows × 29 columns



Merging the date (day) and time (hours) columns into a single datetime

#create datetime http://stackoverflow.com/questions/17978092/combine-date-and-time-columns-using-python-pandas
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])


df = df[['datetime','latitude','longitude']]
df = df[df.datetime > '2014-01-01 00:00:01']  # keep only data from 2014 onward
df = df.sort_values('datetime')
df.to_csv(ouCSV, index=False)
print(len(df.index))

192324 #count of records 2014+

df.head(5)

The clean, smaller data set to import into CartoDB

  | datetime | latitude | longitude
228854 | 2014-01-01 00:01:00 | 40.725432 | -73.996771
228850 | 2014-01-01 00:01:00 | 40.767889 | -73.981512
228855 | 2014-01-01 00:01:00 | 40.750844 | -73.978608
229144 | 2014-01-01 00:15:00 | 40.588646 | -73.992452
228849 | 2014-01-01 00:20:00 | 40.689019 | -73.986157



Maps in CartoDB

PostGIS SQL statement to grab 2014 from the 2014-2015/02/13 table. A half-open range is used because BETWEEN is inclusive on both ends and would also match collisions stamped exactly '2015-01-01 00:00:00':

SELECT * FROM nypd_mv_collisions_2014_20150213 WHERE datetime >= '2014-01-01 00:00:00' AND datetime < '2015-01-01 00:00:00'

Density Hex Bins of 2014 Collisions

I'm not totally sure whether CartoDB ignores hex bins with no collisions when it computes the class breaks. Need to look into that.
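For a rough cross-check outside CartoDB, matplotlib can hex-bin the same points (my own sketch, not CartoDB's implementation; mincnt=1 leaves empty bins undrawn, which is one way the empty-bin question can go):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data/nypd_mv_collisions.csv')

# hexagonal binning of collision points; mincnt=1 skips bins with no collisions
plt.hexbin(df['longitude'], df['latitude'], gridsize=100, mincnt=1, cmap='YlOrRd')
plt.colorbar(label='collisions per bin')
plt.title('2014+ NYPD collisions, hex-binned')
plt.show()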


Density of 2014 Collisions over time (using Torque)

Animation of the density of collisions. I think these are kernel densities, but I'm not sure; I need to check the CartoDB documentation.
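Torque does the animating inside CartoDB; as a quick sanity check on the temporal signal it cycles through, the cleaned file can be aggregated per day in pandas (my own sketch, unrelated to how Torque renders):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data/nypd_mv_collisions.csv', parse_dates=['datetime'])

# count collisions per day to see the overall temporal pattern
daily = df.set_index('datetime').resample('D').size()
daily.plot(title='NYPD collisions per day, 2014+')
plt.show()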



Valentine's Day 2014 Collisions (Clickbait)
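Pulling out that one day is a single filter on the cleaned file (a small sketch using the same columns as above):

import pandas as pd

df = pd.read_csv('data/nypd_mv_collisions.csv', parse_dates=['datetime'])

# keep only collisions that occurred on 2014-02-14
vday = df[(df['datetime'] >= '2014-02-14') & (df['datetime'] < '2014-02-15')]
print(len(vday.index))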

UK broadband speeds - EDA in R and Tableau

This analysis will show how to clean the dataset, run the initial analysis in R, and then use the exported data to create an interactive dashboard in Tableau.

Loading data

First, I load the necessary packages and read the data:
library(dplyr)
library(readr)
library(stringr)
ukb <- read_csv("G:\\Praxis Class\\PM (SQL HADOOP)\\DVL Blog\\1\\fixed-broadband-speeds-postcode.csv")

Overview and cleaning

Start with a quick overview of the data set:
glimpse(ukb)
## Observations: 152,533
## Variables: 42
## $ pcd_nospaces                                        <chr> "BR11AB", ...
## $ NextGenerationAccessNGAAvailabilitybyPCpremises     <int> 0, 0, 3, 3...
## $ SFBB30MbitsavailabilitybyPCpremises                 <int> 0, 0, 3, 3...
## $ Numberofconnections2MbitsbyPCnumberoflines          <chr> "0", "0", ...
## $ Numberofconnections210MbitsbyPCnumberoflines        <chr> "0", "1", ...
## $ Numberofconnections1030MbitsbyPCnumberoflines       <chr> "5", "16",...
## $ Numberofconnections30MbitsbyPCnumberoflines         <chr> "0", "0", ...
## $ Average download speed MbitsbyPC                    <chr> "18.8", "1...
## $ MediandownloadspeedMbitsbyPC                        <chr> "18.8", "2...
## $ MinimumdownloadspeedMbitsbyPC                       <chr> "15.1", "8...
## $ MaximumdownloadspeedMbitsbyPC                       <chr> "24", "24"...
## $ TechSpecificAveragedownloadspeedMbitsforNGAlinesby  <chr> "N/A", "N/...
## $ TechSpecificAveragedownloadspeedMbitsfornonNGAlines <chr> "18.8", "1...
## $ TechSpecificAveragedownloadspeedMbitsforSFBBlinesby <chr> "N/A", "N/...
## $ TechSpecificAveragedownloadspeedMbitsforBasicBBNon  <chr> "18.8", "1...
## $ AverageuploadspeedsMbitsbyPC                        <chr> "1.1", "1....
## $ MedianuploadspeedMbitsbyPC                          <chr> "1.1", "1....
## $ MinimumuploadspeedMbitsbyPC                         <chr> "1.1", "1"...
## $ MaximumuploadspeedMbitsbyPC                         <chr> "1.2", "1....
## $ TechSpecificAverageuploadspeedMbitsforNGAlinesbyPC  <chr> "N/A", "N/...
## $ TechSpecificAverageuploadspeedMbitsfornonNGAlinesb  <chr> "1.1", "1....
## $ TechSpecificAverageuploadspeedMbitsforSFBBlinesby   <chr> "N/A", "N/...
## $ TechSpecificAverageuploadspeedMbitsforBasicBBNonS   <chr> "1.1", "1....
## $ pcd                                                 <chr> "BR1 1AB",...
## $ pcd2                                                <chr> "BR1  1AB"...
## $ pcds                                                <chr> "BR1 1AB",...
## $ oslaua                                              <chr> "E09000006...
## $ osward                                              <chr> "E05000109...
## $ gor                                                 <chr> "E12000007...
## $ lsoa01                                              <chr> "E01000676...
## $ msoa01                                              <chr> "E02000134...
## $ oa11                                                <chr> "E00003255...
## $ lsoa11                                              <chr> "E01000676...
## $ msoa11                                              <chr> "E02000134...
## $ lat                                                 <dbl> 51.40627, ...
## $ long                                                <dbl> 0.015177, ...
## $ X1                                                  <chr> NA, NA, NA...
## $ X2                                                  <chr> NA, NA, NA...
## $ X3                                                  <chr> NA, NA, NA...
## $ X4                                                  <chr> NA, NA, NA...
## $ X5                                                  <chr> NA, NA, NA...
## $ X6                                                  <chr> NA, NA, NA...
This shows several problems with the data that will have to be fixed here:
# remove empty columns
ukb[,37:42] <- NULL

# replace the string "N/A" with NA
ukb[ukb == "N/A" ] = NA

# keep only complete cases
ukb <- na.omit(ukb)

# remove "<"
remove_lower_than <-  function(x){
  str_replace(x, '<', '')
}

ukb <- mutate_all(ukb, funs(remove_lower_than))

# convert columns 2:23 and 35:36 (lat/long) to numeric
cols <- c(2:23, 35:36)
ukb[,cols] <- ukb[,cols] %>% lapply(function(x) as.numeric(as.character(x)))

Adding variables

The data is aggregated at several geographic levels, but I want to add two more to create simple choropleths in Tableau:
# add the postcode district
ukb$Postcode.District <- gsub(" .*$", "", ukb$pcds)

# regex to extract postcode area
ukb$Postcode.Area <- str_extract(ukb$pcds, "^[A-Z]+")

Initial results

Get the top postcodes by median download and upload speeds:
ukb %>% select(pcds, MediandownloadspeedMbitsbyPC) %>% arrange(desc(MediandownloadspeedMbitsbyPC)) %>% head()
## # A tibble: 6 x 2
##      pcds MediandownloadspeedMbitsbyPC
##     <chr>                        <dbl>
## 1 N16 5ED                          152
## 2 BR1 2JE                          120
## 3 N21 1RL                          120
## 4  N6 4LL                          120
## 5 BR2 9TQ                          100
## 6 CR0 6SB                          100
ukb %>% select(pcds, MedianuploadspeedMbitsbyPC) %>% arrange(desc(MedianuploadspeedMbitsbyPC)) %>% head()
## # A tibble: 6 x 2
##      pcds MedianuploadspeedMbitsbyPC
##     <chr>                      <dbl>
## 1 BR1 2GD                         20
## 2 BR1 2GE                         20
## 3 BR3 5YR                         20
## 4 BR6 0NW                         20
## 5 CR8 5GA                         20
## 6 DA16 3J                         20
Create a map in leaflet showing median download speeds. Only the first 100 rows were used, to make the report generate faster:
library(leaflet)

leaflet(data = ukb[1:100,]) %>% addTiles() %>% addMarkers(~long, ~lat,
  clusterOptions = markerClusterOptions(), popup = ~as.character(MediandownloadspeedMbitsbyPC)
)


Writing the cleaned file


write.csv(ukb, "ukb.csv", row.names = F)
Display the dashboard created in Tableau


Monday, 29 August 2016

Google Viz

[GeoChart of the StateMilk data, chart ID GeoChartID26f01b803502, built with googleVis 0.6.0 under R 3.3.1 (2016-06-21)]

Friday, 19 August 2016

Charts

First web chart

[Chart: Pizza Lover vs. Ice Cream Lover]

Thursday, 4 August 2016

Baseball Analytics: An Introduction to Sabermetrics using Python



Sabermetrics is the application of statistical analysis to baseball data in order to measure in-game activity. The term comes from SABR (the Society for American Baseball Research) and metrics (as in econometrics).
The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
In 2003, Michael Lewis published Moneyball, about Billy Beane, the Oakland Athletics' general manager since 1997. The book centers on Billy Beane's use of Sabermetrics to identify and recruit under-valued baseball players. With this strategy, his team could achieve as many wins as teams with more than double the payroll. The figures below show the relationship between team salaries and number of wins for the years 1997, 2001, 2003, and 2013. The green dot represents the Oakland Athletics, the blue dot the New York Yankees, and the red dot the Boston Red Sox. We can see that the Oakland Athletics went from an underperforming team in 1997 to a highly competitive one, with a number of wins comparable to the New York Yankees. The Oakland Athletics made the play-offs in four successive years: 2000, 2001, 2002, and 2003.


Getting the data and setting up your machine

For this blog post, I will use Lahman's Baseball Database and the Python programming language to explain some of the techniques used in Sabermetrics. This database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. You can download the data from this link. I will be using two files from the dataset: Salaries.csv and Teams.csv. To execute the code from this blog, I will use five Python libraries: NumPy, SciPy, Pandas, Matplotlib, and statsmodels, each summarized below with a quick import check after the list.
NumPy: an open-source extension module for Python that provides fast, pre-compiled functions for numerical routines. It adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions that operate on these arrays.

SciPy: widely used in scientific and technical computing. It contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and other tasks common in science and engineering.

Pandas: a library for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas is free software released under the three-clause BSD license.

Matplotlib: a plotting library for NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like wxPython, Qt, or GTK+.

Statsmodels: a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.
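As a quick setup check, all five libraries import in a few lines (a minimal sketch; the aliases are the usual conventions, not requirements):

import numpy as np                      # numerical arrays and routines
import scipy.stats as stats             # statistical functions from SciPy
import pandas as pd                     # tabular data manipulation
import matplotlib.pyplot as plt         # plotting
import statsmodels.formula.api as smf   # formula-based statistical models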



Load the data into Jupyter Notebook:
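A minimal sketch of the load, assuming the two CSV files were extracted next to the notebook (adjust the paths to wherever you unzipped the database):

import pandas as pd

# Lahman's Baseball Database tables used in this post (paths are assumptions)
salaries = pd.read_csv('Salaries.csv')
teams = pd.read_csv('Teams.csv')

print(salaries.head())
print(teams.head())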