Thursday, 29 September 2016

Battling Infectious Diseases in the 20th Century: The Impact of Vaccines

The number of infected people, measured over 70-some years and across all 50 states and the District of Columbia, generally declined after vaccines were introduced.
The heat maps below show the number of cases per 100,000 people.
Tool used: Tableau
The data links are here: Link 1, Link 2
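
The rate itself is simple to compute. Here is a minimal pandas sketch, assuming a tidy input table with hypothetical columns state, year, cases, and population (the file name is made up too):

import pandas as pd

# hypothetical tidy input: one row per state and year
cases = pd.read_csv('measles_cases.csv')

# incidence per 100,000 people, the measure shown in the heat maps
cases['rate_per_100k'] = cases['cases'] / cases['population'] * 100000.0

# pivot to the heat-map layout: one row per state, one column per year
heatmap = cases.pivot(index='state', columns='year', values='rate_per_100k')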






Note: The CDC did not report any weekly, state-level data for pertussis between 1955 and 1974. The vaccine was introduced in 1914.





Note: The vaccine for smallpox was introduced in 1800.



NYPD Vehicle Collisions Using Python


The data link is here

import pandas as pd

# local copy of the data; a hosted copy lives at:
# https://raw.githubusercontent.com/nygeog/data/master/nyc_crashes/data/NYPD_Motor_Vehicle_Collisions.csv
inCSV = 'data/NYPD_Motor_Vehicle_Collisions.csv'
ouCSV = 'data/nypd_mv_collisions.csv'

# read the CSV and lowercase the column names
df = pd.read_csv(inCSV).rename(columns=lambda x: x.lower())

# drop rows without a valid latitude (super lazy: just keep latitudes above 35)
df = df[df.latitude > 35]
# print(df.dtypes)
print(len(df.index))
df.head(5)

The original CSV data (first 5 records)

   date        time   borough        zip code  latitude   longitude   on street name      cross street name  contributing factor vehicle 2  ...  unique key  vehicle type code 1  vehicle type code 2
1  02/13/2015  21:45  MANHATTAN      10002     40.715622  -73.994275  FORSYTH STREET      CANAL STREET       NaN                            ...  3168577     PASSENGER VEHICLE    UNKNOWN
2  02/13/2015  21:45  MANHATTAN      10001     40.747535  -73.988307  WEST 31 STREET      BROADWAY           Fatigued/Drowsy                ...  3169163     TAXI                 TAXI
3  02/13/2015  21:45  BRONX          10462     40.833558  -73.857732  WESTCHESTER AVENUE  PUGSLEY AVENUE     Unspecified                    ...  3169251     PASSENGER VEHICLE    SPORT UTILITY / STATION WAGON
4  02/13/2015  21:44  MANHATTAN      10017     40.748800  -73.969846  EAST 42 STREET      1 AVENUE           Other Vehicular                ...  3169176     PASSENGER VEHICLE    PASSENGER VEHICLE
5  02/13/2015  21:40  STATEN ISLAND  10304     40.617295  -74.080479  PARK HILL AVENUE    OSGOOD AVENUE      Unspecified                    ...  3169614     PASSENGER VEHICLE    PASSENGER VEHICLE

5 rows × 29 columns (the redundant location column and the all-NaN columns, such as off street name, contributing factors 3-5, and vehicle type codes 3-5, are elided here)



Mergin' the date (day) and time (hours)

# create a single datetime column from the date and time strings
# (via http://stackoverflow.com/questions/17978092/combine-date-and-time-columns-using-python-pandas)
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])


df = df[['datetime', 'latitude', 'longitude']]
df = df[df.datetime > '2014-01-01 00:00:01']  # keep only data from 2014 onward
df = df.sort_values('datetime')  # sort chronologically (df.sort() is deprecated)
df.to_csv(ouCSV, index=False)
print(len(df.index))

192324  # count of records from 2014 onward

df.head(5)

The cleaned, slimmed-down data to import into CartoDB

        datetime             latitude   longitude
228854  2014-01-01 00:01:00  40.725432  -73.996771
228850  2014-01-01 00:01:00  40.767889  -73.981512
228855  2014-01-01 00:01:00  40.750844  -73.978608
229144  2014-01-01 00:15:00  40.588646  -73.992452
228849  2014-01-01 00:20:00  40.689019  -73.986157



Maps in CartoDB

PostGIS SQL statement to grab only 2014 from the table covering 2014 through 2015-02-13

SELECT * FROM nypd_mv_collisions_2014_20150213 WHERE datetime >= '2014-01-01 00:00:00' AND datetime < '2015-01-01 00:00:00'
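
If you want to run the same query outside the CartoDB editor, the SQL API accepts it over HTTP. A minimal sketch, assuming a hypothetical account name:

import requests

# hypothetical CartoDB account name; the SQL API returns results as JSON
resp = requests.get(
    'https://myaccount.cartodb.com/api/v2/sql',
    params={'q': "SELECT count(*) FROM nypd_mv_collisions_2014_20150213 "
                 "WHERE datetime >= '2014-01-01' AND datetime < '2015-01-01'"}
)
print(resp.json()['rows'])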

Density Hex Bins of 2014 Collisions

I'm not totally sure whether the class breaks ignore hex bins with no collisions or include them; I need to look into that.
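
For comparison, matplotlib's hexbin makes this behaviour explicit: its mincnt parameter drops bins below a point-count threshold before the colours are scaled. A quick sketch reusing the df from the pandas steps above (whether CartoDB classifies the same way is the open question):

import matplotlib.pyplot as plt

# hex-bin the 2014 collision points; bins with fewer than mincnt points
# are not drawn, so empty bins don't drag the colour classes toward zero
plt.hexbin(df['longitude'], df['latitude'], gridsize=100, mincnt=1, cmap='viridis')
plt.colorbar(label='collisions per bin')
plt.title('2014 NYPD collisions, hex-binned')
plt.show()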


Density of 2014 Collisions over time (using Torque)

An animation of the density of collisions. I think these are kernel densities, but I'm not sure; I need to check the CartoDB documentation.
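
For reference, a plain kernel density over the same points is easy to compute directly. This is only a sketch with SciPy's Gaussian KDE, not necessarily what Torque does under the hood:

import numpy as np
from scipy.stats import gaussian_kde

# fit a 2-D Gaussian KDE on a subsample (all ~190k points would be slow)
pts = df.sample(5000, random_state=0)
xy = np.vstack([pts['longitude'].values, pts['latitude'].values])
kde = gaussian_kde(xy)

# evaluate the density on a coarse grid over NYC's rough bounding box
xi, yi = np.mgrid[-74.3:-73.7:200j, 40.5:41.0:200j]
density = kde(np.vstack([xi.ravel(), yi.ravel()])).reshape(xi.shape)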



Valentine's Day 2014 Collisions (Clickbait)


UK broadband speeds - EDA in R and Tableau

This analysis will show how to clean the dataset, run the initial analysis in R, and then use the exported data to create an interactive dashboard in Tableau.

Loading data

First, I load the necessary packages and read the data:
library(dplyr)
library(readr)
library(stringr)
ukb <- read_csv("G:\\Praxis Class\\PM (SQL HADOOP)\\DVL Blog\\1\\fixed-broadband-speeds-postcode.csv")

Overview and cleaning

Start with a quick overview of the data set:
glimpse(ukb)
## Observations: 152,533
## Variables: 42
## $ pcd_nospaces                                        <chr> "BR11AB", ...
## $ NextGenerationAccessNGAAvailabilitybyPCpremises     <int> 0, 0, 3, 3...
## $ SFBB30MbitsavailabilitybyPCpremises                 <int> 0, 0, 3, 3...
## $ Numberofconnections2MbitsbyPCnumberoflines          <chr> "0", "0", ...
## $ Numberofconnections210MbitsbyPCnumberoflines        <chr> "0", "1", ...
## $ Numberofconnections1030MbitsbyPCnumberoflines       <chr> "5", "16",...
## $ Numberofconnections30MbitsbyPCnumberoflines         <chr> "0", "0", ...
## $ Average download speed MbitsbyPC                    <chr> "18.8", "1...
## $ MediandownloadspeedMbitsbyPC                        <chr> "18.8", "2...
## $ MinimumdownloadspeedMbitsbyPC                       <chr> "15.1", "8...
## $ MaximumdownloadspeedMbitsbyPC                       <chr> "24", "24"...
## $ TechSpecificAveragedownloadspeedMbitsforNGAlinesby  <chr> "N/A", "N/...
## $ TechSpecificAveragedownloadspeedMbitsfornonNGAlines <chr> "18.8", "1...
## $ TechSpecificAveragedownloadspeedMbitsforSFBBlinesby <chr> "N/A", "N/...
## $ TechSpecificAveragedownloadspeedMbitsforBasicBBNon  <chr> "18.8", "1...
## $ AverageuploadspeedsMbitsbyPC                        <chr> "1.1", "1....
## $ MedianuploadspeedMbitsbyPC                          <chr> "1.1", "1....
## $ MinimumuploadspeedMbitsbyPC                         <chr> "1.1", "1"...
## $ MaximumuploadspeedMbitsbyPC                         <chr> "1.2", "1....
## $ TechSpecificAverageuploadspeedMbitsforNGAlinesbyPC  <chr> "N/A", "N/...
## $ TechSpecificAverageuploadspeedMbitsfornonNGAlinesb  <chr> "1.1", "1....
## $ TechSpecificAverageuploadspeedMbitsforSFBBlinesby   <chr> "N/A", "N/...
## $ TechSpecificAverageuploadspeedMbitsforBasicBBNonS   <chr> "1.1", "1....
## $ pcd                                                 <chr> "BR1 1AB",...
## $ pcd2                                                <chr> "BR1  1AB"...
## $ pcds                                                <chr> "BR1 1AB",...
## $ oslaua                                              <chr> "E09000006...
## $ osward                                              <chr> "E05000109...
## $ gor                                                 <chr> "E12000007...
## $ lsoa01                                              <chr> "E01000676...
## $ msoa01                                              <chr> "E02000134...
## $ oa11                                                <chr> "E00003255...
## $ lsoa11                                              <chr> "E01000676...
## $ msoa11                                              <chr> "E02000134...
## $ lat                                                 <dbl> 51.40627, ...
## $ long                                                <dbl> 0.015177, ...
## $ X1                                                  <chr> NA, NA, NA...
## $ X2                                                  <chr> NA, NA, NA...
## $ X3                                                  <chr> NA, NA, NA...
## $ X4                                                  <chr> NA, NA, NA...
## $ X5                                                  <chr> NA, NA, NA...
## $ X6                                                  <chr> NA, NA, NA...
This reveals several problems: six empty columns (X1 to X6), the string "N/A" standing in for missing values, "<" prefixes on some speed values, and numeric columns stored as character. Fix them in turn:
# remove the six empty columns (X1-X6)
ukb[, 37:42] <- NULL

# replace the string "N/A" with real NAs
ukb[ukb == "N/A"] <- NA

# keep only complete cases
ukb <- na.omit(ukb)

# strip the "<" prefix from threshold values
remove_lower_than <- function(x) {
  str_replace(x, '<', '')
}

# note: str_replace coerces every column to character...
ukb <- mutate_all(ukb, funs(remove_lower_than))

# ...so convert the speed columns (2:23) and lat/long (35:36) back to numeric
cols <- c(2:23, 35:36)
ukb[, cols] <- ukb[, cols] %>% lapply(function(x) as.numeric(as.character(x)))

Adding variables

The data is aggregated at several geographic levels, but I want to add two new ones to create simple choropleths in Tableau:
# add the postcode district (everything before the space, e.g. "BR1 1AB" -> "BR1")
ukb$Postcode.District <- gsub(" .*$", "", ukb$pcds)

# regex to extract the postcode area (the leading letters, e.g. "BR1 1AB" -> "BR")
ukb$Postcode.Area <- str_extract(ukb$pcds, "^[A-Z]+")

Initial results

Get the top postcodes by median download and upload speeds:
ukb %>%
  select(pcds, MediandownloadspeedMbitsbyPC) %>%
  arrange(desc(MediandownloadspeedMbitsbyPC)) %>%
  head()
## # A tibble: 6 x 2
##      pcds MediandownloadspeedMbitsbyPC
##     <chr>                        <dbl>
## 1 N16 5ED                          152
## 2 BR1 2JE                          120
## 3 N21 1RL                          120
## 4  N6 4LL                          120
## 5 BR2 9TQ                          100
## 6 CR0 6SB                          100
ukb %>%
  select(pcds, MedianuploadspeedMbitsbyPC) %>%
  arrange(desc(MedianuploadspeedMbitsbyPC)) %>%
  head()
## # A tibble: 6 x 2
##      pcds MedianuploadspeedMbitsbyPC
##     <chr>                      <dbl>
## 1 BR1 2GD                         20
## 2 BR1 2GE                         20
## 3 BR3 5YR                         20
## 4 BR6 0NW                         20
## 5 CR8 5GA                         20
## 6 DA16 3J                         20
Create a map in leaflet showing median download speeds. Only the first 100 rows are used to make the report generate faster:
library(leaflet)

leaflet(data = ukb[1:100, ]) %>%
  addTiles() %>%
  addMarkers(~long, ~lat,
             clusterOptions = markerClusterOptions(),
             popup = ~as.character(MediandownloadspeedMbitsbyPC))


Writing the cleaned file


write.csv(ukb, "ukb.csv", row.names = FALSE)

Finally, the exported file feeds the interactive dashboard created in Tableau.