Thursday, 29 September 2016

UK broadband speeds - EDA in R and Tableau

This analysis will show how to clean the dataset, run the initial analysis in R, and then use the exported data to create an interactive dashboard in Tableau.

Loading data

First I load the necessary packages and read the data
library(dplyr)
library(readr)
library(stringr)
ukb <- read_csv("G:\\Praxis Class\\PM (SQL HADOOP)\\DVL Blog\\1\\fixed-broadband-speeds-postcode.csv")

Overview and cleaning

Start with a quick overview of the data set:
glimpse(ukb)
## Observations: 152,533
## Variables: 42
## $ pcd_nospaces                                        <chr> "BR11AB", ...
## $ NextGenerationAccessNGAAvailabilitybyPCpremises     <int> 0, 0, 3, 3...
## $ SFBB30MbitsavailabilitybyPCpremises                 <int> 0, 0, 3, 3...
## $ Numberofconnections2MbitsbyPCnumberoflines          <chr> "0", "0", ...
## $ Numberofconnections210MbitsbyPCnumberoflines        <chr> "0", "1", ...
## $ Numberofconnections1030MbitsbyPCnumberoflines       <chr> "5", "16",...
## $ Numberofconnections30MbitsbyPCnumberoflines         <chr> "0", "0", ...
## $ Average download speed MbitsbyPC                    <chr> "18.8", "1...
## $ MediandownloadspeedMbitsbyPC                        <chr> "18.8", "2...
## $ MinimumdownloadspeedMbitsbyPC                       <chr> "15.1", "8...
## $ MaximumdownloadspeedMbitsbyPC                       <chr> "24", "24"...
## $ TechSpecificAveragedownloadspeedMbitsforNGAlinesby  <chr> "N/A", "N/...
## $ TechSpecificAveragedownloadspeedMbitsfornonNGAlines <chr> "18.8", "1...
## $ TechSpecificAveragedownloadspeedMbitsforSFBBlinesby <chr> "N/A", "N/...
## $ TechSpecificAveragedownloadspeedMbitsforBasicBBNon  <chr> "18.8", "1...
## $ AverageuploadspeedsMbitsbyPC                        <chr> "1.1", "1....
## $ MedianuploadspeedMbitsbyPC                          <chr> "1.1", "1....
## $ MinimumuploadspeedMbitsbyPC                         <chr> "1.1", "1"...
## $ MaximumuploadspeedMbitsbyPC                         <chr> "1.2", "1....
## $ TechSpecificAverageuploadspeedMbitsforNGAlinesbyPC  <chr> "N/A", "N/...
## $ TechSpecificAverageuploadspeedMbitsfornonNGAlinesb  <chr> "1.1", "1....
## $ TechSpecificAverageuploadspeedMbitsforSFBBlinesby   <chr> "N/A", "N/...
## $ TechSpecificAverageuploadspeedMbitsforBasicBBNonS   <chr> "1.1", "1....
## $ pcd                                                 <chr> "BR1 1AB",...
## $ pcd2                                                <chr> "BR1  1AB"...
## $ pcds                                                <chr> "BR1 1AB",...
## $ oslaua                                              <chr> "E09000006...
## $ osward                                              <chr> "E05000109...
## $ gor                                                 <chr> "E12000007...
## $ lsoa01                                              <chr> "E01000676...
## $ msoa01                                              <chr> "E02000134...
## $ oa11                                                <chr> "E00003255...
## $ lsoa11                                              <chr> "E01000676...
## $ msoa11                                              <chr> "E02000134...
## $ lat                                                 <dbl> 51.40627, ...
## $ long                                                <dbl> 0.015177, ...
## $ X1                                                  <chr> NA, NA, NA...
## $ X2                                                  <chr> NA, NA, NA...
## $ X3                                                  <chr> NA, NA, NA...
## $ X4                                                  <chr> NA, NA, NA...
## $ X5                                                  <chr> NA, NA, NA...
## $ X6                                                  <chr> NA, NA, NA...
This shows several problems with the data that will have to be removed here:
# remove empty columns
ukb[,37:42] <- NULL

# remove string "N/A" with NAs
ukb[ukb == "N/A" ] = NA

# keep only complete cases
ukb <- na.omit(ukb)

# remove "<"
remove_lower_than <-  function(x){
  str_replace(x, '<', '')
}

ukb <- mutate_all(ukb, funs(remove_lower_than))

# change the columns 2:23 to numeric
cols <- c(2:23, 35:36)
ukb[,cols] <- ukb[,cols] %>% lapply(function(x) as.numeric(as.character(x)))

Adding variables

Data is aggregated on several geographic level, but I want to add two new ones to create simple choropleths in Tableau:
# add the postcode district
ukb$Postcode.District <- gsub(" .*$", "", ukb$pcds)

# regex to extract postcode area
ukb$Postcode.Area <- str_extract(ukb$pcds, "^[A-Z]+")

Initial results

Get the top postcodes by median download and upload speeds:
ukb %>% select(pcds, MediandownloadspeedMbitsbyPC) %>% arrange(desc(MediandownloadspeedMbitsbyPC)) %>% head()
## # A tibble: 6 x 2
##      pcds MediandownloadspeedMbitsbyPC
##     <chr>                        <dbl>
## 1 N16 5ED                          152
## 2 BR1 2JE                          120
## 3 N21 1RL                          120
## 4  N6 4LL                          120
## 5 BR2 9TQ                          100
## 6 CR0 6SB                          100
ukb %>% select(pcds, MedianuploadspeedMbitsbyPC) %>% arrange(desc(MedianuploadspeedMbitsbyPC)) %>% head()
## # A tibble: 6 x 2
##      pcds MedianuploadspeedMbitsbyPC
##     <chr>                      <dbl>
## 1 BR1 2GD                         20
## 2 BR1 2GE                         20
## 3 BR3 5YR                         20
## 4 BR6 0NW                         20
## 5 CR8 5GA                         20
## 6 DA16 3J                         20
Create a map in leaflet showing median download speeds: Only first 100 rows were used in order to make the report generate faster.
library(leaflet)

leaflet(data = ukb[1:100,]) %>% addTiles() %>% addMarkers(~long, ~lat,
  clusterOptions = markerClusterOptions(), popup = ~as.character(MediandownloadspeedMbitsbyPC)
)


Writing the cleaned file


write.csv(ukb, "ukb.csv", row.names = F)
Display the dashboard created in Tableau


1 comment: