Overview and cleaning
Start with a quick overview of the data set:
glimpse(ukb)
## Observations: 152,533
## Variables: 42
## $ pcd_nospaces <chr> "BR11AB", ...
## $ NextGenerationAccessNGAAvailabilitybyPCpremises <int> 0, 0, 3, 3...
## $ SFBB30MbitsavailabilitybyPCpremises <int> 0, 0, 3, 3...
## $ Numberofconnections2MbitsbyPCnumberoflines <chr> "0", "0", ...
## $ Numberofconnections210MbitsbyPCnumberoflines <chr> "0", "1", ...
## $ Numberofconnections1030MbitsbyPCnumberoflines <chr> "5", "16",...
## $ Numberofconnections30MbitsbyPCnumberoflines <chr> "0", "0", ...
## $ Average download speed MbitsbyPC <chr> "18.8", "1...
## $ MediandownloadspeedMbitsbyPC <chr> "18.8", "2...
## $ MinimumdownloadspeedMbitsbyPC <chr> "15.1", "8...
## $ MaximumdownloadspeedMbitsbyPC <chr> "24", "24"...
## $ TechSpecificAveragedownloadspeedMbitsforNGAlinesby <chr> "N/A", "N/...
## $ TechSpecificAveragedownloadspeedMbitsfornonNGAlines <chr> "18.8", "1...
## $ TechSpecificAveragedownloadspeedMbitsforSFBBlinesby <chr> "N/A", "N/...
## $ TechSpecificAveragedownloadspeedMbitsforBasicBBNon <chr> "18.8", "1...
## $ AverageuploadspeedsMbitsbyPC <chr> "1.1", "1....
## $ MedianuploadspeedMbitsbyPC <chr> "1.1", "1....
## $ MinimumuploadspeedMbitsbyPC <chr> "1.1", "1"...
## $ MaximumuploadspeedMbitsbyPC <chr> "1.2", "1....
## $ TechSpecificAverageuploadspeedMbitsforNGAlinesbyPC <chr> "N/A", "N/...
## $ TechSpecificAverageuploadspeedMbitsfornonNGAlinesb <chr> "1.1", "1....
## $ TechSpecificAverageuploadspeedMbitsforSFBBlinesby <chr> "N/A", "N/...
## $ TechSpecificAverageuploadspeedMbitsforBasicBBNonS <chr> "1.1", "1....
## $ pcd <chr> "BR1 1AB",...
## $ pcd2 <chr> "BR1 1AB"...
## $ pcds <chr> "BR1 1AB",...
## $ oslaua <chr> "E09000006...
## $ osward <chr> "E05000109...
## $ gor <chr> "E12000007...
## $ lsoa01 <chr> "E01000676...
## $ msoa01 <chr> "E02000134...
## $ oa11 <chr> "E00003255...
## $ lsoa11 <chr> "E01000676...
## $ msoa11 <chr> "E02000134...
## $ lat <dbl> 51.40627, ...
## $ long <dbl> 0.015177, ...
## $ X1 <chr> NA, NA, NA...
## $ X2 <chr> NA, NA, NA...
## $ X3 <chr> NA, NA, NA...
## $ X4 <chr> NA, NA, NA...
## $ X5 <chr> NA, NA, NA...
## $ X6 <chr> NA, NA, NA...
This shows several problems with the data that will have to be removed here:
ukb[,37:42] <- NULL
ukb[ukb == "N/A" ] = NA
ukb <- na.omit(ukb)
remove_lower_than <- function(x){
str_replace(x, '<', '')
}
ukb <- mutate_all(ukb, funs(remove_lower_than))
cols <- c(2:23, 35:36)
ukb[,cols] <- ukb[,cols] %>% lapply(function(x) as.numeric(as.character(x)))
Adding variables
Data is aggregated on several geographic level, but I want to add two new ones to create simple choropleths in Tableau:
ukb$Postcode.District <- gsub(" .*$", "", ukb$pcds)
ukb$Postcode.Area <- str_extract(ukb$pcds, "^[A-Z]+")
Initial results
Get the top postcodes by median download and upload speeds:
ukb %>% select(pcds, MediandownloadspeedMbitsbyPC) %>% arrange(desc(MediandownloadspeedMbitsbyPC)) %>% head()
## # A tibble: 6 x 2
## pcds MediandownloadspeedMbitsbyPC
## <chr> <dbl>
## 1 N16 5ED 152
## 2 BR1 2JE 120
## 3 N21 1RL 120
## 4 N6 4LL 120
## 5 BR2 9TQ 100
## 6 CR0 6SB 100
ukb %>% select(pcds, MedianuploadspeedMbitsbyPC) %>% arrange(desc(MedianuploadspeedMbitsbyPC)) %>% head()
## # A tibble: 6 x 2
## pcds MedianuploadspeedMbitsbyPC
## <chr> <dbl>
## 1 BR1 2GD 20
## 2 BR1 2GE 20
## 3 BR3 5YR 20
## 4 BR6 0NW 20
## 5 CR8 5GA 20
## 6 DA16 3J 20
Create a map in leaflet showing median download speeds: Only first 100 rows were used in order to make the report generate faster.
library(leaflet)
leaflet(data = ukb[1:100,]) %>% addTiles() %>% addMarkers(~long, ~lat,
clusterOptions = markerClusterOptions(), popup = ~as.character(MediandownloadspeedMbitsbyPC)
)
Writing the cleaned file
write.csv(ukb, "ukb.csv", row.names = F)
Display the dashboard created in Tableau