The use case for this analysis is to give potential drivers advance notice of where parking is available, so that they can plan accordingly and save time and money. The approach is to use San Francisco open parking data to forecast parking demand at the block level, giving drivers a better sense of where they are most likely to find a spot. The analysis is useful for any city with limited parking resources: by replicating it, a city can measure the occupancy of each parking block and predict future demand. That helps planners allocate resources more efficiently and craft better parking policies, and it helps drivers, visitors, and residents find parking more quickly while reducing their carbon footprint.
knitr::opts_chunk$set(
warning=FALSE,
message=FALSE,
results='hide')
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(RSocrata)
library(purrr)
library(riem)
library(gganimate)
library(ggridges)
library(viridis)
library(scales)
library(cowplot)
library(caret)
library(knitr)
library(pscl)
library(plotROC)
library(pROC)
library(hms)
sf::sf_use_s2(FALSE)
options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
meters <- read.socrata('https://data.sfgov.org/resource/8vzz-qzz9.json') %>% st_as_sf(coords = c('longitude','latitude'),crs = 4326)
trips1 <- read.socrata("https://data.sfgov.org/resource/imvp-dq3v.csv?$where=session_start_dt between '2021-06-07T00:00:00' and '2021-06-28T00:00:00'")
trips <- trips1 %>% mutate(
start_time = session_start_dt %>% as_datetime(),
end_time = session_end_dt %>% as_datetime(),
# make the units explicit: difftime's default units are data-dependent
length_secs = as.numeric(difftime(end_time, start_time, units = "secs")),
rate = gross_paid_amt/(length_secs/60), # dollars per minute
hour = start_time %>% hour(),
dotw = start_time %>% wday(label = T),
weekend = ifelse(dotw %in% c('Sun','Sat'),1,0),
# snap session starts and ends to a 15-minute grid
end_interval15 = floor_date(ymd_hms(session_end_dt), unit = "15 mins") %>% as_datetime(),
start_interval15 = floor_date(ymd_hms(session_start_dt), unit = "15 mins") %>% as_datetime(),
length = as.numeric(difftime(end_interval15, start_interval15, units = "secs")),
timeofday = start_interval15 %>% hour() + start_interval15 %>% minute()/60) %>%
# keep sessions between 10 minutes and 6 hours
filter(length_secs > 600 & length_secs < 21600)
sfbounds <- c(ymax= 100000, ymin= 63410.51,xmin= 139343.4 , xmax=177200 )
water <- area_water('CA', 'San Francisco') %>% st_transform(crs = 7132) %>% st_crop(y=sfbounds) %>%
st_transform(crs = 4326)
streets <- read_sf('streets.geojson') %>% st_transform(crs = 7132)%>% st_crop(y=sfbounds) %>%
st_transform(crs = 4326)
meters <- meters %>% mutate(on_offstreet_type = on_offstreet_type %>% factor(levels = c("ON", "OFF")),
capcol = case_when(cap_color == 'Black' ~"Motorcycle",
cap_color == 'Brown' ~"Tour bus",
cap_color == 'Purple' ~"Boat trailer",
cap_color == 'Red' ~"Six wheel",
cap_color == 'Yellow' ~"Commercial",
cap_color == 'Green' ~"Short term",
cap_color == 'Grey'~ "General",
cap_color == '-' ~ 'NA'))
dtbounds <- c(ymax= 37.792,ymin= 37.768,xmin= -122.426 , xmax=-122.386)
dt_meters <- meters %>% st_crop(y=dtbounds) %>% filter(meter_type %in% c('SS','-') & capcol %in% c('General','Short term', 'NA'))
trips_dt <-trips %>% filter(post_id %in% dt_meters$post_id)
k1 <- ggplot()+
geom_sf(data = water, color = NA, fill = 'lightblue')+
geom_sf(data = streets%>% filter(classcode ==5))+
geom_sf(data = streets %>% filter(classcode ==1),size=1.1)+
geom_sf(data = meters %>% filter(meter_type %in% c('SS','-') & capcol %in% c('General','Short term', 'NA')) , color = '#FA7B00', shape = 1)+
geom_rect(aes(ymax= 37.792,ymin= 37.768,xmin= -122.426 , xmax=-122.386), fill = NA, color = 'red', size =2)+
labs(color = '')+
mapTheme()
k2 <- ggplot()+
geom_sf(data = water%>% st_crop(y=dtbounds), color = NA, fill = 'lightblue')+
geom_sf(data = streets%>% filter(classcode ==5) %>% st_crop(y=dtbounds))+
geom_sf(data = streets %>% filter(classcode ==1) %>% st_crop(y=dtbounds),size=1.1)+
geom_sf(data =dt_meters, color = '#FA7B00', shape = 1)+
mapTheme()
plot_grid(k1,k2,ncol =1 )
The audience of our application is limited to those looking for parking in the downtown core of San Francisco, where many destinations (work, cultural amenities, government buildings) are concentrated. Drivers in this area cannot simply move on to an unmetered street as they can in other parts of the city, so they are more likely to rely on our app. The app gives parking occupancy predictions for 333 blocks.
blocks <- inner_join(dt_meters, trips_dt %>%
select(post_id, street_block) %>% unique())
meterblock <- trips_dt %>% select(post_id, street_block) %>% unique()
metercount <- trips_dt %>% select(post_id, street_block) %>% unique() %>%
mutate(metercounter = 1) %>% group_by(street_block) %>% summarise(metercount = sum(metercounter))
# Union each block's meter points into a single geometry per block, keeping
# only blocks with at least 5 meters; a grouped summarise() on an sf object
# unions the group's geometries by default
block_join <- blocks %>%
filter(!is.na(street_block)) %>%
group_by(block = street_block) %>%
filter(n() >= 5) %>%
summarise() %>%
mutate(id = row_number())
# estimate parking occupancy using SF Park data
# Code by Michael Fichman - mfichman@upenn.edu
# This is a basic mock-up of how I estimated Pittsburgh parking occupancies
# to the 15-minute interval. I did that project before tidyverse, so that
# code isn't particularly useful here.
# There is certainly a better way to do this than creating a bunch of columns
# in a mutate statement - it's prime for a looped or more algorithmic
# approach. It's also "blunt force" in how long it assumes people are
# parking - there is a lot of rounding.
# This is a panel data setup. The basic idea: you buy, say, an hour worth of
# parking, and you "get" four 15-minute "tokens" that are "dropped" into the
# occupancy counter of each of the upcoming slots in the panel.
# The code below allows up to 25 tokens (6+ hours), matching the
# session-length filter applied to the trips data above.
#dat <- read.socrata("https://data.sfgov.org/resource/imvp-dq3v.json?post_id=591-25480")
# Divide transaction lengths into 15-minute intervals. This is done very
# roughly; you might want to make it more exact.
# If a transaction is, say, 45 minutes, drop tokens into tokens_00, 01, 02
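# A minimal worked illustration with a hypothetical session: a purchase from
# 9:05 to 9:50 floors to the 9:00 and 9:45 interval starts, so
# length = 45 minutes = 2700 seconds and tokens = 2700/900 = 3; the session
# then drops a token into the 9:00, 9:15, and 9:30 slots
example_tokens <- 2700/900
ifelse(example_tokens >= 1:4, 1, 0) # tokens_00..tokens_03 -> 1 1 1 0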
dat2 <- trips_dt %>% group_by(start_interval15, post_id) %>%
mutate(tokens = sum(length)/900) %>% # purchased time, in 15-minute tokens
ungroup()
# token k is dropped when a session covers at least k + 1 intervals
for(i in 0:24){
dat2[[sprintf("tokens_%02d", i)]] <- ifelse(dat2$tokens >= i + 1, 1, 0)
}
# Summarize this data by start time and meter
dat3 <- dat2 %>%
group_by(start_interval15, post_id) %>%
summarize(across(starts_with("tokens_"), ~ sum(.x, na.rm = TRUE)), .groups = "drop")
#Create a panel consisting of all the time/meter observations in the set
# Add a day of the year to each observation, join it to the transaction data
# This might need to be tinkered with to make sure every time period for every meter is included
# There are some weird one-off transactions off hours that might need to be cleaned out
study.panel <-
expand.grid(start_interval15=unique(dat3$start_interval15),
post_id = unique(dat3$post_id)) %>%
mutate(doty = yday(start_interval15)) %>%
left_join(., dat3) %>%
mutate_at(grep("token", names(.), value = T), ~ifelse(is.na(.), 0, .)) %>%
arrange(start_interval15)
# Estimate occupancy by compiling the current tokens and the previous tokens
# that carry forward. The observations at 15:00 hours are likely the people
# who start the day parking - not every place has the same metered hours.
# study.panel is sorted by start_interval15, so within each meter-day group
# a lag of k looks back k intervals: a token bought k intervals ago still
# occupies the current interval
transaction_panel <- study.panel %>% group_by(post_id, doty)
for(i in 1:24){
tok <- sprintf("tokens_%02d", i)
lg <- sprintf("lag%02d", i)
transaction_panel <- transaction_panel %>%
mutate(!!lg := coalesce(dplyr::lag(.data[[tok]], i), 0))
}
# a meter is occupied if any current or carried-forward token lands here
transaction_panel <- transaction_panel %>%
mutate(occupancy = ifelse(tokens_00 + rowSums(across(starts_with("lag"))) >= 1, 1, 0)) %>%
ungroup()
# join everything
transaction_panel3 <- transaction_panel %>% left_join(meterblock, by = c("post_id")) %>%
relocate(occupancy, street_block, .after = doty) %>% filter(street_block != 0) %>%
mutate(start_interval15 = start_interval15 %>% as_datetime())
transaction_panel2 <- transaction_panel3 %>% select(start_interval15, post_id, occupancy, street_block) %>% left_join(metercount)
occ_panel <- transaction_panel2 %>%
group_by(street_block, start_interval15) %>% summarise(
parked = sum(occupancy),
total_spots = mean(metercount),
pct_occ = parked/total_spots) %>%
mutate(occ_col = ifelse(pct_occ <= 1, pct_occ, 1), # cap occupancy at 100%
week = week(start_interval15),
dotw = wday(start_interval15),
day = day(start_interval15),
hour1 = hour(start_interval15)) %>%
# keep blocks with more than 5 meters during metered daytime hours (9-21)
filter(total_spots > 5 & hour1 > 8 & hour1 < 22)
t <- occ_panel %>%
group_by(start_interval15) %>%
summarize(occ = mean(occ_col)) %>%
ungroup()
og1 <- ggplot(trips_dt)+
geom_bar(aes(x = start_interval15))+
labs(title="Transforming Meter Interactions into % Block Occupancy",
subtitle="June 8th, 2021 - June 27th, 2021",
x="", y="Meter Interactions")+
plotTheme()+
# apply tweaks after plotTheme() so they are not overwritten
theme(legend.position = "none", axis.text.x = element_blank())
og2 <- ggplot(t)+
geom_line(aes(start_interval15, occ))+
scale_x_datetime(date_breaks = '1 day', date_labels = '%d')+
scale_y_continuous(labels = percent)+
labs(y ="Average Block Occupancy",
x="")+
plotTheme()
plot_grid(og1,og2,ncol=1)
To prototype our application, we use data collected from San Francisco’s Open Data Portal between June 7 and June 28, 2021. We transformed a dataset of meter interactions into a percent occupancy for each block in each 15-minute interval.
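As a minimal illustration of that transformation (with hypothetical meters and values): a block’s occupancy in a given interval is simply the share of its meters with an active session.
# toy example: a block with 3 meters, two of them occupied at 10:00
toy <- tibble(post_id = c("m1", "m2", "m3"),
interval = as_datetime("2021-06-22 10:00:00"),
occupancy = c(1, 1, 0))
toy %>% group_by(interval) %>%
summarise(pct_occ = sum(occupancy)/n()) # 2/3 = 67% occupied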
s1 <- occ_panel %>% group_by(street_block) %>% summarise(occ_avg = mean(occ_col)) %>% arrange(-occ_avg) %>% slice(seq(1,nrow(.),20))
s2 <-occ_panel %>% group_by(street_block) %>% summarise(occ_avg = mean(occ_col)) %>% arrange(-occ_avg) %>% slice(1:20)
g<- occ_panel %>% filter(day == 22 & street_block %in% s2$street_block)%>%
mutate(timeofday = start_interval15 %>% hour() + start_interval15 %>% minute()/60)
ggplot(g)+
geom_ridgeline(aes(x = timeofday, y = reorder(street_block, occ_col), height = occ_col),
scale = 1.2)+
labs(title = 'Occupancy % of 20 Highest Occupancy Blocks', subtitle = 'Tuesday, June 22, 2021', y = '', x = '' )+
scale_x_continuous(breaks = seq(3,24,3), minor_breaks = seq(1,24,1))+
theme(panel.grid.minor.x = element_line(size = 0.5))
h1 <- occ_panel %>% group_by(street_block) %>% summarise(occ_avg = mean(occ_col)) %>% arrange(-occ_avg) %>% slice(seq(1,nrow(.),100))
h2 <-occ_panel %>% group_by(street_block) %>% summarise(occ_avg = mean(occ_col)) %>% arrange(-occ_avg) %>% slice(1)
h<- occ_panel %>% filter(street_block %in% h2$street_block ) %>%
mutate(timeofday = start_interval15 %>% hour() + start_interval15 %>% minute()/60)
ggplot(h)+
geom_ridgeline(aes(x = timeofday, y = as.factor(-day), height = occ_col))+
labs(title = 'Occupancy % on 04th St 1200', subtitle = 'June 7th - 28th, 2021', y = 'Day of Month', x = 'Hour of Day' )+
scale_x_continuous(breaks = seq(3,24,3), minor_breaks = seq(1,24,1))+
theme(panel.grid.minor.x = element_line(size = 0.5))
Block occupancy varies by time of day, day of week, and block. The app surfaces predictions of these variations to help users find a parking spot.
h3 <-occ_panel %>% group_by(street_block) %>% summarise(occ_avg = mean(occ_col)) %>% arrange(-occ_avg) %>% left_join(block_join, by = c('street_block' = 'block')) %>% st_sf()
ggplot(h3)+
geom_sf(aes(color = occ_avg))+
geom_sf(data = streets%>% filter(classcode ==5) %>% st_crop(y=dtbounds))+
geom_sf(data = streets %>% filter(classcode ==1) %>% st_crop(y=dtbounds),size=1.1)+
scale_color_viridis(labels = percent_format(accuracy = 5L))+
labs(color = "Average \nOccupancy")+
mapTheme()
Some sections of the downtown core see little parking demand; others are much more crowded.
anim_panel <- occ_panel %>% filter(week == 24) %>%
left_join(., block_join, by = c('street_block' = 'block')) %>% na.omit() %>% st_sf()
mapanim <- ggplot(anim_panel, aes(color = occ_col))+
geom_sf()+
scale_color_viridis(labels = percent_format(accuracy = 5L))+
labs(title = "Block Occupancy on {current_frame %>% wday(label = T, abbr = F)}",
subtitle = "June {current_frame %>% mday()}th 2021 {current_frame %>% hour()}:00",
color = '') +
transition_manual(start_interval15) + enter_fade() +
exit_shrink() +
mapTheme()
animate(mapanim, duration = 20)
#anim_save("hist3.gif", histanim, duration = 20)
Parking availability varies over time and space. Our app will be most useful during peak times, when availability is scarcest.
histanim <- ggplot(anim_panel)+
geom_histogram(aes(x = occ_col), bins = 10)+
scale_x_continuous(labels = percent)+
labs(title = "Distribution of Block Occupancy on {current_frame %>% wday(label = T, abbr = F)}",
subtitle = "June {current_frame %>% mday()}th 2021 {current_frame %>% hour()}:00",x='% Occupancy',y='# of blocks') +
transition_manual(start_interval15) +
plotTheme()
animate(histanim, duration = 20)
If our app becomes widespread enough, it will itself start to affect parking availability, a feedback effect our business model will have to account for.
We use a logistic regression model to produce the prediction our app serves the customer. Our target variable, parking availability, is binary: a block at or below 50% occupancy (at least half of its spots open) is coded 1, and a block above 50% occupancy is coded 0. As predictors we use lagged occupancy rates at 15-minute, 1-hour, 3-hour, and 1-day intervals, hour-of-day and day-of-week terms, and precipitation and temperature.
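Concretely, the specification we estimate below (a sketch, with $\beta_{dotw}$ standing in for the set of day-of-week coefficients) is:

$$\Pr(\text{occ50} = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{hour1} + \beta_{dotw} + \beta_2\,\text{Percipitation} + \beta_3\,\text{Temperature} + \beta_4\,\text{lag15m} + \beta_5\,\text{lag1H} + \beta_6\,\text{lag3Hours} + \beta_7\,\text{lag1day})}}$$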
#sdf<-riem_stations(network = 'CA_ASOS')
weather.Data <-
riem_measures(station = "SFO", date_start = "2021-06-07", date_end = "2021-06-28")
weather.Panel <-
weather.Data %>%
mutate_if(is.character, list(~replace(as.character(.), is.na(.), "0"))) %>%
replace(is.na(.), 0) %>%
mutate(interval15 = ymd_hms(valid) %>% floor_date(unit = "15 mins")) %>%
mutate(week = week(interval15),
dotw = wday(interval15, label=TRUE)) %>%
group_by(interval15) %>%
summarize(Temperature = max(tmpf),
Percipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
# a Temperature of 0 is a missing reading; replace it with a placeholder
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
park.panel1 <-
occ_panel %>%
arrange(street_block, start_interval15) %>%
group_by(street_block) %>%
# the panel keeps 13 metered hours per day at 15-minute resolution, so
# 4 rows = 1 hour, 12 rows = 3 hours, and 13 x 4 = 52 rows = 1 day
mutate(lag15m = dplyr::lag(occ_col,1),
lag1H = dplyr::lag(occ_col,4),
lag3Hours = dplyr::lag(occ_col,12),
lag1day = dplyr::lag(occ_col,52)) %>%
mutate_at(grep("lag", names(.), value = T), ~ifelse(is.na(.), 0, .)) %>%
ungroup() %>%
left_join(weather.Panel, by = c('start_interval15' = 'interval15')) %>%
mutate(week = week(start_interval15),
dotw = wday(start_interval15, label = TRUE),
hour = hour(start_interval15),
doty = day(start_interval15),
weekend = ifelse(dotw %in% c('Sun','Sat'),1,0),
# 1 = at most half full (parking available), 0 = more than half full
occ50 = ifelse(pct_occ > .5, 0, 1))
park.corr <-
park.panel1 %>% select(occ_col, occ50, lag15m, "lag1H" ,"lag3Hours","lag1day","Temperature", "Percipitation")
ggcorrplot::ggcorrplot(
cor(park.corr, method='pearson') %>%
round(., 2), # %>% sub("^0+", "", .),
colors = c("#6D9EC1", "white", "#E46726"),
type="lower", ggtheme = plotTheme,lab_size=3.25,
lab=TRUE
) +
labs(
title = "Pearson Correlation",
subtitle = "SF Park Variables"
)
# train on the first two weeks of the study period, test on the remainder
park.train <- filter(park.panel1, day <= 20)
park.test <- filter(park.panel1, day > 20)
reg4 <- glm(occ50 ~ hour1 + dotw + Percipitation + Temperature +
lag1H + lag15m + lag3Hours + lag1day, data = park.train, family = binomial(link = 'logit'))
testProbs <- data.frame(Outcome = as.factor(park.test$occ50),
Probs = predict(reg4, park.test, type= "response"))
testProbs$predOutcome = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0))
library(stargazer)
stargazer(reg4, header = FALSE, type = 'html')
| Term              | Dependent variable: occ50   |
|-------------------|-----------------------------|
| hour1             | 0.075*** (0.005)            |
| dotw.L            | 0.198*** (0.068)            |
| dotw.Q            | -0.167** (0.065)            |
| dotw.C            | -0.331*** (0.054)           |
| dotw^4            | 0.508*** (0.041)            |
| dotw^5            | -0.417*** (0.032)           |
| dotw^6            | 0.162*** (0.027)            |
| Percipitation     | 0.008 (0.184)               |
| Temperature       | 0.003*** (0.001)            |
| lag1H             | 0.076 (0.075)               |
| lag15m            | -18.213*** (0.133)          |
| lag3Hours         | -0.163*** (0.056)           |
| lag1day           | -2.173*** (0.068)           |
| Constant          | 9.427*** (0.097)            |
| Observations      | 240,759                     |
| Log Likelihood    | -29,095.880                 |
| Akaike Inf. Crit. | 58,219.760                  |
| Note:             | *p<0.1; **p<0.05; ***p<0.01 |
#code from Alex Nelms' 508 Final
library(ROCR)
focus_dv <- 'occ50'
# use the training data reg4 was fit on, so observed labels align with
# the model's fitted values
focus_df <- park.train
focus_glm <- reg4
focus.ROC <- cbind(as.numeric(focus_df[[focus_dv]]), focus_glm$fitted.values %>% as.numeric()) %>%
as.data.frame()
colnames(focus.ROC) = c("labels","predictions")
pred <- ROCR::prediction(focus.ROC$predictions, focus.ROC$labels)
ROC.perf <- ROCR::performance(pred, measure = "tpr", x.measure="fpr")
####
# find the cutoff whose ROC point is closest to the top-left corner (0, 1)
opt.cut <- function(perf, pred){
cut.ind <- mapply(FUN=function(x, y, p){
d <- (x - 0)^2 + (y - 1)^2 # squared distance to (FPR = 0, TPR = 1)
ind <- which(d == min(d))
c(sensitivity = y[[ind]],
specificity = 1 - x[[ind]],
cutoff = p[[ind]])
}, perf@x.values, perf@y.values, pred@cutoffs)
}
cut_off_opt_df <- opt.cut(ROC.perf, pred) %>%
t() %>% as.data.frame()
cut_off_opt <- cut_off_opt_df$cutoff
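The opt.cut helper above picks the cutoff whose point on the ROC curve lies closest to perfect classification in the top-left corner, i.e. the cutoff minimizing

$$d^2 = \mathrm{FPR}^2 + (1 - \mathrm{TPR})^2$$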
ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ ., scales = 'free') +
scale_fill_manual(values = c('lightgreen','lightblue')) + xlim(0, 1) +
labs(x = "Predicited block occ", y = "Density of probabilities",
title = "Distribution of predicted probabilities by observed outcome") +
geom_vline(xintercept=cut_off_opt,
linetype='dashed', color = 'red') +
geom_vline(xintercept=.5,
linetype='dashed') +
plotTheme() + theme(strip.text.x = element_text(size = 18),
legend.position = "none")
This density plot shows the distribution of predicted probabilities by observed outcome: the x-axis is the predicted probability of parking availability and the y-axis is the density. The two filled distributions correspond to the observed outcomes of no parking available (0) and parking available (1). The black dashed line marks the default 0.5 cutoff, and the red dashed line the computed optimal cutoff. The plot matters because it shows how cleanly the model separates the two observed outcomes.
We calculated an optimal cutoff of 0.9997623 (red line), but because such a high threshold would classify almost every interval as having no parking, producing a large number of false negatives, we kept the standard 0.5 cutoff.
caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 12344 2129
## 1 3594 101813
##
## Accuracy : 0.9523
## 95% CI : (0.951, 0.9535)
## No Information Rate : 0.8671
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7845
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9795
## Specificity : 0.7745
## Pos Pred Value : 0.9659
## Neg Pred Value : 0.8529
## Prevalence : 0.8671
## Detection Rate : 0.8493
## Detection Prevalence : 0.8793
## Balanced Accuracy : 0.8770
##
## 'Positive' Class : 1
##
The confusion matrix summarizes the model's classification performance on the test set. Accuracy is 0.9523 and the Kappa statistic is 0.7845, indicating strong agreement beyond chance. Sensitivity is 0.9795 and specificity is 0.7745: the model classifies the positive class (parking available) far more reliably than the negative class (no parking available). The positive and negative predictive values are 0.9659 and 0.8529, and balanced accuracy is 0.8770. The weak spot is specificity: there is a meaningful chance our model will tell people there is available parking when there actually is not.
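As a quick sanity check, the headline metrics can be reproduced directly from the four counts in the matrix above:
# recompute accuracy, sensitivity, and specificity from the confusion counts
TP <- 101813; FN <- 2129; FP <- 3594; TN <- 12344
c(accuracy = (TP + TN)/(TP + TN + FP + FN), # 0.9523
sensitivity = TP/(TP + FN), # 0.9795
specificity = TN/(TN + FP)) # 0.7745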
ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve - SF Parking Model")
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)
# caret needs a factor outcome with syntactically valid class names;
# occ50 == 1 means the block is at most half full, i.e. parking is available
park.panel1$occ50fact <- factor(ifelse(park.panel1$occ50 == 1, 'Parking', 'No_parking'))
cvFit <- train(occ50fact ~ hour1 + dotw + Percipitation + Temperature+
lag1H + lag15m + lag3Hours + lag1day + weekend, data=park.panel1,
method="glm", family="binomial",
metric="ROC", trControl = ctrl)
cvFit
## Generalized Linear Model
##
## 360639 samples
## 9 predictor
## 2 classes: 'No_parking', 'Parking'
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 357032, 357033, 357032, 357033, 357033, 357033, ...
## Resampling results:
##
## ROC Sens Spec
## 0.9781523 0.764923 0.9796628
dplyr::select(cvFit$resample, -Resample) %>%
gather(metric, value) %>%
left_join(gather(cvFit$results[2:4], metric, mean)) %>%
ggplot(aes(value)) +
geom_histogram(bins=35, fill = "#FF006A") +
facet_wrap(~metric, scales = 'free') +
geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
scale_x_continuous(limits = c(0, 1)) +
labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
subtitle = "Across-fold mean reprented as dotted lines") +
plotTheme()
predmap <- park.test %>% mutate(Probs = predict(reg4, park.test, type = "response"),
predOutcome = as.factor(ifelse(Probs >.5 , 1, 0))
) %>%
left_join(., block_join, by = c('street_block' = 'block')) %>% na.omit() %>% st_sf() %>%
mutate(confusion_outcome = case_when(
occ50 == 1 & predOutcome == 1 ~ "TP: We predict parking and there is",
occ50 == 0 & predOutcome == 1 ~ "FP: We predict parking but there isn't",
occ50 == 1 & predOutcome == 0 ~ "FN: We predict no parking but there is",
occ50 == 0 & predOutcome == 0 ~ "TN: We predict no parking and there is none") %>% factor(levels = c(
"TP: We predict parking and there is",
"TN: We predict no parking and there is none",
"FP: We predict parking but there isn't",
"FN: We predict no parking but there is"
)))
testmap <- predmap %>% group_by(start_interval15) %>%
summarise(avgocc = mean(occ_col),
# share of blocks predicted to be more than half full (occ50 == 0)
avgpred = mean(predOutcome == "0"))
ggplot(testmap)+
geom_line(aes(start_interval15, avgocc), color = 'lightblue')+
geom_line(aes(start_interval15, avgpred), color = 'pink')+
scale_x_datetime(date_breaks = '1 day', date_labels = '%a')+
labs(title="Model accuracy on week 25",
subtitle="Blue is observed occupancy, pink is the predicted share of full blocks",
x="", y="")+
plotTheme()+
theme(legend.position = "none", axis.text.x = element_text(hjust = -1.5))
Our model underpredicts the amount of parking available so that, when it does predict parking, it is more likely to be correct.
predmap <- predmap %>% filter(week == 25)
predanim <- ggplot(predmap, aes(color = confusion_outcome))+
geom_sf()+
labs(title = "Confusion Outcome Animation",
subtitle = "June {current_frame %>% mday()}th 2021 {current_frame %>% hour()}:00") +
transition_manual(start_interval15)+
mapTheme()
animate(predanim, duration = 20)
anim_save("predanim.gif", predanim, duration = 20)
confhist <- predmap %>% ggplot(aes(x = confusion_outcome, fill = confusion_outcome))+
geom_bar(stat = 'count')+
labs(title = "",
subtitle = "June {current_frame %>% mday()}th 2021 {current_frame %>% hour()}:00",
color = '') +
transition_manual(start_interval15)+
plotTheme()
animate(confhist, duration = 20)
The biggest risk for our application is incorrectly predicting that there will be parking. If a user shows up to a block expecting a spot and there isn't one, their trust in the app will fall and we are likely to lose a customer. By doing a cost/benefit analysis of the confusion matrix, we can optimize for customer retention: we weigh the cost of a false positive (predicting that a block has parking when it does not) against the benefit of a true positive (correctly predicting that a block has parking). The confusion matrix lays out the different outcomes and the costs associated with each.
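A minimal sketch of how that optimization could look, assuming purely illustrative outcome values (the 1-unit benefit and 5-unit cost below are hypothetical placeholders, not estimates):
# scan candidate cutoffs and score each with hypothetical outcome values
thresholds <- seq(0.1, 0.9, by = 0.1)
cost_benefit <- purrr::map_dfr(thresholds, function(th){
pred <- ifelse(testProbs$Probs > th, 1, 0)
tibble(threshold = th,
TP = sum(pred == 1 & testProbs$Outcome == "1"),
FP = sum(pred == 1 & testProbs$Outcome == "0"))
}) %>%
# hypothetical: each true positive earns 1 unit of retention value,
# each false positive costs 5 units of user trust
mutate(net = TP*1 - FP*5)
cost_benefit %>% arrange(-net)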
At peak times, the prevalence of full blocks makes predicting vacancy difficult and increases error. We hope our app will still be more useful at these hours than no app at all.
To improve the model, we could add more predictive variables: socioeconomic context (income level, population density, etc.), traffic data, and other information about each area. A more robust meter-interaction dataset and more careful data manipulation would also help; we believe our occupancy estimates are biased downward, meaning true occupancy is higher than our data shows. Spending more time scrutinizing the cutoff and the confusion matrix to optimize for our use case could also pay off. We could additionally try more flexible machine learning models, such as random forests, XGBoost, or neural networks, to further improve the accuracy of our predictions. Finally, we could explore ways to incorporate real-time data into the model, such as streaming data from sensors or other sources.