
This is where it all comes together! After all the time and effort spent constructing my data pipeline and calculating variables, running the linear model function was simple yet illuminating.

Fitting Linear Model One:

where the average year-over-year change in crime rate is the dependent variable, and the average year-over-year change in dispensary density is the independent variable.

Code
# import data
source("~/gvsu/winter 23/CIS 635/NIMBY/cleaning/linear_model/variables/crime and dispenaries per capita.R")
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: 'lubridate'


The following objects are masked from 'package:base':

    date, intersect, setdiff, union


`summarise()` has grouped output by 'Geography'. You can override using the `.groups` argument.
Getting data from the 2010-2014 5-year ACS
Code
model1 <- lm(linear_variables$crime_yoy_avg ~ linear_variables$dsp_yoy_avg,  
             data = linear_variables)
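As a quick sanity check on the fitted object, the estimates can be pulled out programmatically with `coef()`. A minimal sketch, using a small synthetic data frame as a stand-in for `linear_variables` (which lives in the sourced script):

```r
# synthetic stand-in for linear_variables -- illustration only
set.seed(635)
toy <- data.frame(
  crime_yoy_avg = rnorm(64, mean = 0.13, sd = 0.29),
  dsp_yoy_avg   = rnorm(64, mean = 0.00, sd = 0.25)
)

toy_model <- lm(crime_yoy_avg ~ dsp_yoy_avg, data = toy)

# named vector of estimates: (Intercept) and the slope on dsp_yoy_avg
coef(toy_model)
```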

Interpreting the Results | Model 1

adapted from “Residual Plots in R”

Code
library(tidyverse) # ggplot

# plot it: independent variable (dispensary density) on x, dependent (crime) on y
ggplot(linear_variables, aes(x = dsp_yoy_avg, y = crime_yoy_avg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("Average Dispensary Density") +
  ylab("Average Crime Rate")
`geom_smooth()` using formula = 'y ~ x'

Code
summary(model1)

Call:
lm(formula = linear_variables$crime_yoy_avg ~ linear_variables$dsp_yoy_avg, 
    data = linear_variables)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42923 -0.12923 -0.02923  0.07077  1.07077 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)   
(Intercept)                   0.12923    0.03807   3.394   0.0012 **
linear_variables$dsp_yoy_avg -0.07123    0.14326  -0.497   0.6208   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.29 on 62 degrees of freedom
Multiple R-squared:  0.003972,  Adjusted R-squared:  -0.01209 
F-statistic: 0.2472 on 1 and 62 DF,  p-value: 0.6208

Regarding Model One, the results reveal:

Model One is not a good fit for the data, as shown by the negative adjusted R-squared. The slope estimate is small in magnitude, indicating a weak relationship between the variables, and the p-value of 0.62 is far too large to reject the null hypothesis at the 0.05 significance level.
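The numbers quoted above can also be extracted programmatically from the `summary()` object rather than read off the printout. A minimal sketch, again using a synthetic stand-in for `linear_variables`:

```r
# synthetic stand-in for linear_variables -- illustration only
set.seed(635)
toy <- data.frame(
  crime_yoy_avg = rnorm(64),
  dsp_yoy_avg   = rnorm(64)
)
toy_summary <- summary(lm(crime_yoy_avg ~ dsp_yoy_avg, data = toy))

toy_summary$adj.r.squared        # adjusted R-squared
toy_summary$coefficients[2, 4]   # p-value on the slope term
```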

Because this linear model is difficult to read visually, I’m going to compare the residuals between the two models.

Code
ggplot(model1, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5, size = 4, color = "#1679A6") +
  geom_hline(yintercept = 0, color = "#B41876") +
  labs(title = "Model One Residual vs. Fitted Values Plot",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()

Model Two, adding in explanatory variables:

Race, Ethnicity, Poverty Rate, Employment

Code
#  y = YOY crime at the county level
model2 <- lm(linear_variables$crime_yoy_avg ~ linear_variables$dsp_yoy_avg +
               linear_variables$pct_poverty_rate  +
               linear_variables$pct_black  +
               linear_variables$pct_hispanic  +
               linear_variables$employeed_pop,
                data = linear_variables)
Code
library(ggplot2)

# Create a scatterplot of the residuals vs fitted values
model2_plot <- ggplot(model2, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5, size = 4, color = "#1679A6") +
  geom_hline(yintercept = 0, color = "#B41876") +
  labs(title = "Replicated Model: Residual vs. Fitted Values Plot",
       x = "Fitted Values", y = "Residuals") +
  theme_minimal()

# pass the plot explicitly; otherwise ggsave() saves the last plot displayed
ggsave("images/model2.png", plot = model2_plot)
Saving 7 x 5 in image
Code
model2_plot

Code
summary(model2)

Call:
lm(formula = linear_variables$crime_yoy_avg ~ linear_variables$dsp_yoy_avg + 
    linear_variables$pct_poverty_rate + linear_variables$pct_black + 
    linear_variables$pct_hispanic + linear_variables$employeed_pop, 
    data = linear_variables)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.47400 -0.13238 -0.04318  0.08269  1.00885 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                        8.627e-02  1.145e-01   0.753    0.454
linear_variables$dsp_yoy_avg      -6.323e-02  1.543e-01  -0.410    0.683
linear_variables$pct_poverty_rate  6.074e-01  9.582e-01   0.634    0.529
linear_variables$pct_black        -2.071e-01  1.514e+00  -0.137    0.892
linear_variables$pct_hispanic     -1.692e-01  3.680e-01  -0.460    0.647
linear_variables$employeed_pop    -2.045e-07  4.720e-07  -0.433    0.666

Residual standard error: 0.2976 on 58 degrees of freedom
Multiple R-squared:  0.01903,   Adjusted R-squared:  -0.06554 
F-statistic: 0.225 on 5 and 58 DF,  p-value: 0.9502

Interpreting the Results | Model 2

Model 2 is also not an ideal fit for the data set, and I draw the same conclusions as above. Additionally, looking at the residual vs. fitted plot, the points are not evenly distributed around the zero line, which further confirms that Model Two is not a good fit for this dataset.
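Because Model One is nested inside Model Two, the two fits can also be compared formally with a partial F-test (`anova()`) or by information criteria (`AIC()`). A minimal sketch on synthetic stand-in data (the variable names mirror the real ones, but the values are simulated):

```r
# synthetic stand-ins for the real county-level variables -- illustration only
set.seed(635)
toy <- data.frame(
  crime_yoy_avg    = rnorm(64),
  dsp_yoy_avg      = rnorm(64),
  pct_poverty_rate = runif(64, 0.05, 0.25)
)

m1 <- lm(crime_yoy_avg ~ dsp_yoy_avg, data = toy)
m2 <- lm(crime_yoy_avg ~ dsp_yoy_avg + pct_poverty_rate, data = toy)

anova(m1, m2)  # F-test: does the added term significantly improve fit?
AIC(m1, m2)    # lower AIC means a better fit/complexity trade-off
```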