| tutorial-id |
none |
131-stops |
| name |
question |
Jishnu Veerapaneni |
| email |
question |
jishnuvee2035@gmail.com |
| introduction-1 |
question |
Wisdom, Justice, Courage, and Temperance |
| introduction-2 |
question |
> show_file(".gitignore")
stops_files
> |
| introduction-3 |
question |
> show_file("stops.qmd", chunk = "Last")
#| message: false
library(tidyverse)
library(primer.data)
> |
| introduction-4 |
question |
> library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
Warning messages:
1: package ‘purrr’ was built under R version 4.5.1
2: package ‘dplyr’ was built under R version 4.5.1
> |
| introduction-5 |
question |
This data is from the Stanford Open Policing Project, which aims to improve police accountability and transparency by providing data on traffic stops across the United States. The New Orleans dataset includes detailed information about traffic stops conducted by the New Orleans Police Department. |
| introduction-6 |
question |
A causal effect is the difference between 2 potential outcomes. |
| introduction-7 |
question |
The fundamental problem of causal inference is that you are only able to observe 1 potential outcome at a time. |
| introduction-8 |
question |
Number of arrests |
| introduction-9 |
question |
The imaginary variable can be compliance. We can manpulate this variable by assigning 0 as not being compliant and 1 as being compliant. |
| introduction-10 |
question |
The potential outcomes for each arrest are 2, 1 potential outcome for wearing the mask, and another potential outcome for not wearing the mask. |
| introduction-11 |
question |
The 2 different values for the varialbe mask can be 1 for wearing a mask and 0 for not wearing a mask, the potential outcome for wearing a mask can be arrested and the potential outcome for not wearing a mask is NOT arrested. The causal effect for a unit would be not arrested - arrested. |
| introduction-12 |
question |
Sex |
| introduction-13 |
question |
Two different groups of people could be caucasians and african-americans and the average value for arrested for caucasians might be lower than the average value of arrested for african-americans. |
| introduction-14 |
question |
Do black people get arrested more often than white people? |
| wisdom-1 |
question |
Creating a preceptor table and examining the data. |
| wisdom-2 |
question |
A Preceptor table is a table with the least number of rows and columns necessary to answer our question with no missing data. |
| wisdom-3 |
question |
The key components are the units which are the rows of the table consisting of individual drivers, the outcomes is the arrest variable, and the covariate is race which are other variables/columns necessary to answer our initially proposed question. There is also an ID_driver column. |
| wisdom-4 |
question |
The units for this problem are the individual drivers |
| wisdom-5 |
question |
Arrested |
| wisdom-6 |
question |
Race |
| wisdom-7 |
question |
There are no treatments because this is a predictive model |
| wisdom-8 |
question |
The moment presently studying the data/right now |
| wisdom-9 |
question |
The Preceptor table has the units consisting of the rows of the table consisting of individual drivers, the outcomes is the arrest variable, and the covariate is race which are other variables/columns necessary to answer our initially proposed question. There is also an ID_driver column. |
| wisdom-10 |
question |
Are Black drivers more likely to be arrested than White drivers? |
| wisdom-11 |
question |
We examine traffic stops across the United States and how race may vary in the arrest rate to improve police accountability and transparency. Using data from the Open Policing Project by Standford University, we attempt to see what the difference in arrest rate between Black and White drivers are while adjusting for other covariates. |
| justice-1 |
question |
Creating a population table and considering the role of the assumptions we make of validity, stability, representativeness, and unconfoudedness in our problem. |
| justice-2 |
question |
Validity is the consistency or lack of consistency among the columns in the Preceptor table and the data. |
| justice-3 |
question |
Validity might not hold for the race variable/column as the definition of race might vary for a lot of people and some people might be mixed or might not exactly know their race, resulting in a discrepancy between the preceptor and data table columns. |
| justice-4 |
question |
A Population table consists of unit/time combinations. Both the data and preceptor table are drawn from the population table. |
| justice-5 |
question |
Individual pulled over driver / year |
| justice-6 |
question |
The assumption of stability in data science is the consistency or lack thereof that the columns in the Population Table is the same for all the other rows consisting of the data, the Preceptor Table, and the larger population from which both are drawn. |
| justice-7 |
question |
The assumption of stability might not hold if due to the passage of time new apps or techonology came out that decreases risks of people being pulled over. |
| justice-8 |
question |
The assumption of representativeness consists of the relationships in the population table. The 1st relationship is the Preceptor table and other rows, and second is the data and other rows. |
| justice-9 |
question |
The assumption of representativeness might not be true involving the relationship between data and population if the data isn't representative of the popuation and did not employ a random sampling method. |
| justice-10 |
question |
The preceptor table and population relationship assumption might not be true because if the new laws or norms get more emphazized to handle certain races with more "care" in a certain area or zone compared to other areas, it wouldn't be representative of the population. |
| justice-11 |
question |
The assumption of unconfoundedness is the assumption that the potential outcome does not dictate the treatment being assigned. |
| justice-12 |
question |
library(tidymodels)
── Attaching packages ────────────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom 1.0.8 ✔ recipes 1.3.1
✔ dials 1.4.0 ✔ rsample 1.3.0
✔ dplyr 1.1.4 ✔ tibble 3.3.0
✔ ggplot2 3.5.2 ✔ tidyr 1.3.1
✔ infer 1.0.8 ✔ tune 1.3.0
✔ modeldata 1.4.0 ✔ workflows 1.2.0
✔ parsnip 1.3.2 ✔ workflowsets 1.1.1
✔ purrr 1.1.0 ✔ yardstick 1.3.2
── Conflicts ───────────────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ recipes::step() masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
Warning messages:
1: package ‘dplyr’ was built under R version 4.5.1
2: package ‘purrr’ was built under R version 4.5.1
> |
| justice-13 |
question |
> library(broom)
> |
| justice-14 |
question |
$$
P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}}
$$
with \( Y \sim \text{Bernoulli}(\rho) \) where \( \rho = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}} \) |
| justice-15 |
question |
A potential weakness in our model is that the dataset removed almost 3.1 million entries from the real data, reducing it to only about 400,000 entries, this deletion of the entries may lead to a misrepresentation of the population, as the possibility exists that the remaining/current data is only from select areas with selected and specific conditions present, and could be from biased officers who are more likely to arrest drivers compared to other officers in the zone. |
| courage-1 |
question |
Creating a data generating mechanism |
| courage-2 |
exercise |
linear_reg(engine = "lm") |
| courage-3 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex, data = x) |
| courage-4 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex, data = x) |>
tidy(conf.int = TRUE) |
| courage-5 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex, data = x) |
| courage-6 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ race, data = x) |>
tidy(conf.int = TRUE) |
| courage-7 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex + race, data = x) |>
tidy(conf.int = TRUE) |
| courage-8 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex + race*zone, data = x) |>
tidy(conf.int = TRUE) |
| courage-9 |
exercise |
fit_stops |
| courage-10 |
question |
> x <- stops |>
+ filter(race %in% c("black", "white")) |>
+ mutate(race = str_to_title(race),
+ sex = str_to_title(sex))
+
+ fit_stops <- linear_reg() |>
+ set_engine("lm") |>
+ fit(arrested ~ sex + race*zone, data = x)
> |
| courage-11 |
question |
> library(easystats)
# Attaching packages: easystats 0.7.5
✔ bayestestR 0.16.1 ✔ correlation 0.8.8
✔ datawizard 1.2.0 ✔ effectsize 1.0.1
✔ insight 1.3.1 ✔ modelbased 0.12.0
✔ performance 0.15.0 ✔ parameters 0.27.0
✔ report 0.6.1 ✔ see 0.11.0
Warning message:
package ‘easystats’ was built under R version 4.5.1
> |
| courage-12 |
question |
> check_predictions(extract_fit_engine(fit_stops))
> |
| courage-13 |
question |
$$
\widehat{\text{arrested}} = 0.177
+ 0.0614 \cdot \text{sex}_{\text{Male}}
- 0.0445 \cdot \text{race}_{\text{White}}
+ 0.0146 \cdot \text{zone}_{\text{B}}
+ 0.00610 \cdot \text{zone}_{\text{C}}
+ 0.0781 \cdot \text{zone}_{\text{D}}
+ 0.00190 \cdot \text{zone}_{\text{E}}
- 0.00271 \cdot \text{zone}_{\text{F}}
+ 0.0309 \cdot \text{zone}_{\text{G}}
+ 0.0757 \cdot \text{zone}_{\text{H}}
+ \text{(interaction terms for race and zone)}
$$ |
| courage-14 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
#| cache: true
x <- stops |>
filter(race %in% c("black", "white")) |>
mutate(
race = str_to_title(race),
sex = str_to_title(sex),
arrested = factor(arrested, levels = c(0, 1)) # Convert to factor
)
fit_stops <-
logistic_reg() |>
fit(arrested ~ sex + race + zone + sex:race, data = x)
> |
| courage-15 |
question |
> tutorial.helpers::show_file(".gitignore")
stops_files
*_cache
> |
| courage-16 |
exercise |
tidy(fit_stops, conf.int = TRUE) |
| courage-17 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
# Get tidy results with confidence intervals
tidy_table <- tidy(fit_stops, conf.int = TRUE) %>%
select(term, estimate, conf.low, conf.high)
# Make a nice-looking table with gt
tidy_table %>%
gt() %>%
tab_header(
title = "Model Estimates with Confidence Intervals",
subtitle = "Logistic regression on arrest probability"
) %>%
fmt_number(
columns = c(estimate, conf.low, conf.high),
decimals = 3
) %>%
cols_label(
term = "Variable",
estimate = "Estimate",
conf.low = "Lower CI",
conf.high = "Upper CI"
)
> |
| courage-18 |
question |
We model the probability of being arrested during a traffic stop (arrested vs. not arrested) as a logistic function of driver sex, race, and the zone of the stop, including interactions between race and zone. |
| temperance-1 |
question |
Temperance is using the data generating mechanism created in courage to answer our initially proposed question. |
| temperance-2 |
question |
Males typically get pulled over 0.06 more than the average of 0.177 of any other gender. |
| temperance-3 |
question |
White people are 0.04 less likely to get arrested than any other race |
| temperance-4 |
question |
The average chance of getting pulled over for a black woman |
| temperance-5 |
question |
> library(marginaleffects)
Please cite the software developers who make your work possible.
One package: citation("package_name")
All project packages: softbib::softbib()
> |
| temperance-6 |
question |
What is the difference in arrest rate between Black and White drivers adjusting for other variables? |
| temperance-7 |
question |
> predictions(fit_stops$fit)
Estimate Pr(>|z|) S 2.5 % 97.5 %
0.182 <0.001 Inf 0.177 0.187
0.146 <0.001 Inf 0.141 0.152
0.245 <0.001 Inf 0.236 0.254
0.146 <0.001 Inf 0.141 0.152
0.257 <0.001 529.5 0.243 0.272
--- 378457 rows omitted. See ?print.marginaleffects ---
0.209 <0.001 Inf 0.203 0.215
0.275 <0.001 Inf 0.269 0.282
0.275 <0.001 Inf 0.269 0.282
0.275 <0.001 Inf 0.269 0.282
0.185 <0.001 Inf 0.176 0.194
Type: invlink(link)
>
The |
| temperance-8 |
question |
> plot_predictions(fit_stops$fit, by = "sex")
> |
| temperance-9 |
question |
> plot_predictions(fit_stops$fit, condition = "sex")
> |
| temperance-10 |
question |
> plot_predictions(fit_stops$fit, condition = c("sex", "race"))
> |
| temperance-11 |
question |
plot_predictions(fit_stops$fit, condition = c("sex", "race")) +
labs(
title = "Predicted Arrest Probabilities by Sex and Race",
subtitle = "Black drivers show consistently higher arrest probabilities than White drivers across sexes",
caption = "Source: Open Policing Project, Stanford University",
x = "Driver Sex",
y = "Predicted Probability of Arrest"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(margin = margin(b = 10)),
axis.title = element_text(face = "bold"),
legend.title = element_blank()
) |
| temperance-12 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
plot_predictions(fit_stops$fit, condition = c("sex", "race")) +
labs(
title = "Predicted Arrest Probabilities by Sex and Race",
subtitle = "Black drivers show consistently higher arrest probabilities than White drivers across sexes",
caption = "Source: Open Policing Project, Stanford University",
x = "Driver Sex",
y = "Predicted Probability of Arrest"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(margin = margin(b = 10)),
axis.title = element_text(face = "bold"),
legend.title = element_blank()
) |
| temperance-13 |
question |
The predicted probability of arrest for Black male drivers is about 0.33 (95% CI: 0.32–0.34), which is roughly 6 percentage points higher than that for White male drivers in similar zones. |
| temperance-14 |
question |
The estimates may be inaccurate because the dataset excludes over three million stops, which could bias results if the removed cases differ systematically by race or sex. A more conservative alternative might place the probability for Black male drivers closer to 0.30 with a wider 95% CI of 0.28–0.33 to reflect this uncertainty. |
| temperance-15 |
question |
> tutorial.helpers::show_file("stops.qmd")
---
title: "Stops"
format: html
author : "Jishnu Veerapaneni"
execute:
echo: false
---
```{r}
#| message: false
#| warning: false
library(tidyverse)
library(primer.data)
library(tidymodels)
library(broom)
library(easystats)
library(dplyr)
library(broom)
library(gt) # for pretty tables
library(marginaleffects)
```
```{r}
#| cache: true
x <- stops |>
filter(race %in% c("black", "white")) |>
mutate(
race = str_to_title(race),
sex = str_to_title(sex),
arrested = factor(arrested, levels = c(0, 1)) # Convert to factor
)
fit_stops <-
logistic_reg() |>
fit(arrested ~ sex + race + zone + sex:race, data = x)
```
```{r}
plot_predictions(fit_stops$fit, condition = c("sex", "race")) +
labs(
title = "Predicted Arrest Probabilities by Sex and Race",
subtitle = "Black drivers show consistently higher arrest probabilities than White drivers across sexes",
caption = "Source: Open Policing Project, Stanford University",
x = "Driver Sex",
y = "Predicted Probability of Arrest"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(margin = margin(b = 10)),
axis.title = element_text(face = "bold"),
legend.title = element_blank()
)
```
We examine traffic stops across the United States and how race may vary in the arrest rate to improve police accountability and transparency. Using data from the Open Policing Project by Standford University, we attempt to see what the difference in arrest rate between Black and White drivers are while adjusting for other covariates. A potential weakness in our model is that the dataset removed almost 3.1 million entries from the real data, reducing it to only about 400,000 entries, this deletion of the entries may lead to a misrepresentation of the population, as the possibility exists that the remaining/current data is only from select areas with selected and specific conditions present, and could be from biased officers who are more likely to arrest drivers compared to other officers in the zone. We model the probability of being arrested during a traffic stop (arrested vs. not arrested) as a logistic function of driver sex, race, and the zone of the stop, including interactions between race and zone. The predicted probability of arrest for Black male drivers is about 0.33 (95% CI: 0.32–0.34), which is nearly 6 percentage points higher than that for White male drivers in similar zones.
This is my data generating mechanism :
$$
\widehat{\text{arrested}} = 0.177
+ 0.0614 \cdot \text{sex}_{\text{Male}}
- 0.0445 \cdot \text{race}_{\text{White}}
+ 0.0146 \cdot \text{zone}_{\text{B}}
+ 0.00610 \cdot \text{zone}_{\text{C}}
+ 0.0781 \cdot \text{zone}_{\text{D}}
+ 0.00190 \cdot \text{zone}_{\text{E}}
- 0.00271 \cdot \text{zone}_{\text{F}}
+ 0.0309 \cdot \text{zone}_{\text{G}}
+ 0.0757 \cdot \text{zone}_{\text{H}}
+ \text{(interaction terms for race and zone)}
$$
```{r}
# Get tidy results with confidence intervals
tidy_table <- tidy(fit_stops$fit, conf.int = TRUE) |>
select(term, estimate, conf.low, conf.high)
# Make a nice-looking table with gt
tidy_table %>%
gt() %>%
tab_header(
title = "Model Estimates with Confidence Intervals",
subtitle = "Logistic regression on arrest probability"
) %>%
fmt_number(
columns = c(estimate, conf.low, conf.high),
decimals = 3
) %>%
cols_label(
term = "Variable",
estimate = "Estimate",
conf.low = "Lower CI",
conf.high = "Upper CI"
)
```
> |
| temperance-16 |
question |
https://jishnuvee.github.io/stops/ |
| temperance-17 |
question |
https://github.com/jishnuvee/stops |
| minutes |
question |
120 |