| tutorial-id |
none |
131-stops |
| name |
question |
Faran Abbas |
| email |
question |
faranabbas@hotmail.com |
| ID |
question |
Faran Abbas |
| introduction-1 |
question |
Wisdom
Justice
Courage
Temperance |
| introduction-2 |
question |
> show_file(".gitignore")
/.quarto/
stops_files
> |
| introduction-3 |
question |
> show_file("stops.qmd", chunk = "Last")
library(tidyverse)
library(primer.data)
Warning message:
In readLines(path) : incomplete final line found on 'stops.qmd'
> |
| introduction-4 |
question |
> library(tidyverse)
> |
| introduction-5 |
question |
This data is from the Stanford Open Policing Project, which aims to improve police accountability and transparency by providing data on traffic stops across the United States. The New Orleans dataset includes detailed information about traffic stops conducted by the New Orleans Police Department. |
| introduction-6 |
question |
A causal effect is the difference between two potential outcomes: one in the world where the cause happens, and one in the world where it doesn’t. |
| introduction-7 |
question |
The fundamental problem of causal inference is that we can never observe both potential outcomes for the same unit; we can’t see what would have happened if things had been different. |
| introduction-8 |
question |
arrest_made |
| introduction-9 |
question |
A binary, manipulable variable could be officer_bodycam_on (1 = camera on, 0 = camera off), which can be controlled by requiring officers to activate body cameras during all stops. |
| introduction-10 |
question |
There are two potential outcomes for each arrest: one if mask = 1 (e.g., bodycam on) and one if mask = 0 (e.g., bodycam off), because each value of a binary treatment implies its own possible arrest outcome. |
| introduction-11 |
question |
Let mask = 1 mean bodycam on, and mask = 0 mean bodycam off; for one driver, suppose the potential outcome is no arrest if mask = 1 and an arrest if mask = 0, so the causal effect is arrest_0 − arrest_1 = 1 − 0 = 1, meaning the bodycam prevented an arrest. |
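The arithmetic in the answer above can be written out as a short worked equation. The notation Y(1) and Y(0) for the potential outcomes under mask = 1 and mask = 0 is an assumption added here for illustration. The subtraction order (control minus treatment) follows the answer; many texts subtract the other way and would report -1 for the same driver.

```latex
\[
Y(1) = 0 \ (\text{no arrest}), \qquad
Y(0) = 1 \ (\text{arrest}), \qquad
Y(0) - Y(1) = 1 - 0 = 1
\]
```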
| introduction-12 |
question |
The variable reason_for_stop likely has an important connection to arrested, as serious violations may increase arrest likelihood. |
| introduction-13 |
question |
Two groups could be Black drivers and White drivers, who might have different average arrest rates during traffic stops. |
| introduction-14 |
question |
Can we predict whether a driver is arrested during a traffic stop based on their race? |
| wisdom-1 |
question |
Exploratory Data Analysis (EDA)
Preceptor Table
Validity |
| wisdom-2 |
question |
Define the causal structure clearly, including potential outcomes and treatment assignments, even if you can’t observe them directly. |
| wisdom-3 |
question |
A Preceptor Table is a tool for causal reasoning that lays out, for each unit, the potential outcomes under different values of a manipulable treatment (or covariate).
Units: The individual entities we're making inferences about, such as people, cities, or time points.
Covariates: Attributes that vary across units and may affect outcomes; one is designated as the treatment and is assumed to be manipulable.
Potential Outcomes: For each unit, the outcomes under both the treated and untreated conditions, even if only one is observed.
Treatment Assignment: Whether the unit actually received treatment or control. |
| wisdom-4 |
question |
The units are individual traffic stops involving specific drivers. |
| wisdom-5 |
question |
The outcome variable is whether an arrest was made during the traffic stop (`arrested`). |
| wisdom-6 |
question |
A covariate could be the driver's prior criminal record. |
| wisdom-7 |
question |
The treatments are hypothetical manipulations like activating a body camera (bodycam_on = 1) versus not (bodycam_on = 0). |
| wisdom-8 |
question |
The Preceptor Table refers to the moment when the arrest decision is made during a traffic stop. |
| wisdom-9 |
question |
The Preceptor Table shows each traffic stop’s arrest outcome alongside the driver’s race and other covariates to analyze arrest patterns and fairness. |
| wisdom-10 |
question |
Are Black drivers more likely than White drivers to be arrested during traffic stops after controlling for age, gender, reason, location, and time? |
| wisdom-11 |
question |
Arrests during traffic stops can reflect broader patterns of justice and inequality influenced by factors like race. Using Stanford Open Policing Project data on traffic stops made by the New Orleans Police Department, we investigate whether Black drivers face higher arrest rates than White drivers after accounting for age, gender, and stop reasons. |
| justice-1 |
question |
Population Table
Stability
Representativeness
Unconfoundedness |
| justice-2 |
question |
Validity concerns the relationship between the columns in the Preceptor Table and the columns in the data. |
| justice-3 |
question |
The assumption of validity might fail if the `arrested` column contains errors or omissions, such as arrests that were made but not recorded. |
| justice-4 |
question |
The Population Table includes rows from three sources: the Preceptor Table, the actual data, and all other members of the population. |
| justice-5 |
question |
Each row represents a single traffic stop (unit) at a specific date and time during 2023. |
| justice-6 |
question |
Stability means assuming that the relationships we see in our data, such as how treatment affects the outcome, stay consistent across time, context, and the broader population we're analyzing. |
| justice-7 |
question |
Stability might fail if policing practices or arrest policies changed during 2023, altering arrest probabilities over time. |
| justice-8 |
question |
Representativeness means the data accurately reflects the characteristics of the broader population we want to study or make decisions about. |
| justice-9 |
question |
Representativeness might fail if the data only includes stops from certain neighborhoods or times, missing parts of the overall population. |
| justice-10 |
question |
Representativeness may fail if the Preceptor Table excludes certain stops or groups present in the Population, causing biased inference. |
| justice-11 |
question |
Unconfoundedness means that all variables affecting both the treatment and outcome are measured, so there are no hidden confounders biasing the results. |
| justice-12 |
question |
> library(tidymodels)
── Attaching packages ─────────────────────────────────── tidymodels 1.3.0 ──
✔ broom 1.0.8 ✔ rsample 1.3.0
✔ dials 1.4.0 ✔ tune 1.3.0
✔ infer 1.0.8 ✔ workflows 1.2.0
✔ modeldata 1.4.0 ✔ workflowsets 1.1.1
✔ parsnip 1.3.2 ✔ yardstick 1.3.2
✔ recipes 1.3.1
── Conflicts ────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org |
| justice-13 |
question |
> library(broom)
> |
| justice-14 |
question |
Y = f(X_1, X_2, \ldots, X_p) + \varepsilon |
| justice-15 |
question |
A potential weakness of the model is that unmeasured confounders may bias the estimated effect of race on arrest likelihood. |
| courage-1 |
question |
Courage means committing to a plausible model, testing its limits with transparency, and tracing a believable data-generating story even when uncertainty looms. |
| courage-2 |
exercise |
linear_reg(engine = "lm") |
| courage-3 |
exercise |
linear_reg(engine = "lm") |>
  fit(arrested ~ sex, data = x) |
| courage-4 |
exercise |
linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex, data = x) |>
tidy(conf.int = TRUE) |
| courage-5 |
exercise |
linear_reg() |>
set_engine("lm") |>
fit(arrested ~ race, data = x) |
| courage-6 |
exercise |
linear_reg() |>
set_engine("lm") |>
fit(arrested ~ race, data = x) |>
tidy(conf.int = TRUE) |
| courage-7 |
exercise |
linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race, data = x) |>
tidy(conf.int = TRUE) |
| courage-8 |
exercise |
linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race*zone, data = x) |>
tidy(conf.int = TRUE) |
| courage-9 |
exercise |
fit_stops |
| courage-10 |
question |
> x <- stops |>
+ filter(race %in% c("black", "white")) |>
+ mutate(race = str_to_title(race),
+ sex = str_to_title(sex))
+
+ fit_stops <- linear_reg() |>
+ set_engine("lm") |>
+ fit(arrested ~ sex + race*zone, data = x)
> x <- stops |>
+ filter(race %in% c("black", "white")) |>
+ mutate(race = str_to_title(race),
+ sex = str_to_title(sex))
> fit_stops <- linear_reg() |>
+ set_engine("lm") |>
+ fit(arrested ~ sex + race*zone, data = x)
> |
| courage-11 |
question |
> library(easystats)
# Attaching packages: easystats 0.7.5
✔ bayestestR 0.16.1 ✔ correlation 0.8.8
✔ datawizard 1.2.0 ✔ effectsize 1.0.1
✔ insight 1.3.1 ✔ modelbased 0.12.0
✔ performance 0.15.0 ✔ parameters 0.27.0
✔ report 0.6.1 ✔ see 0.11.0
> |
| courage-12 |
question |
> check_predictions(extract_fit_engine(fit_interact))
+
> |
| courage-13 |
question |
\[
\hat{Y} = 0.177
+ 0.061 \cdot \text{Male}
- 0.045 \cdot \text{White}
+ 0.015 \cdot \text{ZoneB}
+ 0.006 \cdot \text{ZoneC}
+ 0.078 \cdot \text{ZoneD}
+ 0.002 \cdot \text{ZoneE}
- 0.003 \cdot \text{ZoneF}
+ 0.031 \cdot \text{ZoneG}
+ 0.076 \cdot \text{ZoneH}
\] |
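As a reading aid, the fitted equation above can be evaluated by hand for a few hypothetical drivers. This sketch assumes the reference levels are Female, Black, and zone A (the levels with no term of their own) and uses the rounded coefficients, so the values are approximate.

```latex
\begin{align*}
\text{Female, Black, zone A:} \quad & \hat{Y} = 0.177 \\
\text{Male, Black, zone A:}   \quad & \hat{Y} = 0.177 + 0.061 = 0.238 \\
\text{Male, White, zone A:}   \quad & \hat{Y} = 0.177 + 0.061 - 0.045 = 0.193 \\
\text{Female, Black, zone D:} \quad & \hat{Y} = 0.177 + 0.078 = 0.255
\end{align*}
```

Each line adds the coefficient for every indicator that equals 1 for that driver to the intercept; indicators that equal 0 drop out.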
| courage-14 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
#| cache: true
fit_stops <- linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race + zone, data = x) |
| courage-15 |
question |
> tutorial.helpers::show_file(".gitignore")
/.quarto/
stops_files
*_cache |
| courage-16 |
exercise |
tidy(fit_stops, conf.int = TRUE) |
| courage-17 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
#| label: model-table
#| echo: false
fit_stops |>
tidy(conf.int = TRUE) |>
select(term, estimate, conf.low, conf.high) |>
mutate(across(where(is.numeric), ~ round(., 3))) |>
gt() |>
tab_header(
title = "Model Estimates with 95% Confidence Intervals"
) |>
cols_label(
term = "Term",
estimate = "Estimate",
conf.low = "Lower 95% CI",
conf.high = "Upper 95% CI"
)
> |
| courage-18 |
question |
We model the likelihood of arrest during a traffic stop, a binary outcome, as a linear function of the driver’s sex, race, and the zone in which the stop occurred. |
| temperance-1 |
question |
Temperance in data science is the virtue of restraint: knowing when not to make claims your model can’t support, and when to stop pursuing precision that outstrips your data’s meaning. |
| temperance-2 |
question |
The estimate of **0.06 for sexMale** means that, holding race and zone constant, **being male is associated with a 6 percentage point higher probability of arrest** during a traffic stop compared to being female. |
| temperance-3 |
question |
The estimate of -0.04 for raceWhite means that, holding sex and zone constant, being White is associated with a 4 percentage point lower probability of arrest during a traffic stop compared to Black drivers, the only other race group left in the data after filtering. |
| temperance-4 |
question |
The intercept estimate of 0.18 means that, for the reference group (individuals who are not male, not White, and in zone A), the baseline probability of arrest during a traffic stop is approximately 18%. |
| temperance-5 |
question |
> library(marginaleffects)
> |
| temperance-6 |
question |
How do demographic characteristics (like sex and race) and geographic location (zone) affect the probability of being arrested during a traffic stop? |
| temperance-7 |
question |
> predictions(fit_stops)
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
0.183 0.00282 64.8 <0.001 Inf 0.177 0.188
0.135 0.00292 46.3 <0.001 Inf 0.130 0.141
0.246 0.00409 60.1 <0.001 Inf 0.238 0.254
0.135 0.00292 46.3 <0.001 Inf 0.130 0.141
0.262 0.00737 35.6 <0.001 918.5 0.248 0.277
--- 378457 rows omitted. See ?print.marginaleffects ---
0.212 0.00331 64.0 <0.001 Inf 0.205 0.218
0.274 0.00316 86.7 <0.001 Inf 0.267 0.280
0.274 0.00316 86.7 <0.001 Inf 0.267 0.280
0.274 0.00316 86.7 <0.001 Inf 0.267 0.280
0.185 0.00515 35.9 <0.001 936.9 0.175 0.195
Type: numeric
> |
| temperance-8 |
question |
> plot_predictions(fit_stops, by = "sex")
> |
| temperance-10 |
question |
> plot_predictions(fit_stops, condition = c("sex", "race"))
> |
| temperance-11 |
question |
plot_predictions(fit_stops, by = c("race", "sex")) +
labs(
title = "Predicted Probability of Arrest by Race and Sex",
subtitle = "Arrest probabilities vary significantly across race and sex groups",
x = "Group",
y = "Predicted Probability of Arrest",
caption = "Source: City Police Department Traffic Stop Data"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(margin = margin(b = 10)),
plot.caption = element_text(size = 10),
axis.title.y = element_text(margin = margin(r = 10))
) |
| temperance-12 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
library(ggplot2)
library(marginaleffects)
# Generate the predictions object (assuming you already fit the model as `fit_stops`)
preds <- predictions(fit_stops)
# Create the plot
plot_predictions(fit_stops, by = c("race", "sex")) +
labs(
title = "Predicted Probability of Arrest by Race and Sex",
subtitle = "Arrest probabilities vary significantly across race and sex groups",
x = "Group",
y = "Predicted Probability of Arrest",
caption = "Source: City Police Department Traffic Stop Data"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(margin = margin(b = 10)),
plot.caption = element_text(size = 10),
axis.title.y = element_text(margin = margin(r = 10))
)
> |
| temperance-13 |
question |
The model estimates that being male is associated with a probability of arrest about 6 percentage points higher (95% CI: 5.9 to 6.4 percentage points), while being White is associated with a probability about 4.5 percentage points lower (95% CI: 3.2 to 5.7 percentage points lower), highlighting measurable disparities with quantified uncertainty. |
| temperance-14 |
question |
The estimates might be biased due to unmeasured confounders such as prior offenses or officer discretion, which are not included in the model. Measurement errors in variables like race or arrest recording could also affect accuracy. Additionally, a linear probability model for a binary outcome can produce predicted probabilities outside the 0 to 1 range, and its constant-variance assumption does not hold for binary data, so the reported confidence intervals may be misleading. A logistic regression model keeps predictions between 0 and 1 and could provide more realistic confidence intervals for the probabilities. |
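The logistic alternative mentioned above could be sketched as follows. This is a minimal base R illustration on simulated data, not the New Orleans stops: the data frame `sim`, the generating probabilities, and the object names `fit_lpm` and `fit_logit` are all hypothetical. It fits the same two predictors with `lm()` and with `glm(family = binomial())`, then compares predicted probabilities for the four sex-by-race groups.

```r
# Simulated data for illustration only; these are NOT the New Orleans stops.
set.seed(131)
n <- 2000
sim <- data.frame(
  sex  = sample(c("Female", "Male"), n, replace = TRUE),
  race = sample(c("Black", "White"), n, replace = TRUE)
)

# Hypothetical arrest probabilities used to generate the binary outcome.
p_true <- 0.18 + 0.06 * (sim$sex == "Male") - 0.045 * (sim$race == "White")
sim$arrested <- rbinom(n, size = 1, prob = p_true)

# Linear probability model, the approach used in the tutorial.
fit_lpm <- lm(arrested ~ sex + race, data = sim)

# Logistic regression with the same predictors.
fit_logit <- glm(arrested ~ sex + race, family = binomial(), data = sim)

# Predicted probabilities for each sex-by-race combination.
groups <- expand.grid(
  sex  = c("Female", "Male"),
  race = c("Black", "White"),
  stringsAsFactors = FALSE
)
groups$p_lpm   <- predict(fit_lpm, newdata = groups)
groups$p_logit <- predict(fit_logit, newdata = groups, type = "response")
print(groups)
```

With only two categorical predictors the two fits may give similar group probabilities; the logistic predictions are guaranteed to lie strictly between 0 and 1, which the linear fit does not promise in general.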
| temperance-15 |
question |
> tutorial.helpers::show_file("stops.qmd")
---
title: "Stops"
author: "Faran Abbas"
execute:
echo: false
---
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(primer.data)
library(tidymodels)
library(broom)
library(easystats)
library(gt)
library(marginaleffects)
library(ggplot2)
```
$$
Y = f(X_1, X_2, \ldots, X_p) + \varepsilon
$$
$$
Y \sim \text{Bernoulli}(\rho), \quad \text{with} \quad \rho = f(X_1, X_2, \ldots, X_p)
$$
```{r}
x <- stops |>
filter(race %in% c("black", "white")) |>
mutate(race = str_to_title(race),
sex = str_to_title(sex))
```
```{r}
#| cache: true
fit_stops <- linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race * zone, data = x)
```
```{r}
#| label: model-table
#| echo: false
fit_stops |>
tidy(conf.int = TRUE) |>
select(term, estimate, conf.low, conf.high) |>
mutate(across(where(is.numeric), ~ round(., 3))) |>
gt() |>
tab_header(
title = "Model Estimates with 95% Confidence Intervals"
) |>
cols_label(
term = "Term",
estimate = "Estimate",
conf.low = "Lower 95% CI",
conf.high = "Upper 95% CI"
)
```
```{r}
fit_interact <- linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race * zone, data = x)
fit_interact_tidy <- tidy(fit_interact, conf.int = TRUE)
fit_interact_tidy
```
```{r}
x <- stops |>
filter(race %in% c("black", "white")) |>
mutate(race = str_to_title(race),
sex = str_to_title(sex))
fit_stops <- linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race * zone, data = x)
```
$$
\hat{Y} = 0.177
+ 0.061 \cdot \text{Male}
- 0.045 \cdot \text{White}
+ 0.015 \cdot \text{ZoneB}
+ 0.006 \cdot \text{ZoneC}
+ 0.078 \cdot \text{ZoneD}
+ 0.002 \cdot \text{ZoneE}
- 0.003 \cdot \text{ZoneF}
+ 0.031 \cdot \text{ZoneG}
+ 0.076 \cdot \text{ZoneH}
$$
```{r}
#| cache: true
fit_stops <- linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race + zone, data = x)
```
```{r}
#| cache: true
fit_stops <- linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race * zone, data = x)
```
```{r}
plot_predictions(fit_stops, by = c("race", "sex")) +
labs(
title = "Predicted Probability of Arrest by Race and Sex",
subtitle = "Arrest probabilities vary significantly across race and sex groups",
x = "Group",
y = "Predicted Probability of Arrest",
caption = "Source: City Police Department Traffic Stop Data"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(margin = margin(b = 10)),
plot.caption = element_text(size = 10),
axis.title.y = element_text(margin = margin(r = 10))
)
```
## Summary Paragraph
We model the probability of arrest during a traffic stop, where the outcome is binary (arrest or no arrest), as a linear function of the driver’s sex, race, and the zone in which the stop occurred.
The model combines additive terms for sex, race, and zone with an interaction between race and zone.
This structure allows us to estimate how arrest likelihood changes with each predictor while holding others constant.
Although the outcome is binary, a linear model offers a simple baseline to detect potential disparities.
The model estimates that being male is associated with a probability of arrest about 6 percentage points higher (95% CI: 5.9 to 6.4 percentage points), while being White is associated with a probability about 4.5 percentage points lower (95% CI: 3.2 to 5.7 percentage points lower), highlighting measurable disparities with quantified uncertainty.
The estimates might be biased due to unmeasured confounders such as prior offenses or officer discretion, which are not included in the model. Measurement errors in variables like race or arrest recording could also affect accuracy. Additionally, using a linear probability model for a binary outcome may produce predicted probabilities outside the valid range, inflating uncertainty. A logistic regression model could provide more accurate estimates and more realistic confidence intervals for the probabilities.
Arrests during traffic stops can reflect broader patterns of justice and inequality influenced by factors like race. Using Stanford Open Policing Project data on traffic stops made by the New Orleans Police Department, we investigate whether Black drivers face higher arrest rates than White drivers after accounting for age, gender, and stop reasons.
A potential weakness of the model is that unmeasured confounders may bias the estimated effect of race on arrest likelihood.
Warning message:
In readLines(path) : incomplete final line found on 'stops.qmd' |
| temperance-16 |
question |
https://faranabbas-repo.github.io/stops/ |
| temperance-17 |
question |
https://github.com/faranabbas-repo/stops.git |
| minutes |
question |
240 |