| tutorial-id |
none |
131-stops |
| name |
question |
Faisal Jan |
| email |
question |
faisaljan6848@gmail.com |
| introduction-1 |
question |
Wisdom, Justice, Courage, and Temperance. |
| introduction-2 |
question |
> show_file(".gitignore")
stops_files
> |
| introduction-3 |
question |
> show_file("stops.qmd", chunk = "Last")
#| message: false
library(tidyverse)
library(primer.data)
> |
| introduction-4 |
question |
> library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
> |
| introduction-5 |
question |
This data is from the Stanford Open Policing Project, which aims to improve police accountability and transparency by providing data on traffic stops across the United States. The New Orleans dataset includes detailed information about traffic stops conducted by the New Orleans Police Department. |
| introduction-6 |
question |
A causal effect is the difference between two potential outcomes. |
| introduction-7 |
question |
The fundamental problem of causal inference is that we can only observe one potential outcome, and others will always be missing. |
| introduction-8 |
question |
We will use arrested variable as our outcome. |
| introduction-9 |
question |
We can use a training variable which represents whether the driver took any professional training of driving or not. It will two values: 1 if the driver was trained and 0 if he did not had any professional training. The variable is also manipulable, because people can enroll in a training or not. |
| introduction-10 |
question |
There are two potential outcomes as mask variable has two possible values. The outcome when driver was wearing a mask and the other would be when she was not wearing any mask. |
| introduction-11 |
question |
Assume the values for mask to be 1 if person wearing a mask and 0 if not. If the mask gets 1, the driver was arrested. If it was 0, then person did not get arrested. The causal effect is determined by the difference of wearing a mask and getting arrested and not wearing one and not getting arrested. |
| introduction-12 |
question |
age can be used is an important variable to examine the proportion of people getting arrest by their age. |
| introduction-13 |
question |
I believe, people of race 'Black' would have different arrest rates than people with White race. |
| introduction-14 |
question |
What is the difference in the likelihood of being arrested based on a driver's race? |
| wisdom-1 |
question |
Wisdom requires a question to begin. It involves the creation of a Preceptor Table and an examination of our data. |
| wisdom-2 |
question |
A Preceptor Table has the fewest rows and columns that, if completely filled with accurate data, would make it easier to calculate the quantity of interests. |
| wisdom-3 |
question |
The Preceptor Table has rows referred as units. It also has an outcome column, which is the quantity of interest. If the problem is causal, there will be at least two potential outcomes. Other columns are the covariates required to answer the question. For a causal model, one of the covariates is a treatment. |
| wisdom-4 |
question |
Individual drivers. |
| wisdom-5 |
question |
arrested is the outcome, which represents whether a person was arrested or not. |
| wisdom-6 |
question |
race is the main covariate. Other variables such as age and sex can also be used. |
| wisdom-7 |
question |
There is no treatment, as this is a predictive problem. |
| wisdom-8 |
question |
It refers to the time when people have been pulled over. |
| wisdom-9 |
question |
The Preceptor Table will include people who were stopped in rows. In addition to an ID column, it will have arrested as outcome to represents whether the person was arrested or not. Additionally, the table will have race, age, sex, and zone columns as covariates. |
| wisdom-10 |
question |
Does the difference in arrest rates between Black and White drivers vary by location or time of day? |
| wisdom-11 |
question |
Fair and unbiased decision-making in law enforcement is crucial, especially when outcomes like arrests may be influenced by personal characteristics such as race. Using data from about 400,000 traffic stops in New Orleans and compiled by the Stanford Open Policing Project, we seek to examine the difference in arrest rates between Black and White drivers vary by location or time of day. |
| justice-1 |
question |
Justice concerns to the Population Table and four key assumptions which underlie it; validity, stability, representativeness, and unconfoundedness. |
| justice-2 |
question |
Validity is the consistency between columns in both Preceptor Table and our dataset. The columns in both source should be same not by their names only but in meaning as well. |
| justice-3 |
question |
One reason the assumption of validity might not hold is that the "arrested" column may contain inaccuracies due to inconsistent or incomplete reporting by officers during traffic stops. If some arrests were not properly recorded or were misclassified in the data, this could compromise the reliability of the outcome variable. |
| justice-4 |
question |
A Population Table is constructed on top of population from which Preceptor Table and our dataset are drawn. It includes combination of unit/time for rows in the table. |
| justice-5 |
question |
Each row in the Population Table represents a person who can drive in the United States, as a unit. While the time is represented by the time of the stop or we can combine both time and date. |
| justice-6 |
question |
Stability refers to the relationship columns in Population Table for all rows in data set, Preceptor Table, and other rows of larger population. |
| justice-7 |
question |
Data is collected at different time across many years, during the period the traffic rules and regulations might change as well. If any change occurs in law might lead to instability of data. |
| justice-8 |
question |
Representativeness concerns with two relationships among the rows in Population Table. First, between dataset and other rows. Second, between other rows and Preceptor Table. |
| justice-9 |
question |
Our dataset contains only information about stops in New Orleans which might not representative of the entire United States. |
| justice-10 |
question |
Our Preceptor Table might not be representative to another state, because it only has data of New Orleans. |
| justice-11 |
question |
Unconfoundedness refers to assignment of treatment to be independent of the potential outcomes, when we condition on pre-treatment covariates. |
| justice-12 |
question |
> library(tidymodels)
── Attaching packages ────────────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom 1.0.8 ✔ rsample 1.3.0
✔ dials 1.4.0 ✔ tune 1.3.0
✔ infer 1.0.8 ✔ workflows 1.2.0
✔ modeldata 1.4.0 ✔ workflowsets 1.1.1
✔ parsnip 1.3.2 ✔ yardstick 1.3.2
✔ recipes 1.3.1
── Conflicts ───────────────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
> |
| justice-13 |
question |
> library(broom)
> |
| justice-14 |
question |
$$
P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}}
$$
with $Y \sim \text{Bernoulli}(\rho)$ where $\rho = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}}$ |
| justice-15 |
question |
This data is a sample of entries from the real data, which is nearly 3.1 million records. |
| courage-1 |
question |
Courage concerns with the creation of the data generating mechanism. |
| courage-2 |
exercise |
linear_reg(engine = "lm") |
| courage-3 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex, data = x) |
| courage-4 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex, data = x) |>
tidy(conf.int = TRUE) |
| courage-5 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ race, data = x) |
| courage-6 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ race, data = x) |>
tidy(conf.int = TRUE) |
| courage-7 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex + race, data = x) |
| courage-8 |
exercise |
linear_reg(engine = "lm") |>
fit(arrested ~ sex + race*zone, data = x) |
| courage-9 |
exercise |
fit_stops |
| courage-10 |
question |
> x <- stops |>
+ filter(race %in% c("black", "white")) |>
+ mutate(race = str_to_title(race),
+ sex = str_to_title(sex))
+
+ fit_stops <- linear_reg() |>
+ set_engine("lm") |>
+ fit(arrested ~ sex + race*zone, data = x)
> |
| courage-11 |
question |
> library(easystats)
# Attaching packages: easystats 0.7.5
✔ bayestestR 0.16.1 ✔ correlation 0.8.8
✔ datawizard 1.2.0 ✔ effectsize 1.0.1
✔ insight 1.3.1 ✔ modelbased 0.12.0
✔ performance 0.15.0 ✔ parameters 0.27.0
✔ report 0.6.1 ✔ see 0.11.0
Warning message:
package ‘easystats’ was built under R version 4.5.1
> |
| courage-12 |
question |
> check_predictions(extract_fit_engine(fit_stops))
> |
| courage-13 |
question |
\begin{align*}
\hat{\text{arrested}} =\ & 0.177
+ 0.0614 \cdot \text{sexMale}
- 0.0445 \cdot \text{raceWhite} \\
&+ 0.0146 \cdot \text{zoneB}
+ 0.00610 \cdot \text{zoneC}
+ 0.0781 \cdot \text{zoneD} \\
&+ 0.00190 \cdot \text{zoneE}
- 0.00271 \cdot \text{zoneF}
+ 0.0309 \cdot \text{zoneG}
+ 0.0757 \cdot \text{zoneH}
\end{align*} |
| courage-14 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
#| cache: true
x <- stops |>
filter(race %in% c("black", "white")) |>
mutate(race = str_to_title(race),
sex = str_to_title(sex))
fit_stops <- linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race*zone, data = x)
> |
| courage-15 |
question |
> tutorial.helpers::show_file(".gitignore")
stops_files
*_cache
> |
| courage-16 |
exercise |
fit_stops |> tidy(conf.int = TRUE) |
| courage-17 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
model_tidy <- fit_stops |> tidy(conf.int = TRUE)
# Generate the LaTeX table
model_tidy |>
select(term, estimate, conf.low, conf.high) |>
mutate(across(where(is.numeric), ~ round(.x, 4))) |>
gt() |>
cols_label(
term = "Variable",
estimate = "Estimate",
conf.low = "95% CI Lower",
conf.high = "95% CI Upper"
) |>
tab_header(
title = "Regression Estimates for Arrest Model"
) |>
fmt_number(
columns = c(estimate, conf.low, conf.high),
decimals = 4
) |>
opt_table_font(
font = list(gt::google_font("Lato"), "sans-serif")
)
> |
| courage-18 |
question |
We model arrested, as a linear function of sex and by the interaction between race and zone. |
| temperance-1 |
question |
Temperance uses the data generating mechanism which we created. Humility reminds us that any answer it produces is ultimately a lie. |
| temperance-2 |
question |
The point estimate 0.06 for sexMale represents the probability of arrest rate among male, which is 0.06 higher than female drivers. This means, men are more likely to get arrested than women. |
| temperance-3 |
question |
When comparing White and Black drivers, White have -0.04 less value for arrested. It means White drivers are less likely to get arrested, relative to Black drivers, adjusting for other variables in the model. |
| temperance-4 |
question |
The 0.18 estimate for Intercept represents the value of arrested for Black women drivers. |
| temperance-5 |
question |
> library(marginaleffects)
Please cite the software developers who make your work possible.
One package: citation("package_name")
All project packages: softbib::softbib()
> |
| temperance-6 |
question |
What is the difference in arrest rate between Black and White drivers adjusting for other variables? |
| temperance-7 |
question |
> predictions(fit_stops)
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
0.179 0.00343 52.2 <0.001 Inf 0.173 0.186
0.142 0.00419 33.8 <0.001 828.0 0.133 0.150
0.250 0.00451 55.5 <0.001 Inf 0.241 0.259
0.142 0.00419 33.8 <0.001 828.0 0.133 0.150
0.232 0.01776 13.1 <0.001 127.6 0.198 0.267
--- 378457 rows omitted. See ?print.marginaleffects ---
0.208 0.00390 53.4 <0.001 Inf 0.201 0.216
0.270 0.00377 71.5 <0.001 Inf 0.262 0.277
0.270 0.00377 71.5 <0.001 Inf 0.262 0.277
0.270 0.00377 71.5 <0.001 Inf 0.262 0.277
0.189 0.00545 34.7 <0.001 874.0 0.179 0.200
Type: numeric
> |
| temperance-8 |
question |
> plot_predictions(fit_stops, by = "sex")
> |
| temperance-9 |
question |
> plot_predictions(fit_stops, condition = "sex")
> |
| temperance-10 |
question |
> plot_predictions(fit_stops, condition = c("sex", "race"))
> |
| temperance-11 |
question |
plot_predictions(fit_stops, condition = c("sex", "race"), draw = FALSE) |>
mutate(
group = paste(sex, race),
group = factor(
group,
levels = c("Female Black", "Female White", "Male Black", "Male White")
)
) |>
ggplot(aes(x = group, y = estimate, fill = race)) +
geom_col(width = 0.6, color = "black") +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
labs(
title = "Predicted Probability of Arrest by Race and Sex",
subtitle = "Black drivers—especially men—face higher predicted arrest rates during traffic stops",
caption = "Source: Stanford Open Policing Project (New Orleans Data)",
x = "Driver Group",
y = "Predicted Probability of Arrest"
) +
scale_fill_manual(values = c("Black" = "#1f77b4", "White" = "#ff7f0e")) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 18),
plot.subtitle = element_text(size = 12, margin = margin(b = 10)),
axis.title.y = element_text(margin = margin(r = 10)),
axis.title.x = element_text(margin = margin(t = 10)),
legend.position = "none"
) |
| temperance-12 |
question |
> tutorial.helpers::show_file("stops.qmd", chunk = "Last")
# Plot
plot_predictions(fit_stops, condition = c("sex", "race"), draw = FALSE) |>
mutate(
group = paste(sex, race),
group = factor(
group,
levels = c("Female Black", "Female White", "Male Black", "Male White")
)
) |>
ggplot(aes(x = group, y = estimate, fill = race)) +
geom_col(width = 0.6, color = "black") +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
labs(
title = "Predicted Probability of Arrest by Race and Sex",
subtitle = "Black drivers—especially men—face higher predicted arrest rates during traffic stops",
caption = "Source: Stanford Open Policing Project (New Orleans Data)",
x = "Driver Group",
y = "Predicted Probability of Arrest"
) +
scale_fill_manual(values = c("Black" = "#1f77b4", "White" = "#ff7f0e")) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 18),
plot.subtitle = element_text(size = 12, margin = margin(b = 10)),
axis.title.y = element_text(margin = margin(r = 10)),
axis.title.x = element_text(margin = margin(t = 10)),
legend.position = "none"
)
> |
| temperance-13 |
question |
In Zone W, the predicted probability of arrest for Black male drivers is approximately 35.3% (95% CI: 33.5% to 37.1%), compared to just 28.1% (95% CI: 26.3% to 29.9%) for White male drivers, making it the zone with the highest overall arrest likelihood and one of the largest racial disparities. |
| temperance-14 |
question |
Because, we are using data for a specific state (New Orleans), which does not guarantee to be same across other states. Other factors like age and education level also can be used for this consideration. |
| temperance-15 |
question |
> tutorial.helpers::show_file("stops.qmd")
---
title: "Stops"
author: "Faisal Jan"
format: html
execute:
echo: false
---
```{r}
#| message: false
library(tidyverse)
library(primer.data)
library(tidymodels)
library(broom)
library(gt)
library(marginaleffects)
```
```{r}
#| cache: true
x <- stops |>
filter(race %in% c("black", "white")) |>
mutate(race = str_to_title(race), sex = str_to_title(sex))
fit_stops <- linear_reg() |>
set_engine("lm") |>
fit(arrested ~ sex + race * zone, data = x)
```
```{r}
# Plot
plot_predictions(fit_stops, condition = c("sex", "race"), draw = FALSE) |>
mutate(
group = paste(sex, race),
group = factor(
group,
levels = c("Female Black", "Female White", "Male Black", "Male White")
)
) |>
ggplot(aes(x = group, y = estimate, fill = race)) +
geom_col(width = 0.6, color = "black") +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
labs(
title = "Predicted Probability of Arrest by Race and Sex",
subtitle = "Black drivers—especially men—face higher predicted arrest rates during traffic stops",
caption = "Source: Stanford Open Policing Project (New Orleans Data)",
x = "Driver Group",
y = "Predicted Probability of Arrest"
) +
scale_fill_manual(values = c("Black" = "#1f77b4", "White" = "#ff7f0e")) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face = "bold", size = 18),
plot.subtitle = element_text(size = 12, margin = margin(b = 10)),
axis.title.y = element_text(margin = margin(r = 10)),
axis.title.x = element_text(margin = margin(t = 10)),
legend.position = "none"
)
```
Fair and unbiased decision-making in law enforcement is crucial, especially when outcomes like arrests may be influenced by personal characteristics such as race. Using data from about 400,000 traffic stops in New Orleans and compiled by the Stanford Open Policing Project, we seek to examine the difference in arrest rates between Black and White drivers vary by location or time of day. This sample of entries from the real data, which is nearly 3.1 million records, may not fully represent the population. In Zone W, the predicted probability of arrest for Black male drivers is approximately 35.3% (95% CI: 33.5% to 37.1%), compared to just 28.1% (95% CI: 26.3% to 29.9%) for White male drivers, making it the zone with the highest overall arrest likelihood and one of the largest racial disparities.
$$
P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}}
$$
with $Y \sim \text{Bernoulli}(\rho)$ where $\rho = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}}$
<!-- Fitted Model -->
\begin{align*}
\hat{\text{arrested}} =\ & 0.177
+ 0.0614 \cdot \text{sexMale}
- 0.0445 \cdot \text{raceWhite} \\
&+ 0.0146 \cdot \text{zoneB}
+ 0.00610 \cdot \text{zoneC}
+ 0.0781 \cdot \text{zoneD} \\
&+ 0.00190 \cdot \text{zoneE}
- 0.00271 \cdot \text{zoneF}
+ 0.0309 \cdot \text{zoneG}
+ 0.0757 \cdot \text{zoneH}
\end{align*}
```{r}
model_tidy <- fit_stops |> tidy(conf.int = TRUE)
# Generate the LaTeX table
model_tidy |>
select(term, estimate, conf.low, conf.high) |>
mutate(across(where(is.numeric), ~ round(.x, 4))) |>
gt() |>
cols_label(
term = "Variable",
estimate = "Estimate",
conf.low = "95% CI Lower",
conf.high = "95% CI Upper"
) |>
tab_header(
title = "Regression Estimates for Arrest Model"
) |>
fmt_number(
columns = c(estimate, conf.low, conf.high),
decimals = 4
) |>
opt_table_font(
font = list(gt::google_font("Lato"), "sans-serif")
)
```
> |
| temperance-16 |
question |
https://faisaljanbaloch.github.io/stops/ |
| temperance-17 |
question |
https://github.com/FaisaljanBaloch/stops |
| minutes |
question |
188 |