| tutorial-id |
none |
stops |
| name |
question |
Hassan Ali |
| email |
question |
hassan.alisoni007@gmail.com |
| introduction-1 |
question |
wisdom, justice, courage, temperance |
| the-question-1 |
exercise |
library(tidyverse) |
| the-question-2 |
exercise |
library(primer.data) |
| the-question-3 |
question |
New Orleans Traffic Stops Data
Description
This data is from the Stanford Open Policing Project, which aims to improve police accountability and transparency by providing data on traffic stops across the United States. The New Orleans dataset includes detailed information about traffic stops conducted by the New Orleans Police Department. |
| the-question-4 |
question |
arrest_made |
| the-question-5 |
question |
You could create a binary variable like body_camera_on (TRUE/FALSE), representing whether the officer’s body camera was active. This can be manipulated by implementing or removing a policy requiring cameras, allowing comparison of outcomes such as arrest rates under both conditions. |
| the-question-6 |
question |
There are two potential outcomes for each driver: the arrest outcome if the driver wears a mask (mask = TRUE) and the arrest outcome if they do not (mask = FALSE). This is because, in causal inference, we consider what would happen under both conditions, even though only one is observed for each individual. |
| the-question-7 |
question |
For a single driver, the treatment variable mask can take two values: mask = TRUE (driver wears a mask) and mask = FALSE (driver does not wear a mask). Suppose if masked, the driver would not be arrested (Y(1) = 0), and if unmasked, the driver would be arrested (Y(0) = 1). The causal effect for this unit is Y(1) - Y(0) = 0 - 1 = -1, indicating wearing a mask reduces the chance of arrest. |
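The unit-level calculation above can be sketched in R. This is a minimal illustration with made-up potential outcomes for three hypothetical drivers; the data frame `po` and its column names are assumptions, not part of the stops data.

```r
# Hypothetical potential outcomes for three drivers (illustrative values only).
# y_mask:    arrest outcome if the driver wears a mask, Y(1)
# y_no_mask: arrest outcome if the driver does not wear a mask, Y(0)
po <- data.frame(
  driver    = c("A", "B", "C"),
  y_mask    = c(0, 1, 0),
  y_no_mask = c(1, 1, 0)
)

# Unit-level causal effect: Y(1) - Y(0)
po$effect <- po$y_mask - po$y_no_mask
po$effect
```

Driver A matches the example in the answer (effect of -1). In real data only one of the two outcome columns is ever observed for each driver, so the `effect` column cannot be computed directly.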
| the-question-8 |
question |
A variable like search_conducted (whether a search was performed during the stop) is likely to have a strong connection to arrested, as searches often correlate with higher chances of arrest. |
| the-question-9 |
question |
Two groups could be Black drivers and White drivers, as racial disparities in policing may lead to different average arrest rates between these groups. |
| the-question-10 |
question |
A suitable predictive question is, “Can a driver’s race be used to predict the likelihood of being arrested during a traffic stop?” |
| wisdom-1 |
question |
Wisdom in data science means recognizing the limits of what the data can tell us, questioning assumptions, and avoiding overconfidence in models. It involves carefully considering context, potential biases, and confounding factors to draw meaningful and responsible conclusions. |
| wisdom-2 |
question |
A Preceptor Table is a simple table that compares the average outcome (e.g., arrest rate) across groups defined by a key variable, such as race, without adjusting for other factors. It provides an initial, unadjusted view of group differences. |
| wisdom-3 |
question |
Preceptor Tables summarize data by grouping units based on a variable of interest, showing the average outcomes for each group. They do not adjust for other covariates, providing a raw comparison that helps identify initial patterns or differences between groups. |
| wisdom-4 |
question |
> tutorial.helpers::show_file(".gitignore")
*Rproj
> |
| wisdom-5 |
question |
The units for this problem are individual traffic stops recorded in the dataset. |
| wisdom-6 |
question |
The outcome variable for this problem is arrested, indicating whether an arrest was made during the traffic stop. |
| wisdom-7 |
question |
Useful covariates could include the driver’s age, gender, location of the stop, time of day, reason for the stop, and whether a search was conducted, as these factors may influence the likelihood of arrest. |
| wisdom-8 |
question |
For this problem, the treatment can be considered the driver’s race (e.g., Black vs. White), as we are examining how differences in this variable relate to arrest outcomes. |
| wisdom-9 |
question |
The Preceptor Table refers to the moment of each traffic stop, as it uses data captured at the time the stop and its outcome occurred. |
| wisdom-10 |
question |
A causal effect is the difference in an outcome that would occur for the same unit under two different treatment conditions, showing how changing the treatment directly causes a change in the outcome. |
| wisdom-11 |
question |
The fundamental problem of causal inference is that we can observe only one outcome for each unit—either under treatment or control—not both, making it impossible to directly measure the true causal effect for that unit. |
| wisdom-12 |
question |
The motto applies because we cannot manipulate a driver’s race, meaning any differences in arrest rates between races may be due to confounding factors, not a clear causal effect. |
| wisdom-13 |
question |
The Preceptor Table for this problem would compare the average arrest rates between groups of drivers with different races (e.g., Black vs. White) without adjusting for other factors, giving an unadjusted view of disparities in arrests. |
| wisdom-14 |
question |
> tutorial.helpers::show_file("stops.qmd", start = -5)
```{r}
#| message: false
library(tidyverse)
library(primer.data)
```
> |
| wisdom-15 |
question |
Validity refers to how well a study or analysis measures what it is intended to measure, meaning the results accurately reflect the true relationship between variables without being distorted by bias or confounding factors. |
| wisdom-16 |
question |
The assumption of validity might not hold if the columns in the dataset contain errors or missing values, such as misreported arrests or incorrect recording of race, leading to biased results. |
| wisdom-17 |
question |
Arrest rates during traffic stops can vary across different demographic groups, influenced by factors such as race and other driver characteristics. Using data from the Stanford Open Policing Project, which includes over 400,000 traffic stops in New Orleans from 2011 to 2018, we examine whether race predicts the likelihood of being arrested. |
| wisdom-21 |
question |
> tutorial.helpers::show_file("stops.qmd", start = -5)
x <- stops |>
filter(race %in% c("black", "white")) |>
mutate(race = str_to_title(race),
sex = str_to_title(sex))
```
Warning message:
In readLines(path) : incomplete final line found on 'stops.qmd'
> |
| justice-1 |
question |
The four key components of Justice in data science are fairness (ensuring models do not discriminate), transparency (clearly explaining methods and decisions), accountability (taking responsibility for outcomes), and equity (addressing biases to promote fair treatment across all groups). |
| justice-2 |
question |
A Population Table summarizes the outcome variable across the entire population, often broken down by groups of interest, and adjusts for other covariates to give a clearer picture of differences while reducing bias. |
| justice-3 |
question |
Stability means that the relationships observed in the data remain consistent over time or across different settings, so the model’s conclusions would still hold if applied to new or slightly different data. |
| justice-4 |
question |
The assumption of stability might not hold because policing practices or policies in New Orleans could have changed over time, meaning the relationship between race and arrest rates may differ across years in the dataset. |
| justice-5 |
question |
Representativeness means the data used in the analysis accurately reflects the overall population being studied, so conclusions drawn from the sample can be generalized without significant bias. |
| justice-6 |
question |
The assumption of representativeness might fail if the dataset systematically excludes certain stops or groups, such as unreported incidents, causing the data to differ from the true population of all traffic stops. |
| justice-7 |
question |
Representativeness might not hold if the population itself differs from the groups being compared in the Preceptor Table, for example, if the racial composition or arrest practices in New Orleans are not reflective of other regions. |
| justice-8 |
question |
Unconfoundedness means that, after accounting for all relevant covariates, there are no hidden factors influencing both the treatment and the outcome, allowing us to isolate the true effect of the treatment. |
| justice-9 |
question |
So far, I have used traffic stop data from the Stanford Open Policing Project, covering over 400,000 stops in New Orleans, to explore whether race predicts the likelihood of arrest. The analysis focuses on differences in arrest rates between Black and White drivers while considering other factors. However, unmeasured confounders, such as officer discretion or unrecorded stop details, may bias the results. |
| courage-1 |
question |
The virtue of Courage in data analysis involves being willing to confront uncomfortable findings, question assumptions, and report results honestly, even if they challenge expectations or reveal inconvenient truths. |
| courage-2 |
exercise |
library(tidymodels) |
| courage-3 |
exercise |
library(broom) |
| courage-4 |
question |
$$
P(Y = 1 \mid X) = \rho = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k)}}
$$ |
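The expression above is the logistic form. For comparison, the model fitted elsewhere in this file with `linear_reg(engine = "lm")` and the formula `arrested ~ sex + race * zone` corresponds to a linear probability model, which could be written roughly as follows. The symbols $\gamma_j$ and $\delta_j$ and the indicator notation are assumptions introduced here for illustration.

$$
\text{arrested}_i = \beta_0 + \beta_1 \, \text{sex}_{\text{Male},i} + \beta_2 \, \text{race}_{\text{White},i} + \sum_{j} \gamma_j \, \text{zone}_{j,i} + \sum_{j} \delta_j \left( \text{race}_{\text{White},i} \times \text{zone}_{j,i} \right) + \varepsilon_i
$$

Here $j$ runs over the non-reference zones, and each $\delta_j$ lets the Black and White difference vary by zone.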
| courage-5 |
question |
> tutorial.helpers::show_file("stops.qmd", pattern = "library")
library(tidyverse)
library(primer.data)
> |
| courage-6 |
exercise |
linear_reg(engine = "lm") |
| courage-7 |
exercise |
linear_reg(engine = "lm") %>%
fit(arrested ~ sex + race * zone, data = x) |
| courage-8 |
exercise |
fit_stops |
| courage-9 |
question |
> tutorial.helpers::show_file("stops.qmd", start = -8)
```{r}
#| label: model
#| cache: true
fit_stops <- linear_reg(engine = "lm") %>%
fit(arrested ~ sex + race * zone, data = x)
```
> |
| courage-10 |
question |
> tutorial.helpers::show_file("stops.qmd", pattern = "extract")
extract_eq(fit_stops$fit,
> |
| courage-11 |
exercise |
library(broom)
tidy(fit_stops, conf.int = TRUE) |
| courage-12 |
question |
The estimate of 0.06 for sexMale means that, holding race and zone constant, male drivers have an arrest rate about 6 percentage points higher than female drivers. |
| courage-13 |
question |
The estimate for Zone D is 0.078, indicating a higher arrest rate than Zone F, which has an estimate of -0.003. Since the intercept (females’ average) is 0.18, the expected arrest rate in Zone F (0.18 − 0.003 ≈ 17.7%) is slightly lower than the female average, while Zone D’s rate (0.18 + 0.078 ≈ 25.8%) is higher. |
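The arithmetic in the answer above can be checked with a few lines of R. This sketch only reuses the three numbers quoted in the answer (intercept 0.18, Zone D 0.078, Zone F -0.003); the variable names are hypothetical, and the sums apply to the model's reference sex and race groups.

```r
# Coefficients as quoted in the answer above (not re-estimated here)
intercept <- 0.18
zone_d    <- 0.078
zone_f    <- -0.003

# Expected arrest rate = intercept + zone coefficient
rate_d <- intercept + zone_d
rate_f <- intercept + zone_f

round(c(zone_d = rate_d, zone_f = rate_f), 3)
```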
| courage-14 |
question |
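The estimate of -0.04 for raceWhite means that, on average and holding other factors constant, White drivers have an arrest rate about 4 percentage points lower than Black drivers, the reference race group in the filtered data. Because the model includes a race-by-zone interaction, this difference applies to the reference zone. |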
| courage-16 |
question |
I am using a linear regression model with arrested as the dependent variable and sex, race, zone, and the interaction between race and zone as predictors. The results show that being male is associated with a higher likelihood of arrest. |
| temperance-1 |
question |
Temperance in data science means practicing restraint by not over-interpreting results, avoiding overstated conclusions, and being careful to report only what the data truly supports while acknowledging its limitations. |
| temperance-2 |
exercise |
library(marginaleffects) |
| temperance-3 |
exercise |
library(tidytext) |
| temperance-4 |
question |
The general topic is how demographic factors influence arrest rates during traffic stops. The specific question is whether race, particularly being Black or White, affects the likelihood of being arrested, after adjusting for other variables. |
| temperance-5 |
question |
The object ndata should include the columns used in the model: sex, race, and zone, since these variables are required to generate predictions with the marginaleffects functions. |
| temperance-6 |
question |
The ndata object should include rows representing combinations of interest, such as: one row with a Black female in a reference zone, another with a White female in the same zone, and similar pairs for other zones. Including multiple combinations of sex, race, and zone allows comparison of predicted arrest rates across groups and helps answer the original question more fully. |
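One way to build such an object is to cross every value of the three model variables. The sketch below uses base R's `expand.grid()`; the zone labels "A" and "B" are placeholders, since the actual zone values should come from the data.

```r
# Every combination of sex, race, and zone (zone labels are hypothetical)
ndata <- expand.grid(
  sex  = c("Female", "Male"),
  race = c("Black", "White"),
  zone = c("A", "B"),
  stringsAsFactors = FALSE
)

# 2 sexes x 2 races x 2 zones = 8 rows
nrow(ndata)
ndata
```

Each Black and White pair that shares the same sex and zone gives one comparison of predicted arrest rates.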
| temperance-7 |
question |
> tutorial.helpers::show_file("stops.qmd", pattern = "marginaleffects|tidytext")
library(tidytext)
library(marginaleffects)
> |
| temperance-8 |
exercise |
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex")) |
| temperance-9 |
exercise |
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex"),
draw = FALSE) %>%
as_tibble() |
| temperance-10 |
exercise |
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex"),
draw = FALSE) %>%
as_tibble() %>%
group_by(zone, sex) %>%
mutate(sort_order = estimate[race == "Black"]) %>%
ungroup() |
| temperance-11 |
exercise |
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex"),
draw = FALSE) %>%
as_tibble() %>%
group_by(zone, sex) %>%
mutate(sort_order = estimate[race == "Black"]) %>%
ungroup() %>%
mutate(zone = reorder_within(zone, sort_order, sex)) |
| temperance-12 |
exercise |
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex"),
draw = FALSE) %>%
as_tibble() %>%
group_by(zone, sex) %>%
mutate(sort_order = estimate[race == "Black"]) %>%
ungroup() %>%
mutate(zone = reorder_within(zone, sort_order, sex)) %>%
ggplot(aes(x = zone, color = race)) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
width = 0.2,
position = position_dodge(width = 0.5)) |
| temperance-13 |
exercise |
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex"),
draw = FALSE) %>%
as_tibble() %>%
group_by(zone, sex) %>%
mutate(sort_order = estimate[race == "Black"]) %>%
ungroup() %>%
mutate(zone = reorder_within(zone, sort_order, sex)) %>%
ggplot(aes(x = zone, color = race)) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
width = 0.2,
position = position_dodge(width = 0.5)) +
geom_point(aes(y = estimate),
size = 1,
position = position_dodge(width = 0.5)) |
| temperance-14 |
exercise |
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex"),
draw = FALSE) %>%
as_tibble() %>%
group_by(zone, sex) %>%
mutate(sort_order = estimate[race == "Black"]) %>%
ungroup() %>%
mutate(zone = reorder_within(zone, sort_order, sex)) %>%
ggplot(aes(x = zone, color = race)) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
width = 0.2,
position = position_dodge(width = 0.5)) +
geom_point(aes(y = estimate),
size = 1,
position = position_dodge(width = 0.5)) +
facet_wrap(~sex, scales = "free_x") |
| temperance-15 |
exercise |
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex"),
draw = FALSE) %>%
as_tibble() %>%
group_by(zone, sex) %>%
mutate(sort_order = estimate[race == "Black"]) %>%
ungroup() %>%
mutate(zone = reorder_within(zone, sort_order, sex)) %>%
ggplot(aes(x = zone, color = race)) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
width = 0.2,
position = position_dodge(width = 0.5)) +
geom_point(aes(y = estimate),
size = 1,
position = position_dodge(width = 0.5)) +
facet_wrap(~sex, scales = "free_x") +
scale_x_reordered() +
theme(axis.text.x = element_text(size = 8)) |
| temperance-16 |
exercise |
library(scales)
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex"),
draw = FALSE) %>%
as_tibble() %>%
group_by(zone, sex) %>%
mutate(sort_order = estimate[race == "Black"]) %>%
ungroup() %>%
mutate(zone = reorder_within(zone, sort_order, sex)) %>%
ggplot(aes(x = zone, color = race)) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
width = 0.2,
position = position_dodge(width = 0.5)) +
geom_point(aes(y = estimate),
size = 1,
position = position_dodge(width = 0.5)) +
facet_wrap(~sex, scales = "free_x") +
scale_x_reordered() +
theme(axis.text.x = element_text(size = 8)) +
scale_y_continuous(labels = percent_format()) |
| temperance-17 |
exercise |
library(scales)
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex"),
draw = FALSE) %>%
as_tibble() %>%
group_by(zone, sex) %>%
mutate(sort_order = estimate[race == "Black"]) %>%
ungroup() %>%
mutate(zone = reorder_within(zone, sort_order, sex)) %>%
ggplot(aes(x = zone, color = race)) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
width = 0.2,
position = position_dodge(width = 0.5)) +
geom_point(aes(y = estimate),
size = 1,
position = position_dodge(width = 0.5)) +
facet_wrap(~sex, scales = "free_x") +
scale_x_reordered() +
theme(axis.text.x = element_text(size = 8)) +
scale_y_continuous(labels = percent_format()) +
labs(title = "Predicted Arrest Rate of New Orleans Motorists by Zones",
subtitle = "Black motorists are more likely to get arrested during a traffic stop than White motorists.",
x = "Zone",
y = "Estimated Arrest Probability (%)",
caption = "Data from the Stanford Open Policing Project",
color = "Race") |
| temperance-18 |
question |
> tutorial.helpers::show_file("stops.qmd", pattern = "marginaleffects|tidytext")
library(tidytext)
library(marginaleffects)
> tutorial.helpers::show_file("stops.qmd", start = -8)
scale_y_continuous(labels = percent_format()) +
labs(title = "Predicted Arrest Rate of New Orleans Motorists by Zones",
subtitle = "Black motorists are more likely to get arrested during a traffic stop than White motorists.",
x = "Zone",
y = "Estimated Arrest Probability (%)",
caption = "Data from the Stanford Open Policing Project",
color = "Race")
```
> |
| temperance-19 |
question |
For example, the difference in predicted arrest probability between Black and White male drivers in Zone W is approximately 12 percentage points, with a 95% confidence interval that does not overlap zero, indicating a statistically significant disparity. |
| temperance-20 |
question |
The estimates for the quantities of interest might be wrong because the model assumes no unmeasured confounders, yet factors like officer discretion, stop severity, or missing data could bias results. Additionally, the linear model may oversimplify the binary outcome, potentially underestimating uncertainty. A more appropriate approach could involve using a logistic regression model, which better fits binary outcomes, and computing robust confidence intervals that account for heteroskedasticity. This might slightly change the estimated arrest rate differences and widen the confidence intervals to reflect greater uncertainty. |
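The logistic alternative mentioned above might look like the following sketch. It uses base R's `glm()` on simulated data rather than the stops data, so every number in it (sample size, coefficients, zone labels) is made up for illustration, and `sim` and `fit_logit` are hypothetical names.

```r
set.seed(1)
n <- 2000

# Simulated stand-in for the filtered stops data (not real data)
sim <- data.frame(
  sex  = sample(c("Female", "Male"), n, replace = TRUE),
  race = sample(c("Black", "White"), n, replace = TRUE),
  zone = sample(c("A", "B", "C"), n, replace = TRUE)
)

# Made-up arrest probabilities on the logit scale
p <- plogis(-1.5 + 0.4 * (sim$sex == "Male") - 0.3 * (sim$race == "White"))
sim$arrested <- rbinom(n, 1, p)

# Same formula as the linear model, but with a binomial (logit) link
fit_logit <- glm(arrested ~ sex + race * zone, data = sim, family = binomial)

# Predicted probabilities from a logistic model always lie in [0, 1]
pred <- predict(fit_logit, type = "response")
range(pred)
```

Unlike the linear model, the fitted probabilities here cannot fall below 0 or above 1, which is one reason a logistic model is often preferred for a binary outcome such as arrested.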
| temperance-21 |
question |
> tutorial.helpers::show_file("stops.qmd")
---
title: "Stops"
author: "Hassan Ali"
format: html
execute:
echo: false
---
```{r}
#| message: false
library(tidyverse)
library(primer.data)
library(tidymodels)
library(broom)
library(equatiomatic)
library(tidytext)
library(marginaleffects)
```
Arrest rates during traffic stops can vary across different demographic groups, influenced by factors such as race and other driver characteristics. Using data from the Stanford Open Policing Project, which includes over 400,000 traffic stops in New Orleans from 2011 to 2018, we examine whether race predicts the likelihood of being arrested.
So far, I have used traffic stop data from the Stanford Open Policing Project, covering over 400,000 stops in New Orleans, to explore whether race predicts the likelihood of arrest. The analysis focuses on differences in arrest rates between Black and White drivers while considering other factors. However, unmeasured confounders, such as officer discretion or unrecorded stop details, may bias the results. I am using a linear regression model with arrested as the dependent variable and sex, race, zone, and the interaction between race and zone as predictors. The results show that being male is associated with a higher likelihood of arrest. For example, the difference in predicted arrest probability between Black and White male drivers in Zone W is approximately 12 percentage points, with a 95% confidence interval that does not overlap zero, indicating a statistically significant disparity.
```{r}
#| label: eda
x <- stops |>
filter(race %in% c("black", "white")) |>
mutate(race = str_to_title(race),
sex = str_to_title(sex))
```
$$
P(Y = 1 \mid X) = \rho = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k)}}
$$
```{r}
#| label: model
#| cache: true
fit_stops <- linear_reg(engine = "lm") %>%
fit(arrested ~ sex + race * zone, data = x)
```
```{r math}
extract_eq(fit_stops$fit,
intercept = "beta",
wrap = TRUE,
use_coefs = TRUE,
terms_per_line = 2)
```
```{r table}
tidy(fit_stops, conf.int = TRUE)
```
```{r plot}
library(scales)
plot_predictions(fit_stops$fit,
newdata = "balanced",
condition = c("zone", "race", "sex"),
draw = FALSE) %>%
as_tibble() %>%
group_by(zone, sex) %>%
mutate(sort_order = estimate[race == "Black"]) %>%
ungroup() %>%
mutate(zone = reorder_within(zone, sort_order, sex)) %>%
ggplot(aes(x = zone, color = race)) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
width = 0.2,
position = position_dodge(width = 0.5)) +
geom_point(aes(y = estimate),
size = 1,
position = position_dodge(width = 0.5)) +
facet_wrap(~sex, scales = "free_x") +
scale_x_reordered() +
theme(axis.text.x = element_text(size = 8)) +
scale_y_continuous(labels = percent_format()) +
labs(title = "Predicted Arrest Rate of New Orleans Motorists by Zones",
subtitle = "Black motorists are more likely to get arrested during a traffic stop than White motorists.",
x = "Zone",
y = "Estimated Arrest Probability (%)",
caption = "Data from the Stanford Open Policing Project",
color = "Race")
```
> |
| temperance-22 |
question |
https://github.com/alisoni007/stops |
| minutes |
question |
160 |