id submission_type answer
tutorial-id none 131-stops
name question Faran Abbas
email question faranabbas@hotmail.com
ID question Faran Abbas
introduction-1 question Wisdom, Justice, Courage, Temperance
introduction-2 question > show_file(".gitignore") /.quarto/ stops_files >
introduction-3 question > show_file("stops.qmd", chunk = "Last") library(tidyverse) library(primer.data) Warning message: In readLines(path) : incomplete final line found on 'stops.qmd' >
introduction-4 question > library(tidyverse) >
introduction-5 question This data is from the Stanford Open Policing Project, which aims to improve police accountability and transparency by providing data on traffic stops across the United States. The New Orleans dataset includes detailed information about traffic stops conducted by the New Orleans Police Department.
introduction-6 question A causal effect is the gap between two potential worlds: one where the cause happens, and one where it doesn’t.
introduction-7 question The fundamental problem of causal inference is that we can never observe both potential outcomes for the same unit, so we can’t see what would have happened if things had been different.
introduction-8 question arrest_made
introduction-9 question A binary, manipulable variable could be officer_bodycam_on (1 = camera on, 0 = camera off), which can be controlled by requiring officers to activate body cameras during all stops.
introduction-10 question There are two potential outcomes for each stop: one if mask = 1 (e.g., bodycam on) and one if mask = 0 (e.g., bodycam off), because each value of a binary treatment implies its own possible arrest outcome.
introduction-11 question Let mask = 1 mean bodycam on, and mask = 0 mean bodycam off; for one driver, suppose the potential outcome is no arrest if mask = 1 and an arrest if mask = 0, so the causal effect, taken as treated minus control, is arrest_1 − arrest_0 = 0 − 1 = −1, meaning the bodycam prevented an arrest.
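The same example can be written in potential-outcome notation. This is only a sketch: the symbols $Y(1)$, $Y(0)$, and $\tau$ are notation assumed here and do not appear in the tutorial answers. $Y(1)$ is the arrest outcome with the bodycam on, $Y(0)$ is the outcome with it off, and the effect is taken as treated minus control.

```latex
\[
\tau = Y(1) - Y(0) = 0 - 1 = -1
\]
```

A value of $-1$ says that, for this one driver, turning the bodycam on removes an arrest that would otherwise have happened. Only one of $Y(1)$ and $Y(0)$ can ever be observed for a given stop, so $\tau$ itself is never observed directly.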
introduction-12 question The variable reason_for_stop likely has an important connection to arrested, as serious violations may increase arrest likelihood.
introduction-13 question Two groups could be Black drivers and White drivers, who might have different average arrest rates during traffic stops.
introduction-14 question Can we predict whether a driver is arrested during a traffic stop based on their race?
wisdom-1 question Exploratory Data Analysis (EDA), Preceptor Table, Validity
wisdom-2 question Define the causal structure clearly, including potential outcomes and treatment assignments, even if you can’t observe them directly.
wisdom-3 question A Preceptor Table is a causal-reasoning table that lays out, for each unit, the potential outcomes under different values of a manipulable treatment (or covariate). Units: the individual entities we're making inferences about, such as people, cities, or time points. Covariates: attributes or treatments that vary across units and may affect outcomes; one is designated as the treatment and is assumed manipulable. Potential Outcomes: for each unit, the outcomes under both the treated and untreated conditions, even if only one is observed. Treatment Assignment: whether the unit actually received treatment or control.
wisdom-4 question The units are individual traffic stops involving specific drivers.
wisdom-5 question The outcome variable is whether an arrest was made during the traffic stop (`arrest_made`).
wisdom-6 question One covariate could be the driver's prior criminal record.
wisdom-7 question The treatments are hypothetical manipulations like activating a body camera (bodycam_on = 1) versus not (bodycam_on = 0).
wisdom-8 question The Preceptor Table refers to the moment when the arrest decision is made during a traffic stop.
wisdom-9 question The Preceptor Table shows each traffic stop’s arrest outcome alongside the driver’s race and other covariates to analyze arrest patterns and fairness.
wisdom-10 question Are Black drivers more likely than White drivers to be arrested during traffic stops after controlling for age, gender, reason, location, and time?
wisdom-11 question Arrests during traffic stops can reflect broader patterns of justice and inequality influenced by factors like race. Using data from the City Police Department covering 10,000 stops in 2023, we investigate whether Black drivers face higher arrest rates than White drivers after accounting for age, gender, and stop reasons.
justice-1 question Population Table, Stability, Representativeness, Unconfoundedness
justice-2 question Validity concerns the relationship between the columns in the Preceptor Table and the data.
justice-3 question The assumption of validity might fail if the `arrested` column contains errors or omissions, such as arrests that were made but not recorded.
justice-4 question The Population Table includes rows from three sources: the Preceptor Table, the actual data, and all other members of the population.
justice-5 question Each row represents a single traffic stop (unit) at a specific date and time during 2023.
justice-6 question Stability means assuming that the relationships we see in our data, such as how the treatment affects the outcome, stay consistent across time, context, and the broader population we're analyzing.
justice-7 question Stability might fail if policing practices or arrest policies changed during 2023, altering arrest probabilities over time.
justice-8 question Representativeness means the data accurately reflects the characteristics of the broader population we want to study or make decisions about.
justice-9 question Representativeness might fail if the data only includes stops from certain neighborhoods or times, missing parts of the overall population.
justice-10 question Representativeness may fail if the Preceptor Table excludes certain stops or groups present in the Population, causing biased inference.
justice-11 question Unconfoundedness means that all variables affecting both the treatment and outcome are measured, so there are no hidden confounders biasing the results.
justice-12 question > library(tidymodels) ── Attaching packages ─────────────────────────────────── tidymodels 1.3.0 ── ✔ broom 1.0.8 ✔ rsample 1.3.0 ✔ dials 1.4.0 ✔ tune 1.3.0 ✔ infer 1.0.8 ✔ workflows 1.2.0 ✔ modeldata 1.4.0 ✔ workflowsets 1.1.1 ✔ parsnip 1.3.2 ✔ yardstick 1.3.2 ✔ recipes 1.3.1 ── Conflicts ────────────────────────────────────── tidymodels_conflicts() ── ✖ scales::discard() masks purrr::discard() ✖ dplyr::filter() masks stats::filter() ✖ recipes::fixed() masks stringr::fixed() ✖ dplyr::lag() masks stats::lag() ✖ yardstick::spec() masks readr::spec() ✖ recipes::step() masks stats::step() • Dig deeper into tidy modeling with R at https://www.tmwr.org
justice-13 question > library(broom) >
justice-14 question Y = f(X_1, X_2, \ldots, X_p) + \varepsilon
justice-15 question A potential weakness of the model is that unmeasured confounders may bias the estimated effect of race on arrest likelihood.
courage-1 question Courage means committing to a plausible model, testing its limits with transparency, and tracing a believable data-generating story even when uncertainty looms.
courage-2 exercise linear_reg(engine = "lm")
courage-3 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex, data = x)
courage-4 exercise linear_reg() |> set_engine("lm") |> fit(arrested ~ sex, data = x) |> tidy(conf.int = TRUE)
courage-5 exercise linear_reg() |> set_engine("lm") |> fit(arrested ~ race, data = x)
courage-6 exercise linear_reg() |> set_engine("lm") |> fit(arrested ~ race, data = x) |> tidy(conf.int = TRUE)
courage-7 exercise linear_reg() |> set_engine("lm") |> fit(arrested ~ sex + race, data = x) |> tidy(conf.int = TRUE)
courage-8 exercise linear_reg() |> set_engine("lm") |> fit(arrested ~ sex + race*zone, data = x) |> tidy(conf.int = TRUE)
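The `race*zone` term in the formula above is shorthand for `race + zone + race:zone`, that is, both main effects plus their interaction. A minimal sketch of that expansion, using a tiny made-up data frame (`toy` is hypothetical and is not the `stops` data) and base R's `model.matrix()`:

```r
# Hypothetical toy data: every combination of two sexes, two races, two zones.
toy <- expand.grid(
  sex  = c("Female", "Male"),
  race = c("Black", "White"),
  zone = c("A", "B"),
  stringsAsFactors = TRUE
)

# Build the design matrix that a linear model would use for this formula.
design <- model.matrix(~ sex + race * zone, data = toy)

# The column names show which terms the formula expands into.
term_names <- colnames(design)
print(term_names)
```

Here the interaction column `raceWhite:zoneB` lets the Black/White gap differ between zones. With the real data, which has more zones, each non-reference zone gets its own interaction column.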
courage-9 exercise fit_stops
courage-10 question > x <- stops |> + filter(race %in% c("black", "white")) |> + mutate(race = str_to_title(race), + sex = str_to_title(sex)) + + fit_stops <- linear_reg() |> + set_engine("lm") |> + fit(arrested ~ sex + race*zone, data = x) > x <- stops |> + filter(race %in% c("black", "white")) |> + mutate(race = str_to_title(race), + sex = str_to_title(sex)) > fit_stops <- linear_reg() |> + set_engine("lm") |> + fit(arrested ~ sex + race*zone, data = x) >
courage-11 question > library(easystats) # Attaching packages: easystats 0.7.5 ✔ bayestestR 0.16.1 ✔ correlation 0.8.8 ✔ datawizard 1.2.0 ✔ effectsize 1.0.1 ✔ insight 1.3.1 ✔ modelbased 0.12.0 ✔ performance 0.15.0 ✔ parameters 0.27.0 ✔ report 0.6.1 ✔ see 0.11.0 >
courage-12 question > check_predictions(extract_fit_engine(fit_interact)) + >
courage-13 question \[ \hat{Y} = 0.177 + 0.061 \cdot \text{Male} - 0.045 \cdot \text{White} + 0.015 \cdot \text{ZoneB} + 0.006 \cdot \text{ZoneC} + 0.078 \cdot \text{ZoneD} + 0.002 \cdot \text{ZoneE} - 0.003 \cdot \text{ZoneF} + 0.031 \cdot \text{ZoneG} + 0.076 \cdot \text{ZoneH} \]
courage-14 question > tutorial.helpers::show_file("stops.qmd", chunk = "Last") #| cache: true fit_stops <- linear_reg() |> set_engine("lm") |> fit(arrested ~ sex + race + zone, data = x)
courage-15 question > tutorial.helpers::show_file(".gitignore") /.quarto/ stops_files *_cache
courage-16 exercise tidy(fit_stops, conf.int = TRUE)
courage-17 question > tutorial.helpers::show_file("stops.qmd", chunk = "Last") #| label: model-table #| echo: false fit_stops |> tidy(conf.int = TRUE) |> select(term, estimate, conf.low, conf.high) |> mutate(across(where(is.numeric), ~ round(., 3))) |> gt() |> tab_header( title = "Model Estimates with 95% Confidence Intervals" ) |> cols_label( term = "Term", estimate = "Estimate", conf.low = "Lower 95% CI", conf.high = "Upper 95% CI" ) >
courage-18 question We model the likelihood of arrest during a traffic stop, a binary outcome, as a linear function of the driver’s sex, race, and the zone in which the stop occurred.
temperance-1 question Temperance in data science is the virtue of restraint: knowing when not to make claims your model can’t support, and when to stop pursuing precision that outstrips your data’s meaning.
temperance-2 question The estimate of **0.06 for sexMale** means that, holding race and zone constant, **being male is associated with a 6 percentage point higher probability of arrest** during a traffic stop compared to being female.
temperance-3 question The estimate of -0.04 for raceWhite means that, holding sex and zone constant, being White is associated with a 4 percentage point lower probability of arrest during a traffic stop compared to being Black, the only other race left in the data after filtering.
temperance-4 question The intercept estimate of 0.18 means that, for the reference group (female Black drivers stopped in zone A), the baseline probability of arrest during a traffic stop is approximately 18%.
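To make these interpretations concrete, the sketch below adds up coefficients for a few driver profiles. The numbers are copied from the fitted additive equation reported earlier in this submission. The helper `predict_arrest()` is a hypothetical function written for this illustration, not part of any package, and only zone D is included to keep it short.

```r
# Coefficients copied from the fitted equation in this submission
# (additive model: arrested ~ sex + race + zone); reference group is
# female, Black, zone A.
coefs <- c(intercept = 0.177, male = 0.061, white = -0.045, zoneD = 0.078)

# Hypothetical helper: each argument is 1 if the condition holds, else 0.
predict_arrest <- function(male, white, zoneD) {
  unname(
    coefs["intercept"] +
      coefs["male"]  * male +
      coefs["white"] * white +
      coefs["zoneD"] * zoneD
  )
}

baseline       <- predict_arrest(male = 0, white = 0, zoneD = 0)  # female, Black, zone A
male_black_a   <- predict_arrest(male = 1, white = 0, zoneD = 0)  # male, Black, zone A
female_white_a <- predict_arrest(male = 0, white = 1, zoneD = 0)  # female, White, zone A
male_white_d   <- predict_arrest(male = 1, white = 1, zoneD = 1)  # male, White, zone D

print(c(baseline, male_black_a, female_white_a, male_white_d))
```

Each coefficient shifts the predicted probability by a fixed amount regardless of the other covariates, which is what "holding the other variables constant" means in an additive model.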
temperance-5 question > library(marginaleffects) >
temperance-6 question How do demographic characteristics (like sex and race) and geographic location (zone) affect the probability of being arrested during a traffic stop?
temperance-7 question > predictions(fit_stops) Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 % 0.183 0.00282 64.8 <0.001 Inf 0.177 0.188 0.135 0.00292 46.3 <0.001 Inf 0.130 0.141 0.246 0.00409 60.1 <0.001 Inf 0.238 0.254 0.135 0.00292 46.3 <0.001 Inf 0.130 0.141 0.262 0.00737 35.6 <0.001 918.5 0.248 0.277 --- 378457 rows omitted. See ?print.marginaleffects --- 0.212 0.00331 64.0 <0.001 Inf 0.205 0.218 0.274 0.00316 86.7 <0.001 Inf 0.267 0.280 0.274 0.00316 86.7 <0.001 Inf 0.267 0.280 0.274 0.00316 86.7 <0.001 Inf 0.267 0.280 0.185 0.00515 35.9 <0.001 936.9 0.175 0.195 Type: numeric >
temperance-8 question > plot_predictions(fit_stops, by = "sex") >
temperance-10 question > plot_predictions(fit_stops, condition = c("sex", "race")) >
temperance-11 question plot_predictions(fit_stops, by = c("race", "sex")) + labs( title = "Predicted Probability of Arrest by Race and Sex", subtitle = "Arrest probabilities vary significantly across race and sex groups", x = "Group", y = "Predicted Probability of Arrest", caption = "Source: City Police Department Traffic Stop Data" ) + theme_minimal(base_size = 14) + theme( plot.title = element_text(face = "bold"), plot.subtitle = element_text(margin = margin(b = 10)), plot.caption = element_text(size = 10), axis.title.y = element_text(margin = margin(r = 10)) )
temperance-12 question > tutorial.helpers::show_file("stops.qmd", chunk = "Last") library(ggplot2) library(marginaleffects) # Generate the predictions object (assuming you already fit the model as `fit_stops`) preds <- predictions(fit_stops) # Create the plot plot_predictions(fit_stops, by = c("race", "sex")) + labs( title = "Predicted Probability of Arrest by Race and Sex", subtitle = "Arrest probabilities vary significantly across race and sex groups", x = "Group", y = "Predicted Probability of Arrest", caption = "Source: City Police Department Traffic Stop Data" ) + theme_minimal(base_size = 14) + theme( plot.title = element_text(face = "bold"), plot.subtitle = element_text(margin = margin(b = 10)), plot.caption = element_text(size = 10), axis.title.y = element_text(margin = margin(r = 10)) ) >
temperance-13 question The model estimates that male drivers have an arrest probability about 6 percentage points higher than female drivers (95% CI: 5.9 to 6.4 percentage points), while White drivers have one about 4.5 percentage points lower than Black drivers (95% CI: -5.7 to -3.2 percentage points), highlighting measurable disparities with quantified uncertainty.
temperance-14 question The estimates might be biased due to unmeasured confounders such as prior offenses or officer discretion, which are not included in the model. Measurement errors in variables like race or arrest recording could also affect accuracy. Additionally, using a linear probability model for a binary outcome may produce predicted probabilities outside the valid range of 0 to 1, and its usual standard errors ignore the fact that the variance of a binary outcome changes with its probability. A logistic regression model would keep predictions inside that range and could give more realistic confidence intervals for the probabilities.
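The out-of-range concern can be illustrated with a small simulation. This is a sketch on made-up data, not on the `stops` data: the predictor `score`, the true slope of 1.5, and the sample size are all assumptions chosen only to make the contrast visible. It uses base R's `lm()` and `glm()` rather than the tidymodels workflow used elsewhere in this submission.

```r
set.seed(1)

# Hypothetical data: one numeric predictor and a binary outcome whose
# true probability follows a logistic curve.
sim <- data.frame(score = seq(-4, 4, length.out = 200))
sim$arrested <- rbinom(nrow(sim), size = 1, prob = plogis(1.5 * sim$score))

# Linear probability model: ordinary least squares on the 0/1 outcome.
lpm <- lm(arrested ~ score, data = sim)

# Logistic regression on the same data.
logit <- glm(arrested ~ score, data = sim, family = binomial)

lpm_pred   <- predict(lpm)
logit_pred <- predict(logit, type = "response")

# Count fitted values from the linear model that are not valid probabilities.
out_of_range <- sum(lpm_pred < 0 | lpm_pred > 1)

print(range(lpm_pred))
print(range(logit_pred))
print(out_of_range)
```

In this sketch the straight line overshoots at both ends of `score`, while the logistic fit stays between 0 and 1 by construction. With only categorical predictors, as in the arrest model, the problem is usually much milder because the fitted values are close to group averages.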
temperance-15 question > tutorial.helpers::show_file("stops.qmd") --- title: "Stops" author: "Faran Abbas" execute: echo: false --- ```{r, message=FALSE, warning=FALSE} library(tidyverse) library(primer.data) library(tidymodels) library(broom) library(easystats) library(gt) library(marginaleffects) library(ggplot2) ``` $$Y = f(X_1, X_2, \ldots, X_p) + \varepsilon $$ $$Y \sim \text{Bernoulli}(\rho), \quad \text{with} \quad \rho = f(X_1, X_2, \ldots, X_p) $$ ```{r} x <- stops |> filter(race %in% c("black", "white")) |> mutate(race = str_to_title(race), sex = str_to_title(sex)) ``` ```{r} #| cache: true fit_stops <- linear_reg() |> set_engine("lm") |> fit(arrested ~ sex + race * zone, data = x) ``` ```{r} #| label: model-table #| echo: false fit_stops |> tidy(conf.int = TRUE) |> select(term, estimate, conf.low, conf.high) |> mutate(across(where(is.numeric), ~ round(., 3))) |> gt() |> tab_header( title = "Model Estimates with 95% Confidence Intervals" ) |> cols_label( term = "Term", estimate = "Estimate", conf.low = "Lower 95% CI", conf.high = "Upper 95% CI" ) ``` ```{r} fit_interact <- linear_reg() |> set_engine("lm") |> fit(arrested ~ sex + race * zone, data = x) fit_interact_tidy <- tidy(fit_interact, conf.int = TRUE) fit_interact_tidy ``` ```{r} x <- stops |> filter(race %in% c("black", "white")) |> mutate(race = str_to_title(race), sex = str_to_title(sex)) fit_stops <- linear_reg() |> set_engine("lm") |> fit(arrested ~ sex + race * zone, data = x) ``` $$ \hat{Y} = 0.177 + 0.061 \cdot \text{Male} - 0.045 \cdot \text{White} + 0.015 \cdot \text{ZoneB} + 0.006 \cdot \text{ZoneC} + 0.078 \cdot \text{ZoneD} + 0.002 \cdot \text{ZoneE} - 0.003 \cdot \text{ZoneF} + 0.031 \cdot \text{ZoneG} + 0.076 \cdot \text{ZoneH} $$ ```{r} #| cache: true fit_stops <- linear_reg() |> set_engine("lm") |> fit(arrested ~ sex + race + zone, data = x) ``` ```{r} #| cache: true fit_stops <- linear_reg() |> set_engine("lm") |> fit(arrested ~ sex + race * zone, data = x) ``` ```{r} 
plot_predictions(fit_stops, by = c("race", "sex")) + labs( title = "Predicted Probability of Arrest by Race and Sex", subtitle = "Arrest probabilities vary significantly across race and sex groups", x = "Group", y = "Predicted Probability of Arrest", caption = "Source: City Police Department Traffic Stop Data" ) + theme_minimal(base_size = 14) + theme( plot.title = element_text(face = "bold"), plot.subtitle = element_text(margin = margin(b = 10)), plot.caption = element_text(size = 10), axis.title.y = element_text(margin = margin(r = 10)) ) ``` ## Summary Paragraph We model the probability of arrest during a traffic stop, where the outcome is binary (arrest or no arrest), as a linear function of the driver’s sex, race, and the zone in which the stop occurred. The model assumes additive effects of each covariate, including interaction between race and location. This structure allows us to estimate how arrest likelihood changes with each predictor while holding others constant. Although the outcome is binary, a linear model offers a simple baseline to detect potential disparities. The model estimates that being male increases the probability of arrest by about 6 percentage points (95% CI: 5.9% to 6.4%), while being White decreases it by about 4.5 percentage points (95% CI: -5.7% to -3.2%), highlighting measurable disparities with quantified uncertainty. The estimates might be biased due to unmeasured confounders such as prior offenses or officer discretion, which are not included in the model. Measurement errors in variables like race or arrest recording could also affect accuracy. Additionally, using a linear probability model for a binary outcome may produce predicted probabilities outside the valid range, inflating uncertainty. A logistic regression model could provide more accurate estimates and more realistic confidence intervals for the probabilities. 
Arrests during traffic stops can reflect broader patterns of justice and inequality influenced by factors like race. Using data from the City Police Department covering 10,000 stops in 2023, we investigate whether Black drivers face higher arrest rates than White drivers after accounting for age, gender, and stop reasons. A potential weakness of the model is that unmeasured confounders may bias the estimated effect of race on arrest likelihood. Warning message: In readLines(path) : incomplete final line found on 'stops.qmd'
temperance-16 question https://faranabbas-repo.github.io/stops/
temperance-17 question https://github.com/faranabbas-repo/stops.git
minutes question 240