id submission_type answer
tutorial-id none stops
name question Jack Xu
email question bluebird.jack.xu@gmail.com
introduction-1 question Wisdom, Justice, Courage, Temperance
the-question-1 exercise library(tidyverse)
the-question-2 exercise library(primer.data)
the-question-3 question This data is from the Stanford Open Policing Project, which aims to improve police accountability and transparency by providing data on traffic stops across the United States. The New Orleans dataset includes detailed information about traffic stops conducted by the New Orleans Police Department.
the-question-4 question "arrested" variable is the outcome
the-question-5 question "body_camera_on" is a variable that is TRUE when a police officer wears a body camera, which may affect arrest rates. We can randomly assign the variable to police officers for research purposes.
the-question-6 question There are two potential outcomes for each unit: the outcome if the unit receives the treatment and the outcome if it does not.
the-question-7 question "mask" can take two values for a single unit: TRUE or FALSE. Suppose that, for one unit, the potential outcome is FALSE when "mask" is TRUE and TRUE when "mask" is FALSE. The causal effect for this unit is the outcome under treatment minus the outcome under control: FALSE - TRUE = 0 - 1 = -1.
the-question-8 question "race" since this can subconsciously affect decisions of police officers
the-question-9 question "white" and "black" are values for "race"
the-question-10 question Which race is more likely to be arrested in traffic stops?
wisdom-1 question The question, Preceptor Table, and the assumption of validity.
wisdom-2 question A Preceptor Table is the most concise target data we wish we had already.
wisdom-3 question A row represents one unit, a single column with the type of value we want represents the outcome, and the other columns are the covariates.
wisdom-4 question > show_file("stops.qmd") --- title: "Stops" author: "Jack Xu" format: html --- >
wisdom-5 question The units for this problem are the drivers in New Orleans.
wisdom-6 question The outcome variable is the arrest state for each traffic stop.
wisdom-7 question Wearing a mask, sleeping time, location
wisdom-8 question There is no treatment for this problem.
wisdom-9 question After each stop was recorded
wisdom-10 question The difference between the two potential outcomes: the outcome under treatment minus the outcome under control.
wisdom-11 question It means only one potential outcome per unit can be observed.
wisdom-12 question If one of the covariates, like age, could be manipulated, we could observe how it affects the result, the variable "arrested".
wisdom-13 question The rows represent each stop. The outcome is the column "arrested". The main covariate is "race".
wisdom-14 question > show_file("stops.qmd", start = -5) ```{r} #| message: false library(tidyverse) library(primer.data) ``` >
wisdom-15 question Validity is the consistency between the columns of the Preceptor Table and the corresponding columns of our data, so that both can be treated as coming from the same population.
wisdom-16 question The "Zone" column might be different as the data might come from a different state.
wisdom-17 question Traffic stops have people of different races, some of those people can get arrested. We want to understand the difference in arrest rates between Black and White drivers in New Orleans, based on data from the Stanford Open Policing Project.
wisdom-18 question > tutorial.helpers::show_file("stops.qmd", chunk = "last") --- title: "Stops" author: "Jack Xu" format: html execute: echo: false --- ```{r} #| message: false library(tidyverse) library(primer.data) ``` ```{r} #| label: eda x <- stops |> filter(race %in% c("black", "white")) |> mutate(race = str_to_title(race), sex = str_to_title(sex)) ``` # Summary Traffic stops have people of different races, some of those people can get arrested. We want to understand the difference in arrest rates between Black and White drivers in New Orleans, based on data from the Stanford Open Policing Project. >
justice-1 question These are stability, representativeness, confoundedness, and the construction of the Population Table.
justice-2 question It is the data of all reasonable observations that can figure out the quantity of interest.
justice-3 question The assumption of stability means the rows or observations are still meaningful across time.
justice-4 question The average "age" in the data can differ from the average age in the Preceptor Table, for example because of immigration from a country with a higher average age.
justice-5 question The data is an accurate sample of the population, and the population is a precise representation of the Preceptor Table.
justice-6 question The data's observations were traffic stops only in New Orleans, which does not represent every single traffic stop.
justice-7 question The population may have different regions that have different arrest policies compared to the Preceptor Table. Those regions can have stricter policies, making the Preceptor Table a biased sample from the population.
justice-8 question It means no hidden variables other than the treatment are affecting the potential outcomes.
justice-9 question Traffic stops have people of different races, some of those people can get arrested. We want to understand the difference in arrest rates between Black and White drivers in New Orleans, based on data from the Stanford Open Policing Project. We understood what the population represents and described the biasing factors that affect the data. One of those misleading factors was how the arrest laws change over time.
courage-1 question The component of Courage is the model: we write it down mathematically from our question and then implement it in code.
courage-2 exercise library(tidymodels)
courage-3 exercise library(broom)
courage-4 question \text{logit}(\Pr(Y = 1)) = \log\left(\frac{\Pr(Y = 1)}{1 - \Pr(Y = 1)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k
courage-5 question > tutorial.helpers::show_file("stops.qmd", pattern = "library") library(tidyverse) library(primer.data) library(tidymodels) library(broom) >
courage-6 exercise linear_reg(engine = "lm")
courage-7 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex, data = x)
courage-8 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex, data = x) |> tidy(conf.int = TRUE)
courage-9 exercise linear_reg(engine = "lm") |> fit(arrested ~ race, data = x)
courage-10 exercise linear_reg(engine = "lm") |> fit(arrested ~ race, data = x) |> tidy(conf.int = TRUE)
courage-11 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex + race, data = x) |> tidy(conf.int = TRUE)
courage-12 exercise linear_reg(engine = "lm") |> fit(arrested ~ sex + race*zone, data = x) |> tidy(conf.int = TRUE)
courage-13 exercise fit_stops
courage-15 exercise library(easystats)
courage-17 exercise check_predictions(extract_fit_engine(fit_stops))
courage-18 question \widehat{P}(Y = 1) = \frac{1}{1 + \exp\left( - \left( 0.204 + 0.0631 \times \text{sexMale} - 0.0450 \times \text{raceWhite} \right) \right)}
courage-19 question > tutorial.helpers::show_file("stops.qmd", start = -8) #| cache: true fit_stops <- linear_reg(engine = "lm") |> fit(arrested ~ sex + race*zone, data = x) |> tidy(conf.int = TRUE) ``` # Summary Traffic stops have people of different races, some of those people can get arrested. We want to understand the difference in arrest rates between Black and White drivers in New Orleans, based on data from the Stanford Open Policing Project. We understood that the population represents all drivers for finding the true quantity of interest and described the biasing factors, including changing arrest policies and biased officers, that may affect the data. One of those misleading factors was that arrest laws change over time, which makes the data unrepresentative of the population. >
courage-20 question > tutorial.helpers::show_file(".gitignore") *files *_cache >
courage-21 exercise tidy(fit_stops, conf.int = TRUE)
courage-22 question > tutorial.helpers::show_file("stops.qmd", chunk = "Last") fit_stops %>% select(term, estimate, conf.low, conf.high) %>% mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) %>% kable() >
courage-23 question Traffic stops have people of different races, some of those people can get arrested. We want to understand the difference in arrest rates between Black and White drivers in New Orleans, based on data from the Stanford Open Policing Project. We understood that the population represents all drivers for finding the true quantity of interest and described the biasing factors, including changing arrest policies and biased officers, that may affect the data. One of those misleading factors was that arrest laws change over time, which makes the data unrepresentative of the population. I model arrest, a variable holding either TRUE or FALSE, as a linear function of the covariates "sex", "race", and "zone".
temperance-1 question Temperance uses the data generating mechanism to make a best guess at the quantity of interest, while acknowledging that the guess is not the true quantity.
temperance-2 question Because the model is linear, the coefficient is on the probability scale: being male is associated with a predicted probability of arrest that is about 0.06 (6 percentage points) higher.
temperance-3 question The coefficient for White drivers is about -0.04, so their predicted probability of arrest is about 4 percentage points lower than that of comparable Black drivers. They are less likely to be arrested than Black drivers.
temperance-4 question A Black woman in Zone A has a predicted probability of being arrested of about 0.18.
temperance-5 exercise library(marginaleffects)
temperance-6 question We are investigating traffic stops. We want to know the arrest rates for White and Black drivers in each traffic stop accounting for zone, sex, and race.
temperance-7 exercise plot_predictions(fit_stops, condition = c("sex", "race"))
temperance-8 exercise plot_predictions(fit_stops$fit, newdata = "balanced", condition = c("zone", "race", "sex"), draw = FALSE) |> as_tibble() |> group_by(zone, sex) |> mutate(sort_order = estimate[race == "Black"]) |> ungroup() |> mutate(zone = reorder_within(zone, sort_order, sex)) |> ggplot(aes(x = zone, color = race)) + geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2, position = position_dodge(width = 0.5)) + geom_point(aes(y = estimate), size = 1, position = position_dodge(width = 0.5)) + facet_wrap(~ sex, scales = "free_x") + scale_x_reordered() + theme(axis.text.x = element_text(size = 8)) + scale_y_continuous(labels = percent_format())
temperance-9 question # Remove the call to tidy() for creating fit_stops to create fit_stops_model plot_predictions(fit_stops_model, newdata = "balanced", condition = c("zone", "race", "sex"), draw = FALSE) |> as_tibble() |> group_by(zone, sex) |> mutate(sort_order = estimate[race == "Black"]) |> ungroup() |> mutate(zone = reorder_within(zone, sort_order, sex)) |> ggplot(aes(x = zone, color = race)) + geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2, position = position_dodge(width = 0.5)) + geom_point(aes(y = estimate), size = 1, position = position_dodge(width = 0.5)) + facet_wrap(~ sex, scales = "free_x") + scale_x_reordered() + theme(axis.text.x = element_text(size = 8)) + scale_y_continuous(labels = percent_format()) + labs( title = "Arrest Rates of White and Black Motorists in Traffic Stops", subtitle = "White motorists usually have less arrest rates than Black motorists", caption = "Data: Stanford Open Policing Project (2018)", x = "Zone", y = "Arrest Rate", color = "Race" )
temperance-10 question > tutorial.helpers::show_file("stops.qmd", start = -8) x = "Zone", y = "Arrest Rate", color = "Race" ) ``` # Summary Traffic stops have people of different races, some of those people can get arrested. We want to understand the difference in arrest rates between Black and White drivers in New Orleans, based on data from the Stanford Open Policing Project. We understood that the population represents all drivers for finding the true quantity of interest and described the biasing factors, including changing arrest policies and biased officers, that may affect the data. One of those misleading factors was that arrest laws change over time, which makes the data unrepresentative of the population. I model rates of arrest, a variable holding either TRUE or FALSE, as a logistic function of "sex", "race", and "zone". This shows us that White drivers are less likely of getting arrested than females. >
temperance-11 question Our quantity of interest is the difference in arrest rates between White and Black motorists. Using the approximate information from the table, the coefficient for White motorists is -0.045, meaning their predicted probability of arrest is about 0.045 lower than for Black motorists, with a 95% confidence interval from -0.057 to -0.032. The interval does not include 0, indicating statistical significance.
temperance-12 question Officers might record stops selectively depending on whether an arrest occurred; for example, the data can contain more "arrested = 1" values than normal. There are also many rows with missing values, which weakens the assumption of representativeness. Because of this possible excess of "arrested = 1" values, we can lower our estimated percentages.
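To illustrate how a coefficient from a linear model with a TRUE/FALSE outcome can be read, here is a minimal sketch in base R. It assumes a small made-up data frame called `toy` (hypothetical, not the New Orleans stops data) and uses `lm()` directly rather than the tidymodels wrapper. With a single binary covariate, the slope is simply the difference between the two groups' arrest proportions.

```r
# Hypothetical toy data: 10 Black and 10 White drivers (NOT the real stops data).
toy <- data.frame(
  race = factor(rep(c("Black", "White"), each = 10),
                levels = c("Black", "White")),
  # 3 of 10 Black drivers arrested, 2 of 10 White drivers arrested
  arrested = c(rep(1, 3), rep(0, 7), rep(1, 2), rep(0, 8))
)

# Linear model with a 0/1 outcome
fit <- lm(arrested ~ race, data = toy)

# Intercept = arrest proportion for the baseline group (Black): 0.3
# raceWhite = difference in proportions, White minus Black: -0.1
coef(fit)

# The same difference computed directly from group means
group_means <- tapply(toy$arrested, toy$race, mean)
group_means[["White"]] - group_means[["Black"]]

# 95% confidence intervals for the coefficients
confint(fit, level = 0.95)
```

In this sketch a slope of -0.1 reads as "10 percentage points lower", not as a change in log-odds; the log-odds reading would apply only to a logistic model.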
temperance-13 question > tutorial.helpers::show_file("stops.qmd") --- title: "Stops" author: "Jack Xu" format: html execute: echo: false --- **Linear Function** $$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon $$ **Fitted Model** $$ \widehat{Y} = 0.204 + 0.0631 \times \text{sex}_{\text{Male}} - 0.0450 \times \text{race}_{\text{White}} $$ ```{r} #| message: false library(tidyverse) library(primer.data) library(tidymodels) library(broom) library(knitr) library(marginaleffects) library(tidytext) ``` ```{r} #| label: eda x <- stops |> filter(race %in% c("black", "white")) |> mutate(race = str_to_title(race), sex = str_to_title(sex)) ``` ```{r} #| cache: true fit_stops_model <- linear_reg(engine = "lm") |> fit(arrested ~ sex + race*zone, data = x) fit_stops <- fit_stops_model |> tidy(conf.int = TRUE) ``` **Estimates and Confience Intervals of the Model** ```{r} fit_stops %>% select(term, estimate, conf.low, conf.high) %>% mutate(across(c(estimate, conf.low, conf.high), \(x) round(x, 3))) %>% kable() ``` # Plot ```{r} plot_predictions(fit_stops_model, newdata = "balanced", condition = c("zone", "race", "sex"), draw = FALSE) |> as_tibble() |> group_by(zone, sex) |> mutate(sort_order = estimate[race == "Black"]) |> ungroup() |> mutate(zone = reorder_within(zone, sort_order, sex)) |> ggplot(aes(x = zone, color = race)) + geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2, position = position_dodge(width = 0.5)) + geom_point(aes(y = estimate), size = 1, position = position_dodge(width = 0.5)) + facet_wrap(~ sex, scales = "free_x") + scale_x_reordered() + theme(axis.text.x = element_text(size = 8)) + scale_y_continuous(labels = percent_format()) + labs( title = "Arrest Rates of White and Black Motorists in Traffic Stops", subtitle = "White motorists usually have less arrest rates than Black motorists", caption = "Data: Stanford Open Policing Project (2018)", x = "Zone", y = "Arrest Rate", color = "Race" ) ``` # Summary Traffic 
stops have people of different races, some of those people can get arrested. We want to understand the difference in arrest rates between Black and White drivers in New Orleans, based on data from the Stanford Open Policing Project. We understood that the population represents all drivers for finding the true quantity of interest and described the biasing factors, including changing arrest policies and biased officers, that may affect the data. One of those misleading factors was that arrest laws change over time, which makes the data unrepresentative of the population. I model rates of arrest, a variable holding either TRUE or FALSE, as a linear function of "sex", "race", and "zone". This shows us that White drivers are less likely of getting arrested than Black drivers. Our quantity of interest is the difference of arrest rates between White and Black motorists. Using the approximate information from the fitted model, the probability is about 24.7% for a Black driver getting arrested and 20.3% for a White driver getting arrested. >
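As a quick check of the summary above, the two predicted probabilities quoted there can be subtracted to give the estimated gap. This is only arithmetic on the reported figures, not a new estimate:

$$
\widehat{P}(\text{arrested} \mid \text{Black}) - \widehat{P}(\text{arrested} \mid \text{White}) \approx 0.247 - 0.203 = 0.044
$$

That is a difference of roughly 4.4 percentage points.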
temperance-14 question https://jackxu3.github.io/stops/
temperance-15 question https://github.com/jackxu3/stops
minutes question 314