gimap for Treatment Experiment

gimap tutorial for treatments

gimap performs analysis of dual-targeting CRISPR screening data, with the goal of aiding the identification of genetic interactions (e.g. cooperativity, synthetic lethality) in models of disease and other biological contexts. gimap analyzes functional genomic data generated by the pgPEN (paired guide RNAs for genetic interaction mapping) approach, quantifying growth effects of single and paired gene knockouts upon application of a CRISPR library. A multitude of CRISPR screen types can be used for this analysis, with helpful descriptions found in this review (https://www.nature.com/articles/s43586-021-00093-4). Use of pgPEN and GI-mapping in a paired gRNA format can be found here (https://pubmed.ncbi.nlm.nih.gov/34469736/).

library(gimap)
library(dplyr)
output_dir <- "output_treatment"
dir.create(output_dir, showWarnings = FALSE)

Data loading and setup

In this example we are going to examine a dataset with drug treatments and controls.

Let’s examine this example pgPEN counts table. It’s divided into columns containing:

For the purposes of this tutorial, we can grab example data from the package.

example_data <- get_example_data("count_treatment")

The metadata you have may vary slightly from this but you’ll want to make sure you have the essential variables and information regarding how you collected your data.

colnames(example_data)

Setting up data

We’re going to set up three datasets that we will provide to the set_up() function to create a gimap dataset object.

For this example we are using a treatment dataset where there is a pretreatment sample, two vehicle replicates and two drug treated replicates.

counts <- example_data %>%
  select(c("pretreatment", "dmsoA", "dmsoB", "drug1A", "drug1B")) %>%
  as.matrix()

The next datasets are metadata that describe the dimensions of the count data.

pg_id are just the unique IDs listed in the same order/sorted the same way as the count data and can be used for mapping between the count data and the metadata. This is required and very important because it is necessary to know the IDs and be able to map them to pgRNA constructs and counts data.

pg_ids <- example_data %>%
  dplyr::select("id")

Sample metadata is the information that describes the samples and is sorted the same order as the columns in the count data.

You need to have two columns in the metadata you provide. You’ll need to specify the names of these columns in the gimap_annotate() function.

sample_metadata <- data.frame(
  col_names = c("pretreatment", "dmsoA", "dmsoB", "drug1A", "drug1B"),
  drug_treatment = as.factor(c("pretreatment", "dmso", "dmso", "drug", "drug"))
)

We’ll need to provide counts, pg_ids and sample_metadata to setup_data().

Now let’s setup our data using setup_data(). Optionally we can provide the metadata in this function as well so that it is stored with the data.

gimap_dataset <- setup_data(
  counts = counts,
  pg_ids = pg_ids,
  sample_metadata = sample_metadata
)

You’ll notice that this set up gives us a list of formatted data. This contains the original counts we gave setup_data() function but also normalized counts, and the total counts per sample.

str(gimap_dataset)

Let’s see how many rows of data we have.

nrow(gimap_dataset$transformed_data$log2_cpm)

Quality Checks

The first step is running some quality checks on our data. The run_qc() function will create a report we can look at to assess this.

The report includes several visualizations of raw/unfiltered data:

This report also includes several visualizations after filters are applied:

There is a filter that flags pgRNA constructs where any of the time points have a count of zero. - We include a bar plot that shows the number of pgRNA constructs which have counts of zero in either 0, 1, 2, or 3 replicates. - We include a table that specifies how many pgRNAs would be filtered out by applying this filter.

There is a filter that flags pgRNA constructs that have low log2 CPM counts for the day 0 or plasmid time point. - The histogram of the log2 CPM values of pgRNA constructs at the plasmid time point mentioned earlier does have a dashed line specifying the lower outlier (or a user defined cutoff) and pgRNA constructs with a plasmid log2 CPM lower than that value can be filtered out. - We include a table that specifies how many pgRNAs would be filtered out by applying this filter.

There is a filter that flags pgRNA constructs that have low log2 CPM counts for the day 0 or plasmid time point. - The histogram of the log2 CPM values of pgRNA constructs at the plasmid time point mentioned earlier does have a dashed line specifying the lower outlier (or a user defined cutoff) and pgRNA constructs with a plasmid log2 CPM lower than that value can be filtered out. - We include a table that specifies how many pgRNAs would be filtered out by applying this filter.

run_qc(gimap_dataset, open_results = FALSE)

You can take a look at an example QC report here.

Filtering the data

After considering the QC report and which filters are appropriate/desired for your data, you can apply filters to the data using the gimap_filter function.

gimap_filtered <- gimap_dataset %>%
  gimap_filter()

How many rows of data do we have now after filtering?

nrow(gimap_filtered$filtered_data$transformed_log2_cpm)
str(gimap_dataset$filtered_data)

Let’s take a look at how many rows of data we have left:

nrow(gimap_dataset$filtered_data$transformed_log2_cpm)

As you can see from the output above, there are fewer pgRNA constructs in the filtered dataset following completion of filtering.

The filtering step also stores two tables of information that you may want to use or report.

Now that you’ve performed QC and filtering, the rest of the pipeline can be run

Normalization and calculating scores

gimap_dataset <- gimap_filtered %>%
  gimap_annotate(cell_line = "PC9") %>%
  # Whatever is specified for "control_name" is what will be used to normalize other data points
  gimap_normalize(
    treatments = "drug_treatment",
    control_name = "pretreatment"
  ) %>%
  calc_gi()

Here’s what’s included in the GI Scores table:

head(gimap_dataset$gi_scores)

Let’s check out the top results

head(dplyr::arrange(gimap_dataset$gi_score, fdr))

Plot the results

You can remove any samples from these plots by altering the reps_to_drop argument.

plot_exp_v_obs_scatter(gimap_dataset)
plot_rank_scatter(gimap_dataset)
plot_volcano(gimap_dataset)

Plot specific target pair

We can pick out a specific pair to plot.

# "MED12L_MED12" is top result so let's plot that
plot_targets_bar(gimap_dataset, target1 = "MED12L", target2 = "MED12")
# To plot results, pick out two targets from the gi_score table
head(dplyr::arrange(gimap_dataset$gi_score, fdr))

# "NDEL1_NDE1" is top result so let's plot that
plot_targets_bar(gimap_dataset, target1 = "NDEL1", target2 = "NDE1")

Saving results to files

We can save the genetic interactions scores like this:

readr::write_tsv(gimap_dataset$gi_scores, file.path(output_dir, "gi_scores.tsv"))

We can save all these data as an RDS.

saveRDS(gimap_dataset, "gimap_dataset_final_treatment.RDS")

Session Info

This is just for provenance purposes.

sessionInfo()