ProActive

Jessie Maier

Introduction

ProActive automatically detects regions of gapped and elevated read coverage using a 2D pattern-matching algorithm. ProActive can detect, characterize and visualize read coverage patterns in both genomes and metagenomes. Optionally, users may provide gene predictions associated with their genome or metagenome in the form of a .gff file. In this case, ProActive will generate an additional output table containing the gene predictions found within the detected regions of gapped and elevated read coverage.

Analyzing coverage data is important because gaps or elevations in coverage can provide insight into a community’s (metagenome) or organism’s (genome) genetic activity, for example-

Elevations in read coverage can be an indication of highly active and/or abundant mobile genetic elements (MGEs). MGEs that are actively replicating or are highly abundant may generate more sequencing reads than the rest of their host bacterium’s genome thus creating a region of elevated read coverage at the element’s insertion site(s).

Gaps in read coverage can indicate genetic heterogeneity in a bacterial population. Sub-populations of bacteria with and without specific gene sequences may form gaps or dips in mapped read coverages. Genetic regions with high mutation rates may also generate gaps in read coverage.

Note: The read coverage pattern-matching employed by ProActive is only as good as the provided data. There are other non-biological situations that can create gaps or elevations in read coverage. For example, chimeric assemblies or contaminants can create odd read coverage artifacts. Because of this, ProActive is best used as a screening method to identify genetic regions for further investigation.

Installation

CRAN install

install.packages("ProActive")
library(ProActive)

GitHub install

if (!require("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

devtools::install_github("jlmaier12/ProActive")
library(ProActive)

Input data

Pileups

ProActive detects read coverage patterns using a pattern-matching algorithm that operates on pileup files. A pileup file is a file format where each row summarizes the ‘pileup’ of reads at specific genomic locations. Pileup files can be used to generate a rolling mean of read coverages and associated base pair positions which reduces data size while preserving read coverage patterns. ProActive requires that input pileups files be generated using a 100 bp window/bin size.

Pileup files can be generated by mapping sequencing reads to a metagenome or genome fasta file. Read mapping should be performed using a high minimum identity (0.97 or higher) and random mapping of ambiguous reads. The pileup files needed for ProActive are generated using the .bam files produced during read mapping.

Some read mappers, like BBMap, allow for the generation of pileup files in the bbmap.sh command with use of the bincov output with the covbinsize=100 parameter/argument. Otherwise, BBMap’s pileup.sh can convert .bam files produced by any read mapper to pileup files compatible with ProActive using the bincov output with binsize=100.

Pileup files must use a 100 bp window/bin size for the rolling mean!

The input pileup file for metagenomes must have the following format:

Dataframe with four columns:

V1 V2 V3 V4
NODE_1911 length_44214_cov_4.82142_ID_9560073 54.66 100 175075473
NODE_1911 length_44214_cov_4.82142_ID_9560073 59.13 200 175075573
NODE_1911 length_44214_cov_4.82142_ID_9560073 53.99 300 175075673
NODE_1911 length_44214_cov_4.82142_ID_9560073 54.69 400 175075773
NODE_1911 length_44214_cov_4.82142_ID_9560073 54.40 500 175075873
NODE_1911 length_44214_cov_4.82142_ID_9560073 52.11 600 175075973

Note that the format for a genome pileup will be slightly different! The third column (V3) does not restart and the fourth column (V4) starts from 0. ProActive accounts for the differences in pileup formats between genomes and metagenomes.

Users may also use the ‘sampleMetagenomePileup’ and ‘sampleGenomePileup’ files that come pre-loaded with ProActive as references for proper input file format.

gff TSV

Optionally, ProActive will accept a .gff file as additional input. The .gff file must be associated with the same metagenome or genome used to create your pileup file. The .gff file should be a TSV and should follow the same general format described here.

The input .gff file must have the following format exactly:
V1 V2 V3 V4 V5 V6 V7 V8 V9
NODE_1911 Prodigal:002006 CDS 318 1166 .
0 ID=NJKKNKEE_164175;inference=ab initio prediction:Prodigal:002006;locus_tag=NJKKNKEE_164175;product=hypothetical protein
NODE_1911 Prodigal:002006 CDS 1198 1938 .
0 ID=NJKKNKEE_164176;inference=ab initio prediction:Prodigal:002006;locus_tag=NJKKNKEE_164176;product=hypothetical protein
NODE_1911 Prodigal:002006 CDS 1938 2582 .
0 ID=NJKKNKEE_164177;inference=ab initio prediction:Prodigal:002006;locus_tag=NJKKNKEE_164177;product=hypothetical protein
NODE_1911 Prodigal:002006 CDS 2722 3561 .
0 ID=NJKKNKEE_164178;inference=ab initio prediction:Prodigal:002006;locus_tag=NJKKNKEE_164178;product=hypothetical protein
NODE_1911 Prodigal:002006 CDS 3671 4063 .
0 ID=NJKKNKEE_164179;inference=ab initio prediction:Prodigal:002006;locus_tag=NJKKNKEE_164179;product=hypothetical protein
NODE_1911 Prodigal:002006 CDS 4128 4670 .
0 ID=NJKKNKEE_164180;eC_number=1.11.1.1;Name=rbr1_24;db_xref=COG:COG1592;gene=rbr1_24;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q97FZ9;locus_tag=NJKKNKEE_164180;product=Rubrerythrin-1

(Hint- if you are using a gff file output by PROKKA, you may need to remove some unnecessary (for ProActive) lines of text at the top of the file. There are various ways one can remove these additional lines, however, a nice command-line solution is:)

grep ^COMMONID metagenomeAnnots.gff > metagenomeAnnotsForProActive.gff

The ‘COMMONID’ should be a value that your contig or genome accessions start with that is the same across all the rows. For example, the ‘COMMONID’ for the contig accessions in the sampleMetagenomegffTSV displayed above could be “NODE” since all the accessions start with “NODE”.

ProActive()

ProActive() is the main function in the ProActive R package. This function filters contigs/chunks based on length and read coverage, performs pattern-matching to detect gaps and elevations in read coverage, and identifies start and stop positions and sizes of pattern-matches.

Function components

Chunking

Currently, ProActive() can only detect one gap or elevation pattern per contig. Until ProActive() is able to detect multiple read coverage patterns per contig, we rely on ‘chunking’ large contig into smaller subsets (defined by chunkSize) so that pattern-matching can be performed on each chunk as if it were an individual contig. The chunking mechanism is what allows ProActive() to pattern-match on genomes. When contigs/genomes are chunked, they are assigned a sequential value to link chunks back together (i.e. “NODE_1_chunk_1, NODE_1_chunk_2, NODE_1_chunk_3, …). Note that the remaining ‘chunk’ of a contig/genome may not be long enough to perform pattern-matching on. Chunks too small for pattern-matching will be put in the output FilteredOut table. If a chunk splits a gap or elevation pattern in half, ProActive() will attempt to detect this and report it to the user as a ‘possible pattern-match continuity’ between contig/genome chunks. Pattern-match continuity is detected when two sequential chunks have a partial elevation/gap pattern going off the right and left side of the chunks, respectively.

Filtering

Contigs/chunks that are too short or have little to no read coverage are filtered out prior to pattern-matching. ProActive() filters out contigs/chunks that do not have at least 10x coverage on a total of 5,000 bp across the whole contig/chunk. The read coverage filtering was done in this way to avoid filtering out long contigs/chunks with small elevations in read coverage that might get removed if filtering was done with read coverage averages or medians. Additionally, contigs/chunks less than 30,000 bp are filtered out by default, however this can be changed with the minContigLength parameter which can be set to a minimum of 25,000 bp. If you would like to reduce the size of your input metagenome pileup file for ProActive(), consider pre-filtering your assembly for contigs greater than 25,000 bp prior to read mapping!

Changing pileup windowSize

The input pileup files have 100 bp windows in which the mapped read coverage is averaged over. ProActive() increases the window size prior to pattern-matching by averaging the read coverages over a value specified with windowSize. In many cases, read coverage patterns don’t require the resolution that 100 bp windows provide, however, starting with a 100 bp windowSize means the higher resolution is available if needed. While users can use the 100 bp windowSize for ProActive(), the processing time will be increased significantly and noisy data may interfere with pattern-matching. We find that the default 1,000 bp windowSize provides a nice balance between processing time and read coverage pattern resolution. If you’d like more resolution than the 1,000 bp windowSize provides, consider dropping the windowSize to 500. If you’d like fine scale read coverage resolution, consider viewing the contigs/genome with a software like Integrative Genomics Viewer IGV.

Pattern-matching

ProActive() detects read coverage patterns using a 2D pattern-matching algorithm. Several predefined patterns, described below, are built using the specific length and read coverage values of the contig/chunk being assessed. Patterns are translated across each contig/chunk in 1,000 bp sliding windows and at each translation, a pattern-match score is calculated by taking the mean absolute difference of the read coverage and the pattern values. The smaller the match-score, the better the pattern-match. After a pattern is fully translated across a contig/chunk, certain aspects of the pattern are changed (i.e. height, base, width) and translation is repeated. This process of translation and pattern re-scaling is repeated until a large number of pattern variations are tested. After pattern-matching is complete, the pattern associated with the best match-score is used for contig/chunk classification. Contigs/chunks are classified as ‘Elevation’, ‘Gap’, or ‘NoPattern’ during pattern-matching.

Elevation pattern:

The ‘elevation’ class is defined by a ‘block’ pattern which represents the read covarage pattern formed when reads from a highly active/abundant MGE map back to its respective integration site in the host/originating genome. During pattern-matching, the height (max), base (min) and width are altered and all pattern variations are translated across the contig/chunk. The block pattern width never get smaller than 10,000 bp by default, however this can be changed with the minSize parameter.

Gap pattern:

The ‘gap’ class is essentially the reverse of the ‘elevation’ class. It is defined by a gapped block pattern which represents a depression in read coverage caused by genetic heterogeneity at a specific region leading to poor read mapping. The same pattern-matching steps (alteration of pattern and pattern translation) used for the elevation pattern are used for the gap pattern.

Elevation/Gap pattern:

Elevations and gaps that trail off one side of a contig/chunk are hard to classify as the read coverage could be seen as a gap or elevation depending on how you’re looking at it. Our solution to this problem was to classify the contig/chunk as ‘gap’ if the elevated region was greater than 50% of the length of the contig/chunk. Otherwise the classification is ‘elevated’.

noPattern:

Since the best pattern-match for each contig/chunk is determined by comparing match-scores amongst all pattern-variations from all pattern classes, we needed a ‘negative control’ pattern to compare against. The ‘NoPattern’ ‘pattern’ serves as a negative control by matching to contigs/chunks with no read coverage patterns. We made two NoPattern patterns which consist of a horizontal line the same length as the contig/chunk being assessed at either the contig/chunk’s average or median read coverage. This pattern is not re-scaled or translated in any way.

Calculating elevation ratios

Every gap and elevation classification receives an elevation ratio which is simply the pattern-match’s maximum value divided by the minimum value. For elevation classifications, you can think of the elevation ratio as ‘the elevated region has X times greater read coverage than the non-elevated region’. Conversely, for gap classifications, the elevation ratio can be thought of as ‘the gapped region has X times less read coverage than the non-gapped region’.

Extracting gene annotations in elevated/gapped regions

If a .gff file is provided, then ProActive() will extract the gene annotations found within the gapped and elevated pattern-match regions and provide them to the user in an output table (GeneAnnotTable). An additional column will be added with the classification information (gap or elevation) associated with the gene annotations. If the input .gff file contains a gene ‘product’ field in the attributes column (9th column in the dataframe), then ProActive() will extract the product information into a separate column for easy visualization and filtering of annotations of interest.

Usage

Default arguments in metagenome mode:

ProActiveOutputMetagenome <- ProActive(
  pileup = sampleMetagenomePileup,
  mode = "metagenome",
  gffTSV = sampleMetagenomegffTSV
)
#> Preparing input file for pattern-matching...
#> Starting pattern-matching...
#> A quarter of the way done with pattern-matching
#> Half of the way done with pattern-matching
#> Almost done with pattern-matching!
#> Summarizing pattern-matching results
#> Finding gene predictions in elevated or gapped regions of read coverage...
#> Finalizing output
#> Execution time: 2.13secs
#> 0 contigs were filtered out based on low read coverage
#> 0 contigs were filtered out based on length (< minContigLength)
#> 
#> Elevation       Gap NoPattern 
#>         3         3         1

Default arguments in genome mode:

ProActiveOutputGenome <- ProActive(
  pileup = sampleGenomePileup,
  mode = "genome",
  gffTSV = sampleGenomegffTSV
)
#> Preparing input file for pattern-matching...
#> Starting pattern-matching...
#> A quarter of the way done with pattern-matching
#> Half of the way done with pattern-matching
#> Almost done with pattern-matching!
#> Summarizing pattern-matching results
#> Finding gene predictions in elevated or gapped regions of read coverage...
#> Finalizing output
#> Execution time: 29.12secs
#> 0 contigs were filtered out based on low read coverage
#> 0 contigs were filtered out based on length (< minContigLength)
#> 
#> Elevation       Gap NoPattern 
#>        25         3        21

Note that ProActive can be run without the gffTSV file!

Arguments/parameters

ProActive(pileup,
  mode,
  gffTSV,
  windowSize = 1000,
  minSize = 10000,
  maxSize = Inf,
  minContigLength = 30000,
  chunkSize = 50000,
  chunkContigs = FALSE,
  IncludeNoPatterns = FALSE,
  verbose = TRUE,
  saveFilesTo
)

Output

The output of ProActive is a list containing six objects:

  1. SummaryTable: A table containing all pattern-matching classifications
  2. CleanSummaryTable: A table containing only gap and elevation pattern-match classifications (i.e. noPattern classifications removed)
  3. PatternMatches: A list object containing information needed to visualize the pattern-matches in plotProActiveResults()
  4. FilteredOut: A table containing contigs/chunks that were filtered out for being too small or having too low read coverage
  5. Arguments: A list object containing arguments used for pattern-matching (windowSize, mode, chunkSize, chunkContigs)
  6. GeneAnnotTable: A table containing gene predictions associated with elevated or gapped regions in pattern-matches

Save the desired list item to a new variable using its associated name.

Metagenome results summary table:

MetagenomeCleanSummaryTable <- ProActiveOutputMetagenome$CleanSummaryTable
refName classification elevRatio startPos endPos matchSize
NODE_1911 Elevation 3.349296 1000 17000 16000
NODE_1583 Elevation 2.450013 42000 51000 9000
NODE_1884 Gap 3.319514 36000 44000 8000
NODE_1255 Gap 5.318072 1000 20000 19000
NODE_368 Gap 2.690172 26000 56000 30000
NODE_617 Elevation 1.784556 34000 82000 48000

Subset of genome results summary table:

GenomeCleanSummaryTable <- head(ProActiveOutputGenome$CleanSummaryTable)
refName classification elevRatio startPos endPos matchSize
3 NC_003197.2_chunk_3 Gap 1.609512 83000 100000 17000
4 NC_003197.2_chunk_4 Elevation 1.351684 39000 51000 12000
7 NC_003197.2_chunk_7 Elevation 1.343521 92000 100000 8000
8 NC_003197.2_chunk_8 Elevation 1.723227 84000 94000 10000
10 NC_003197.2_chunk_10 Elevation 1.887571 64000 100000 36000
12 NC_003197.2_chunk_12 Elevation 1.293144 34000 44000 10000

Subset of GeneAnnotTable for metagenome results:

MetagenomeResultsGenePredictTable <- head(ProActiveOutputMetagenome$GeneAnnotTable)
seqid source type start end score strand phase attributes geneproduct Classification
NODE_1911 Prodigal:002006 CDS 318 1166 .
0 ID=NJKKNKEE_164175;inference=ab initio prediction:Prodigal:002006;locus_tag=NJKKNKEE_164175;product=hypothetical protein hypothetical protein Elevation
NODE_1911 Prodigal:002006 CDS 1198 1938 .
0 ID=NJKKNKEE_164176;inference=ab initio prediction:Prodigal:002006;locus_tag=NJKKNKEE_164176;product=hypothetical protein hypothetical protein Elevation
NODE_1911 Prodigal:002006 CDS 1938 2582 .
0 ID=NJKKNKEE_164177;inference=ab initio prediction:Prodigal:002006;locus_tag=NJKKNKEE_164177;product=hypothetical protein hypothetical protein Elevation
NODE_1911 Prodigal:002006 CDS 2722 3561 .
0 ID=NJKKNKEE_164178;inference=ab initio prediction:Prodigal:002006;locus_tag=NJKKNKEE_164178;product=hypothetical protein hypothetical protein Elevation
NODE_1911 Prodigal:002006 CDS 3671 4063 .
0 ID=NJKKNKEE_164179;inference=ab initio prediction:Prodigal:002006;locus_tag=NJKKNKEE_164179;product=hypothetical protein hypothetical protein Elevation
NODE_1911 Prodigal:002006 CDS 4128 4670 .
0 ID=NJKKNKEE_164180;eC_number=1.11.1.1;Name=rbr1_24;db_xref=COG:COG1592;gene=rbr1_24;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:Q97FZ9;locus_tag=NJKKNKEE_164180;product=Rubrerythrin-1 Rubrerythrin-1 Elevation

Subset of GeneAnnotTable for genome results:

GenomeResultsGenePredictTable <- head(ProActiveOutputGenome$GeneAnnotTable)
seqid source type start end score strand phase attributes geneproduct Classification
NC_003197.2_chunk_3 Prodigal:002006 CDS 82564 84117 .
0 ID=LFLNNMPD_00070;eC_number=6.2.1.48;Name=caiC;db_xref=COG:COG0318;gene=caiC;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P31552;locus_tag=LFLNNMPD_00070;product=Crotonobetaine/carnitine–CoA ligase Crotonobetaine/carnitine–CoA ligase Gap
NC_003197.2_chunk_3 Prodigal:002006 CDS 84180 85397 .
0 ID=LFLNNMPD_00071;eC_number=2.8.3.21;Name=caiB;db_xref=COG:COG1804;gene=caiB;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P31572;locus_tag=LFLNNMPD_00071;product=L-carnitine CoA-transferase L-carnitine CoA-transferase Gap
NC_003197.2_chunk_3 Prodigal:002006 CDS 85509 86651 .
0 ID=LFLNNMPD_00072;eC_number=1.3.8.13;Name=caiA_1;db_xref=COG:COG1960;gene=caiA_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P60584;locus_tag=LFLNNMPD_00072;product=Crotonobetainyl-CoA reductase Crotonobetainyl-CoA reductase Gap
NC_003197.2_chunk_3 Prodigal:002006 CDS 86686 88203 .
0 ID=LFLNNMPD_00073;Name=caiT;db_xref=COG:COG1292;gene=caiT;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P31553;locus_tag=LFLNNMPD_00073;product=L-carnitine/gamma-butyrobetaine antiporter L-carnitine/gamma-butyrobetaine antiporter Gap
NC_003197.2_chunk_3 Prodigal:002006 CDS 88684 89454 .
0 ID=LFLNNMPD_00074;Name=fixA_1;db_xref=COG:COG2086;gene=fixA_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P60566;locus_tag=LFLNNMPD_00074;product=Protein FixA Protein FixA Gap
NC_003197.2_chunk_3 Prodigal:002006 CDS 89470 90411 .
0 ID=LFLNNMPD_00075;Name=fixB_1;db_xref=COG:COG2025;gene=fixB_1;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P31574;locus_tag=LFLNNMPD_00075;product=Protein FixB Protein FixB Gap

plotProActiveResults()

plotProActiveResults() allows users to visualize both the read coverage and the pattern-match associated with each gap or elevation classification.

Function components

Re-building pattern-matches

The ProActive() output contains information needed to re-build each pattern-match used for classification. To re-build a complete pattern-match for visualization, plotProActiveResults() uses the pattern-match’s minimum and maximum values and the start and stop positions.

Plotting read coverage and associated pattern-matches

By default, the read coverage is plotted for each contig/chunk classified as having a gap or elevation in read coverage. If you wish to see the read coverage for noPattern classifications, be sure to set IncludeNoPatterns = TRUE when running ProActive(). The pattern-match associated with each classification is overlaid on the coverage plot.

Usage

Default arguments:

MetagenomeResultsPlots <- plotProActiveResults(
  pileup = sampleMetagenomePileup,
  ProActiveResults = ProActiveOutputMetagenome
)

GenomeResultsPlots <- plotProActiveResults(
  pileup = sampleGenomePileup,
  ProActiveResults = ProActiveOutputGenome
)

Note- There is no need to set ‘genome’ or ‘metagenome’ mode. plotProActiveResults() will get this information from the ProActive() output.

Arguments/parameters

plotProActiveResults(pileup,
  ProActiveResults,
  elevFilter,
  saveFilesTo
)

Output

The output of plotProActiveResults is a list of ggplot objects.

View select metagenome plots

MetagenomeResultsPlots

View select genome plots

Notice the ‘chunk’ information in the plot titles

GenomeResultsPlots$NC_003197.2_chunk_36

GenomeResultsPlots$NC_003197.2_chunk_8

Session Information

sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4      stringr_1.5.1    ggplot2_3.5.1    kableExtra_1.4.0
#> [5] ProActive_0.0.2 
#> 
#> loaded via a namespace (and not attached):
#>  [1] bslib_0.8.0       compiler_4.2.1    pillar_1.9.0      jquerylib_0.1.4  
#>  [5] tools_4.2.1       digest_0.6.37     viridisLite_0.4.2 jsonlite_1.8.9   
#>  [9] evaluate_1.0.1    lifecycle_1.0.4   tibble_3.2.1      gtable_0.3.6     
#> [13] pkgconfig_2.0.3   rlang_1.1.4       cli_3.6.2         rstudioapi_0.17.1
#> [17] yaml_2.3.10       xfun_0.49         fastmap_1.2.0     withr_3.0.2      
#> [21] xml2_1.3.6        knitr_1.49        systemfonts_1.1.0 generics_0.1.3   
#> [25] vctrs_0.6.5       sass_0.4.9        grid_4.2.1        tidyselect_1.2.1 
#> [29] svglite_2.1.3     glue_1.8.0        R6_2.5.1          fansi_1.0.6      
#> [33] rmarkdown_2.29    farver_2.1.2      magrittr_2.0.3    scales_1.3.0     
#> [37] htmltools_0.5.8.1 colorspace_2.1-1  labeling_0.4.3    utf8_1.2.4       
#> [41] stringi_1.8.4     munsell_0.5.1     cachem_1.1.0