Helper functions

2025-01-21

There are several “helper” functions which can simplify the definition of complex patterns. First we define some functions that will help us display the patterns:

one.pattern <- function(pat){
  if(is.character(pat)){
    pat
  }else{
    nc::var_args_list(pat)[["pattern"]]
  }
}
show.patterns <- function(...){
  L <- list(...)
  str(lapply(L, one.pattern))
}

nc::field for reducing repetition

The nc::field function can be used to avoid repetition when defining patterns of the form variable: value. The example below shows three (mostly) equivalent ways to write a regex that captures the text after the colon and space; the captured text is stored in the variable group or output column:

show.patterns(
  "variable: (?<variable>.*)",      #repetitive regex string
  list("variable: ", variable=".*"),#repetitive nc R code
  nc::field("variable", ": ", ".*"))#helper function avoids repetition
#> List of 3
#>  $ : chr "variable: (?<variable>.*)"
#>  $ : chr "(?:variable: (.*))"
#>  $ : chr "(?:variable: (?:(.*)))"

Note that the first version above has a named capture group, whereas the second and third patterns, generated by nc, have an un-named capture group and some non-capturing groups. All three match the same subject text; nc keeps track of the group name in R code rather than in the regex string.
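To make the difference between named and un-named groups concrete, here is a minimal base R sketch. It applies the first two regex strings shown above to a made-up subject with regexpr(perl=TRUE); the helper first.group is hypothetical and only exists for this illustration.

```r
# Base R only; the subject string is made up for illustration.
subject <- "variable: value"
named <- regexpr("variable: (?<variable>.*)", subject, perl=TRUE)
unnamed <- regexpr("(?:variable: (.*))", subject, perl=TRUE)

# Hypothetical helper: text of the first capture group of a regexpr() result.
first.group <- function(m){
  start <- attr(m, "capture.start")[1, 1]
  len <- attr(m, "capture.length")[1, 1]
  substring(subject, start, start + len - 1L)
}

attr(named, "capture.names")   # group name comes from the regex string
attr(unnamed, "capture.names") # empty string: the regex carries no name
first.group(named)             # same captured text ...
first.group(unnamed)           # ... for both patterns
```

Both patterns capture the same text; only the place where the group name lives differs.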

Another example:

show.patterns(
  "Alignment (?<Alignment>[0-9]+)",
  list("Alignment ", Alignment="[0-9]+"),
  nc::field("Alignment", " ", "[0-9]+"))
#> List of 3
#>  $ : chr "Alignment (?<Alignment>[0-9]+)"
#>  $ : chr "(?:Alignment ([0-9]+))"
#>  $ : chr "(?:Alignment (?:([0-9]+)))"

Another example:

show.patterns(
  "Chromosome:\t+(?<Chromosome>.*)",
  list("Chromosome:\t+", Chromosome=".*"),
  nc::field("Chromosome", ":\t+", ".*"))
#> List of 3
#>  $ : chr "Chromosome:\t+(?<Chromosome>.*)"
#>  $ : chr "(?:Chromosome:\t+(.*))"
#>  $ : chr "(?:Chromosome:\t+(?:(.*)))"
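
The patterns above are only displayed, not applied. As a minimal sketch of how a field pattern might be used for matching, the code below passes the Chromosome field to nc::capture_first_vec. It assumes the nc package is installed, and the subject strings are made up for illustration.

```r
# Assumes the nc package is installed; subjects are hypothetical.
subject <- c("Chromosome:\t1", "Chromosome:\t\tX")
field.dt <- nc::capture_first_vec(
  subject,
  nc::field("Chromosome", ":\t+", ".*"))
# The field name is used both in the regex and as the output column name.
print(field.dt)
```

The result should have a single character column named Chromosome, holding the text after the tabs.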

nc::quantifier for fewer parentheses

Another helper function is nc::quantifier, which makes patterns easier to read by reducing the number of parentheses required to define sub-patterns with quantifiers. For example, all three patterns below create an optional non-capturing group which contains a named capture group:

show.patterns(
  "(?:-(?<chromEnd>[0-9]+))?",                #regex string
  list(list("-", chromEnd="[0-9]+"), "?"),    #nc pattern using lists
  nc::quantifier("-", chromEnd="[0-9]+", "?"))#quantifier helper function
#> List of 3
#>  $ : chr "(?:-(?<chromEnd>[0-9]+))?"
#>  $ : chr "(?:(?:-([0-9]+))?)"
#>  $ : chr "(?:(?:-([0-9]+))?)"

Another example with a named capture group inside an optional non-capturing group:

show.patterns(
  "(?: (?<name>[^,}]+))?",
  list(list(" ", name="[^,}]+"), "?"),
  nc::quantifier(" ", name="[^,}]+", "?"))
#> List of 3
#>  $ : chr "(?: (?<name>[^,}]+))?"
#>  $ : chr "(?:(?: ([^,}]+))?)"
#>  $ : chr "(?:(?: ([^,}]+))?)"
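
To illustrate what the optional group does during matching, here is a small sketch that embeds the chromEnd quantifier in a larger pattern. It assumes the nc package is installed; the subjects and the chrom and chromStart groups are assumptions added for this example.

```r
# Assumes the nc package is installed; subjects are hypothetical.
range.pattern <- list(
  chrom="chr[^:]+",
  ":",
  chromStart="[0-9]+",
  nc::quantifier("-", chromEnd="[0-9]+", "?")) # "-end" part is optional
range.dt <- nc::capture_first_vec(
  c("chr10:213054000-213055000", "chrX:110"),
  range.pattern)
print(range.dt)
```

For the second subject, which has no end position, the chromEnd column should contain an empty string.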

nc::alternatives for simplified alternation

We also provide a helper function for defining regex patterns with alternation. The following three patterns are equivalent:

show.patterns(
  "(?:(?<first>bar+)|(?<second>fo+))",
  list(first="bar+", "|", second="fo+"),
  nc::alternatives(first="bar+", second="fo+"))
#> List of 3
#>  $ : chr "(?:(?<first>bar+)|(?<second>fo+))"
#>  $ : chr "(?:(bar+)|(fo+))"
#>  $ : chr "(?:(bar+)|(fo+))"
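
As a sketch of how such an alternation behaves when matching, the code below applies the pattern to two made-up subjects, one for each alternative. It assumes the nc package is installed.

```r
# Assumes the nc package is installed; subjects are hypothetical.
alt.dt <- nc::capture_first_vec(
  c("barrr", "foo"),
  nc::alternatives(first="bar+", second="fo+"))
# Each row fills the column of the alternative that matched;
# the column of the other alternative is left empty.
print(alt.dt)
```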

nc::alternatives_with_shared_groups for alternatives with identical named sub-pattern groups

Sometimes each alternative is just a re-arrangement of the same sub-patterns. For example, consider the following subjects, each of which is a date in one of two formats.

subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")

In each of the two formats, the month consists of three lower-case letters, the day consists of two digits, and the year consists of four digits. Is there a single pattern that can match each of these subjects? Yes, such a pattern can be defined using the code below:

pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer),
  list(american=list(month, " ", day, ", ", year)),
  list(european=list(day, " ", month, " ", year)))

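In the code above, we used nc::alternatives_with_shared_groups, which requires two kinds of arguments: named arguments (month, day, year), which define the sub-pattern groups shared by the alternatives, and un-named arguments (the last two), each of which defines one alternative in terms of those group names.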

The pattern can be used for matching, and the result is a data table with one column for each unique name:

(match.dt <- nc::capture_first_vec(subject.vec, pattern))
#>        american  month   day  year    european
#>          <char> <char> <int> <int>      <char>
#> 1: mar 17, 1983    mar    17  1983            
#> 2:                 sep    26  2017 26 sep 2017
#> 3:                 mar    17  1984 17 mar 1984

After having parsed the dates into the month, day, and year columns, we can add a date column:

Sys.setlocale(locale="C")#to recognize months in English.
#> [1] "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=fr_FR.UTF-8;LC_PAPER=fr_FR.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=fr_FR.UTF-8;LC_IDENTIFICATION=C"
match.dt[, date := data.table::as.IDate(
  paste(month, day, year), format="%b %d %Y")]
print(match.dt, class=TRUE)
#>        american  month   day  year    european       date
#>          <char> <char> <int> <int>      <char>     <IDat>
#> 1: mar 17, 1983    mar    17  1983             1983-03-17
#> 2:                 sep    26  2017 26 sep 2017 2017-09-26
#> 3:                 mar    17  1984 17 mar 1984 1984-03-17
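
Because only the column of the matching alternative is non-empty, the american and european columns can also be used to recover which format each subject had. The sketch below rebuilds the pattern from above so that it is self-contained; the column name style is an assumption chosen for this illustration, and the nc and data.table packages are assumed to be installed.

```r
# Assumes nc and data.table are installed; "style" is a hypothetical name.
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer),
  list(american=list(month, " ", day, ", ", year)),
  list(european=list(day, " ", month, " ", year)))
match.dt <- nc::capture_first_vec(subject.vec, pattern)
# A non-empty american column means the american alternative matched.
match.dt[, style := ifelse(nzchar(american), "american", "european")]
print(match.dt[, .(month, day, year, style)])
```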

Another example is parsing given and family names, in two different formats:

nc::capture_first_vec(
  c("Toby Dylan Hocking","Hocking, Toby Dylan"),
  nc::alternatives_with_shared_groups(
    family="[A-Z][a-z]+",
    given="[^,]+",
    list(given_first=list(given, " ", family)),
    list(family_first=list(family, ", ", given))
  )
)
#>           given_first      given  family        family_first
#>                <char>     <char>  <char>              <char>
#> 1: Toby Dylan Hocking Toby Dylan Hocking                    
#> 2:                    Toby Dylan Hocking Hocking, Toby Dylan