There are several “helper” functions which can simplify the definition of complex patterns. First we define some functions that will help us display the patterns:
one.pattern <- function(pat){
  if(is.character(pat)){
    pat
  }else{
    nc::var_args_list(pat)[["pattern"]]
  }
}
show.patterns <- function(...){
  L <- list(...)
  str(lapply(L, one.pattern))
}
nc::field for reducing repetition
The nc::field function can be used to avoid repetition when defining patterns of the form variable: value. The example below shows three (mostly) equivalent ways to write a regex that captures the text after the colon and space; the captured text is stored in the variable group or output column:
show.patterns(
  "variable: (?<variable>.*)",      #repetitive regex string
  list("variable: ", variable=".*"),#repetitive nc R code
  nc::field("variable", ": ", ".*"))#helper function avoids repetition
#> List of 3
#> $ : chr "variable: (?<variable>.*)"
#> $ : chr "(?:variable: (.*))"
#> $ : chr "(?:variable: (?:(.*)))"
Note that the first version above has a named capture group, whereas the second and third patterns generated by nc have an un-named capture group and some non-capturing groups (but they all match the same pattern).
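As a quick check that the three forms capture the same text, the sketch below applies the three pattern strings from the output above using base R only (nc is not needed here). The subject string and the helper name `captured` are made up for illustration.

```r
# Pattern strings copied from the output above.
patterns <- c(
  "variable: (?<variable>.*)",
  "(?:variable: (.*))",
  "(?:variable: (?:(.*)))")
subject <- "variable: some value" # hypothetical subject
# Hypothetical helper: text of the first capture group for one pattern.
captured <- function(pattern){
  m <- regmatches(subject, regexec(pattern, subject, perl=TRUE))[[1]]
  unname(m[2]) # element 1 is the whole match, element 2 is the first group
}
res <- sapply(patterns, captured, USE.NAMES=FALSE)
res
```

Each element of `res` should be "some value", whether the group was named or not.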
Another example:
show.patterns(
  "Alignment (?<Alignment>[0-9]+)",
  list("Alignment ", Alignment="[0-9]+"),
  nc::field("Alignment", " ", "[0-9]+"))
#> List of 3
#> $ : chr "Alignment (?<Alignment>[0-9]+)"
#> $ : chr "(?:Alignment ([0-9]+))"
#> $ : chr "(?:Alignment (?:([0-9]+)))"
Another example:
show.patterns(
  "Chromosome:\t+(?<Chromosome>.*)",
  list("Chromosome:\t+", Chromosome=".*"),
  nc::field("Chromosome", ":\t+", ".*"))
#> List of 3
#> $ : chr "Chromosome:\t+(?<Chromosome>.*)"
#> $ : chr "(?:Chromosome:\t+(.*))"
#> $ : chr "(?:Chromosome:\t+(?:(.*)))"
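To see a field pattern applied to data, here is a minimal sketch, assuming the nc package is installed and using nc::capture_first_vec to do the matching. The subject strings are hypothetical.

```r
# Hypothetical subject lines; each one must match, because
# capture_first_vec stops with an error on a non-matching subject by default.
subject <- c("Alignment 10", "Alignment 23")
result <- nc::capture_first_vec(
  subject,
  nc::field("Alignment", " ", "[0-9]+"))
# result is a data.table with one character column named Alignment.
result$Alignment
```

The field name is typed once and serves both as the literal text to match and as the name of the output column.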
nc::quantifier for fewer parentheses
Another helper function is nc::quantifier, which makes patterns easier to read by reducing the number of parentheses required to define sub-patterns with quantifiers. For example, all three patterns below create an optional non-capturing group which contains a named capture group:
show.patterns(
  "(?:-(?<chromEnd>[0-9]+))?",                #regex string
  list(list("-", chromEnd="[0-9]+"), "?"),    #nc pattern using lists
  nc::quantifier("-", chromEnd="[0-9]+", "?"))#quantifier helper function
#> List of 3
#> $ : chr "(?:-(?<chromEnd>[0-9]+))?"
#> $ : chr "(?:(?:-([0-9]+))?)"
#> $ : chr "(?:(?:-([0-9]+))?)"
Another example with a named capture group inside an optional non-capturing group:
show.patterns(
  "(?: (?<name>[^,}]+))?",
  list(list(" ", name="[^,}]+"), "?"),
  nc::quantifier(" ", name="[^,}]+", "?"))
#> List of 3
#> $ : chr "(?: (?<name>[^,}]+))?"
#> $ : chr "(?:(?: ([^,}]+))?)"
#> $ : chr "(?:(?: ([^,}]+))?)"
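The optional group is easiest to appreciate on subjects where it is sometimes absent. The sketch below assumes nc is installed and uses nc::capture_first_vec; the subject strings and the chrom and chromStart groups are illustrative additions, not part of the examples above.

```r
# Hypothetical subjects: the second one has no "-end" part.
subject <- c("chr10:213054000-213055000", "chrM:111000")
result <- nc::capture_first_vec(
  subject,
  chrom="chr.*?",
  ":",
  chromStart="[0-9]+",
  nc::quantifier("-", chromEnd="[0-9]+", "?")) # optional "-end"
result
```

Both subjects match because the quantified group is optional; only the first row has digits captured in chromEnd.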
nc::alternatives for simplified alternation
We also provide a helper function for defining regex patterns with alternation. The following three patterns are equivalent:
show.patterns(
  "(?:(?<first>bar+)|(?<second>fo+))",
  list(first="bar+", "|", second="fo+"),
  nc::alternatives(first="bar+", second="fo+"))
#> List of 3
#> $ : chr "(?:(?<first>bar+)|(?<second>fo+))"
#> $ : chr "(?:(bar+)|(fo+))"
#> $ : chr "(?:(bar+)|(fo+))"
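As a usage sketch (assuming nc is installed, with hypothetical subject strings), the alternation pattern can be passed to nc::capture_first_vec, which returns one column per named alternative:

```r
# Hypothetical subjects: one matches each alternative.
subject <- c("barrr", "foo")
result <- nc::capture_first_vec(
  subject,
  nc::alternatives(first="bar+", second="fo+"))
result
```

For each subject, only the column of the alternative that matched holds captured text, so the first row fills `first` and the second row fills `second`.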