Adding patient demographics

1.1 Introduction
1.2 Adding characteristics to OMOP CDM tables
1.3 Adding characteristics to a cohort table
1.4 Getting multiple characteristics at once

1.1 Introduction

The OMOP CDM is a person-centric model. The person table contains records that uniquely identify each individual along with some of their demographic information. Below we create a mock CDM reference which, as is standard, has a person table containing fields that indicate an individual’s date of birth, gender, race, and ethnicity. Each of these, except for date of birth, is represented by a concept ID (and, as the person table contains one record per person, these fields are treated as time-invariant).

library(PatientProfiles)
library(duckdb)
library(dplyr)

cdm <- mockPatientProfiles(numberIndividuals = 10000)

cdm$person |>
  dplyr::glimpse()

## Rows: ??
## Columns: 5
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ person_id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ gender_concept_id    <int> 8507, 8507, 8532, 8532, 8507, 8532, 8532, 8532, 8…
## $ year_of_birth        <int> 1957, 1966, 1945, 1947, 1948, 1956, 1926, 1956, 1…
## $ race_concept_id      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ethnicity_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

As well as the person table, every CDM reference will include an observation period table. This table contains the spans of time during which an individual is considered to be under observation. Individuals can have multiple observation periods, but these cannot overlap.

cdm$observation_period |>
  dplyr::glimpse()

## Rows: ??
## Columns: 5
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ person_id                     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ observation_period_start_date <date> 1957-01-01, 1966-01-01, 1945-01-01, 194…
## $ observation_period_end_date   <date> 1997-11-22, 1997-08-14, 1974-08-18, 195…
## $ period_type_concept_id        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ observation_period_id         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
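
The non-overlap rule can be checked directly. The following is a minimal base R sketch on a small, made-up data frame (the rows are hypothetical and not taken from the mock CDM): it orders the periods by person and start date and flags any period that starts on or before the end of the same person's previous period.

```r
# Hypothetical observation periods for two people (not from the mock CDM)
obs <- data.frame(
  person_id = c(1L, 1L, 2L),
  observation_period_start_date = as.Date(c("2000-01-01", "2005-06-01", "2010-01-01")),
  observation_period_end_date = as.Date(c("2004-12-31", "2009-12-31", "2015-12-31"))
)

# Order by person and start date so consecutive rows can be compared
obs <- obs[order(obs$person_id, obs$observation_period_start_date), ]
n <- nrow(obs)

# TRUE where a row belongs to the same person as the row before it
same_person <- c(FALSE, obs$person_id[-1] == obs$person_id[-n])

# End date of the previous row (NA for the first row)
previous_end <- c(as.Date(NA), obs$observation_period_end_date[-n])

# A period overlaps if it starts on or before the previous period's end
overlaps <- same_person & obs$observation_period_start_date <= previous_end

any(overlaps)
# FALSE
```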

When performing analyses we will often be interested in working with the person and observation period tables to identify individuals’ characteristics on some date of interest. PatientProfiles provides a number of functions that can help us do this.

1.2 Adding characteristics to OMOP CDM tables

Let’s say we’re working with the condition occurrence table.

cdm$condition_occurrence |>
  glimpse()

## Rows: ??
## Columns: 6
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ person_id                 <int> 8662, 5572, 4371, 3898, 2902, 4646, 5185, 14…
## $ condition_start_date      <date> 1942-03-08, 1964-07-09, 1971-09-10, 1944-07…
## $ condition_end_date        <date> 1945-05-14, 1970-08-16, 1973-07-25, 1953-06…
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id      <int> 4, 4, 1, 10, 8, 1, 2, 3, 6, 7, 1, 6, 3, 3, 3…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

This table contains diagnoses of individuals and we might, for example, want to identify their age on their date of diagnosis. This involves linking back to the person table which contains their date of birth (split across three different columns). PatientProfiles provides a simple function for this. addAge() will add a new column to the table containing each patient’s age relative to the specified index date.

cdm$condition_occurrence <- cdm$condition_occurrence |>
  addAge(indexDate = "condition_start_date")

cdm$condition_occurrence |>
  glimpse()

## Rows: ??
## Columns: 7
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ person_id                 <int> 8662, 5572, 4371, 3898, 2902, 4646, 5185, 14…
## $ condition_start_date      <date> 1942-03-08, 1964-07-09, 1971-09-10, 1944-07…
## $ condition_end_date        <date> 1945-05-14, 1970-08-16, 1973-07-25, 1953-06…
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id      <int> 4, 4, 1, 10, 8, 1, 2, 3, 6, 7, 1, 6, 3, 3, 3…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 16, 5, 20, 18, 4, 5, 41, 8, 11, 15, 28, 14, …
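
To make explicit what "age relative to an index date" means, here is a small base R sketch that computes completed years between a date of birth and an index date. It is only an illustration: addAge() works on database tables and its internals may differ. The helper age_at() and the birth dates used are hypothetical.

```r
# Completed years between a birth date and an index date (illustrative helper,
# not part of PatientProfiles)
age_at <- function(birth_date, index_date) {
  b <- as.POSIXlt(birth_date)
  i <- as.POSIXlt(index_date)
  years <- i$year - b$year
  # Subtract one year if the birthday has not yet been reached in the index year
  before_birthday <- (i$mon < b$mon) | (i$mon == b$mon & i$mday < b$mday)
  as.integer(years - before_birthday)
}

# Hypothetical person born on 1 January 1926, diagnosed on 8 March 1942
age_at(as.Date("1926-01-01"), as.Date("1942-03-08"))
# 16

# The day before an 18th birthday still counts as 17
age_at(as.Date("2000-06-15"), as.Date("2018-06-14"))
# 17
```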

As well as calculating age, we can also create age groups at the same time. Here we create three age groups: those aged 0 to 17, those 18 to 65, and those 66 or older.

cdm$condition_occurrence <- cdm$condition_occurrence |>
  addAge(
    indexDate = "condition_start_date",
    ageGroup = list(
      "0 to 17" = c(0, 17),
      "18 to 65" = c(18, 65),
      ">= 66" = c(66, Inf)
    )
  )

cdm$condition_occurrence |>
  glimpse()

## Rows: ??
## Columns: 8
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ person_id                 <int> 8662, 5572, 4371, 3898, 2902, 4646, 5185, 14…
## $ condition_start_date      <date> 1942-03-08, 1964-07-09, 1971-09-10, 1944-07…
## $ condition_end_date        <date> 1945-05-14, 1970-08-16, 1973-07-25, 1953-06…
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id      <int> 4, 4, 1, 10, 8, 1, 2, 3, 6, 7, 1, 6, 3, 3, 3…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 16, 5, 20, 18, 4, 5, 41, 8, 11, 15, 28, 14, …
## $ age_group                 <chr> "0 to 17", "0 to 17", "18 to 65", "18 to 65"…
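
The ageGroup argument is a named list of lower and upper bounds. As a rough sketch of how such a list maps ages to labels, assuming both bounds are inclusive (as the "0 to 17" and "18 to 65" labels suggest), the same assignment can be written in base R. The helper assign_age_group() is hypothetical and not part of PatientProfiles.

```r
# Same structure as the ageGroup argument used above
age_groups <- list(
  "0 to 17" = c(0, 17),
  "18 to 65" = c(18, 65),
  ">= 66" = c(66, Inf)
)

# Illustrative helper: label each age with the group whose bounds contain it
assign_age_group <- function(age, groups) {
  out <- rep(NA_character_, length(age))
  for (label in names(groups)) {
    bounds <- groups[[label]]
    # Both bounds are treated as inclusive (an assumption of this sketch)
    out[!is.na(age) & age >= bounds[1] & age <= bounds[2]] <- label
  }
  out
}

assign_age_group(c(16L, 5L, 20L, 18L, 66L), age_groups)
# "0 to 17"  "0 to 17"  "18 to 65" "18 to 65" ">= 66"
```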

By default, the new column is called “age” and is calculated using all the information on date of birth available in the person table. We can, however, change these defaults. Here, for example, we impose that the month of birth is January and the day of birth is the 1st for all individuals.

cdm$condition_occurrence <- cdm$condition_occurrence |>
  addAge(
    indexDate = "condition_start_date",
    ageName = "age_from_year_of_birth",
    ageMissingMonth = 1,
    ageMissingDay = 1,
    ageImposeMonth = TRUE,
    ageImposeDay = TRUE
  )

cdm$condition_occurrence |>
  glimpse()

## Rows: ??
## Columns: 9
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ person_id                 <int> 8662, 5572, 4371, 3898, 2902, 4646, 5185, 14…
## $ condition_start_date      <date> 1942-03-08, 1964-07-09, 1971-09-10, 1944-07…
## $ condition_end_date        <date> 1945-05-14, 1970-08-16, 1973-07-25, 1953-06…
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id      <int> 4, 4, 1, 10, 8, 1, 2, 3, 6, 7, 1, 6, 3, 3, 3…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 16, 5, 20, 18, 4, 5, 41, 8, 11, 15, 28, 14, …
## $ age_group                 <chr> "0 to 17", "0 to 17", "18 to 65", "18 to 65"…
## $ age_from_year_of_birth    <int> 16, 5, 20, 18, 4, 5, 41, 8, 11, 15, 28, 14, …
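
In the output above the two age columns are identical, which is what we would expect if the mock person table holds only year_of_birth (as the earlier glimpse suggests). To see the effect of imposing the month and day, the base R sketch below uses a hypothetical person with a full date of birth; the helper completed_years() is illustrative and not part of PatientProfiles.

```r
# Illustrative helper: completed years between two dates
completed_years <- function(birth_date, index_date) {
  b <- as.POSIXlt(birth_date)
  i <- as.POSIXlt(index_date)
  before_birthday <- (i$mon < b$mon) | (i$mon == b$mon & i$mday < b$mday)
  as.integer(i$year - b$year - before_birthday)
}

birth <- as.Date("1960-09-20") # hypothetical full date of birth
index <- as.Date("2000-03-01") # hypothetical index date

# Keep only the year of birth and impose 1 January
imposed_birth <- as.Date(paste0(format(birth, "%Y"), "-01-01"))

completed_years(birth, index)
# 39
completed_years(imposed_birth, index)
# 40
```

Imposing 1 January can therefore make someone appear up to a year older than their full date of birth would imply.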

As well as age at diagnosis, we might also want to identify patients’ sex. PatientProfiles provides the addSex() function, which will add this for us. Because sex is treated as time-invariant, we do not have to specify an index date.

cdm$condition_occurrence <- cdm$condition_occurrence |>
  addSex()

cdm$condition_occurrence |>
  glimpse()

## Rows: ??
## Columns: 10
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ person_id                 <int> 8662, 5572, 4371, 3898, 2902, 4646, 5185, 14…
## $ condition_start_date      <date> 1942-03-08, 1964-07-09, 1971-09-10, 1944-07…
## $ condition_end_date        <date> 1945-05-14, 1970-08-16, 1973-07-25, 1953-06…
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ condition_concept_id      <int> 4, 4, 1, 10, 8, 1, 2, 3, 6, 7, 1, 6, 3, 3, 3…
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 16, 5, 20, 18, 4, 5, 41, 8, 11, 15, 28, 14, …
## $ age_group                 <chr> "0 to 17", "0 to 17", "18 to 65", "18 to 65"…
## $ age_from_year_of_birth    <int> 16, 5, 20, 18, 4, 5, 41, 8, 11, 15, 28, 14, …
## $ sex                       <chr> "Female", "Male", "Male", "Male", "Male", "M…

Similarly, we could also identify whether an individual was in observation at the time of their diagnosis (i.e. had an observation period that overlaps with their diagnosis date), as well as identifying how much prior observation time they had on this date and how much they have following it.

cdm$condition_occurrence <- cdm$condition_occurrence |>
  addInObservation(indexDate = "condition_start_date") |>
  addPriorObservation(indexDate = "condition_start_date") |>
  addFutureObservation(indexDate = "condition_start_date")

cdm$condition_occurrence |>
  glimpse()

## Rows: ??
## Columns: 13
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ person_id                 <int> 8662, 5572, 4371, 3898, 2902, 5185, 1473, 71…
## $ condition_start_date      <date> 1942-03-08, 1964-07-09, 1971-09-10, 1944-07…
## $ condition_end_date        <date> 1945-05-14, 1970-08-16, 1973-07-25, 1953-06…
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 7, 8, 9, 10, 13, 14, 16, 17, …
## $ condition_concept_id      <int> 4, 4, 1, 10, 8, 2, 3, 6, 7, 3, 3, 10, 3, 7, …
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 16, 5, 20, 18, 4, 41, 8, 11, 15, 20, 33, 12,…
## $ age_group                 <chr> "0 to 17", "0 to 17", "18 to 65", "18 to 65"…
## $ age_from_year_of_birth    <int> 16, 5, 20, 18, 4, 41, 8, 11, 15, 20, 33, 12,…
## $ sex                       <chr> "Female", "Male", "Male", "Male", "Male", "F…
## $ in_observation            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ prior_observation         <int> 5910, 2016, 7557, 6764, 1804, 15223, 3154, 4…
## $ future_observation        <int> 4460, 14161, 7005, 4216, 3153, 603, 11242, 2…

For these functions, which work with information from the observation period table, it is important to note that the results are based on the observation period in which the index date falls. Moreover, if a patient is not under observation on the specified date, addPriorObservation() and addFutureObservation() will return NA.
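
The relationship between the three columns can be sketched in base R. This is an illustration under stated assumptions rather than the package's implementation: the helper observation_time() is hypothetical, and the observation period dates are chosen by us so that the first row reproduces the first values shown above (5910 and 4460 days); the actual period for that person is not shown here.

```r
# Illustrative helper: observation status and time relative to an index date,
# given the single observation period being considered for each row
observation_time <- function(index, obs_start, obs_end) {
  in_obs <- index >= obs_start & index <= obs_end
  data.frame(
    in_observation = as.integer(in_obs),
    # Days from the start of the period to the index date, NA if not observed
    prior_observation = ifelse(in_obs, as.integer(index - obs_start), NA_integer_),
    # Days from the index date to the end of the period, NA if not observed
    future_observation = ifelse(in_obs, as.integer(obs_end - index), NA_integer_)
  )
}

res <- observation_time(
  index = as.Date(c("1942-03-08", "1960-01-01")),
  obs_start = as.Date(c("1926-01-01", "1926-01-01")), # assumed period start
  obs_end = as.Date(c("1954-05-24", "1954-05-24"))    # assumed period end
)
res
# Row 1 is in observation: prior_observation 5910, future_observation 4460
# Row 2 falls outside the period: in_observation 0, both times NA
```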

By default, addInObservation() checks whether someone was in observation on the index date itself. We can, however, widen this to a window of time around that date. For example, here we add a variable indicating whether someone was in observation from 180 days before the index date to 30 days after it.

cdm$condition_occurrence |>
  addInObservation(
    indexDate = "condition_start_date",
    window = c(-180, 30)
  ) |>
  glimpse()

## Rows: ??
## Columns: 13
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ person_id                 <int> 8662, 5572, 4371, 3898, 2902, 5185, 1473, 71…
## $ condition_start_date      <date> 1942-03-08, 1964-07-09, 1971-09-10, 1944-07…
## $ condition_end_date        <date> 1945-05-14, 1970-08-16, 1973-07-25, 1953-06…
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 7, 8, 9, 10, 13, 14, 16, 17, …
## $ condition_concept_id      <int> 4, 4, 1, 10, 8, 2, 3, 6, 7, 3, 3, 10, 3, 7, …
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 16, 5, 20, 18, 4, 41, 8, 11, 15, 20, 33, 12,…
## $ age_group                 <chr> "0 to 17", "0 to 17", "18 to 65", "18 to 65"…
## $ age_from_year_of_birth    <int> 16, 5, 20, 18, 4, 41, 8, 11, 15, 20, 33, 12,…
## $ sex                       <chr> "Female", "Male", "Male", "Male", "Male", "F…
## $ prior_observation         <int> 5910, 2016, 7557, 6764, 1804, 15223, 3154, 4…
## $ future_observation        <int> 4460, 14161, 7005, 4216, 3153, 603, 11242, 2…
## $ in_observation            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

We can also specify a window and require only that an individual is in observation for some of the days within it, rather than all of them. Here we add a variable indicating whether the individual was in observation at any point from a year after the index date onwards.

cdm$condition_occurrence |>
  addInObservation(
    indexDate = "condition_start_date",
    window = c(365, Inf),
    completeInterval = FALSE
  ) |>
  glimpse()

## Rows: ??
## Columns: 13
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ person_id                 <int> 8662, 5572, 4371, 3898, 2902, 5185, 1473, 71…
## $ condition_start_date      <date> 1942-03-08, 1964-07-09, 1971-09-10, 1944-07…
## $ condition_end_date        <date> 1945-05-14, 1970-08-16, 1973-07-25, 1953-06…
## $ condition_occurrence_id   <int> 1, 2, 3, 4, 5, 7, 8, 9, 10, 13, 14, 16, 17, …
## $ condition_concept_id      <int> 4, 4, 1, 10, 8, 2, 3, 6, 7, 3, 3, 10, 3, 7, …
## $ condition_type_concept_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ age                       <int> 16, 5, 20, 18, 4, 41, 8, 11, 15, 20, 33, 12,…
## $ age_group                 <chr> "0 to 17", "0 to 17", "18 to 65", "18 to 65"…
## $ age_from_year_of_birth    <int> 16, 5, 20, 18, 4, 41, 8, 11, 15, 20, 33, 12,…
## $ sex                       <chr> "Female", "Male", "Male", "Male", "Male", "F…
## $ prior_observation         <int> 5910, 2016, 7557, 6764, 1804, 15223, 3154, 4…
## $ future_observation        <int> 4460, 14161, 7005, 4216, 3153, 603, 11242, 2…
## $ in_observation            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
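
The difference between requiring the whole window and requiring only part of it can be sketched in base R. This reflects our reading of the window and completeInterval arguments and is an assumption, not the package's code; in particular, the real function may apply further conditions (for example on the index date itself). The helper in_observation_window() and all dates are hypothetical.

```r
# Illustrative helper: is a person in observation over a window around the
# index date? Dates are converted to day counts so that Inf bounds work.
in_observation_window <- function(index, obs_start, obs_end,
                                  window = c(0, 0), complete_interval = TRUE) {
  window_start <- as.numeric(index) + window[1]
  window_end <- as.numeric(index) + window[2]
  s <- as.numeric(obs_start)
  e <- as.numeric(obs_end)
  if (complete_interval) {
    # The observation period must cover the whole window
    covered <- s <= window_start & e >= window_end
  } else {
    # The observation period only needs to overlap the window
    covered <- s <= window_end & e >= window_start
  }
  as.integer(covered)
}

index <- as.Date("2000-06-01")
obs_start <- as.Date("2000-01-01")
obs_end <- as.Date("2001-12-31")

# Full window of -180 to 30 days: the period starts too late to cover it
in_observation_window(index, obs_start, obs_end, window = c(-180, 30))
# 0

# Any time from one year after the index date onwards
in_observation_window(index, obs_start, obs_end,
                      window = c(365, Inf), complete_interval = FALSE)
# 1
```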

1.3 Adding characteristics to a cohort table

The above functions can be used on both standard OMOP CDM tables and cohort tables. Note that, because the default index date in these functions is “cohort_start_date”, we can omit the indexDate argument when working with a cohort table.

cdm$cohort1 |>
  glimpse()

## Rows: ??
## Columns: 4
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 3, 1, 1, 2, 2, 2, 2, 3, 3, 3, 2, 2, 2, 3, 2, 1, 1…
## $ subject_id           <int> 2637, 4576, 150, 1161, 5225, 6300, 6014, 6612, 28…
## $ cohort_start_date    <date> 1943-01-02, 1939-03-24, 1937-11-03, 1964-07-12, …
## $ cohort_end_date      <date> 1954-02-17, 1941-02-16, 1946-09-30, 1965-04-26, …

cdm$cohort1 <- cdm$cohort1 |>
  addAge(ageGroup = list(
    "0 to 17" = c(0, 17),
    "18 to 65" = c(18, 65),
    ">= 66" = c(66, Inf)
  )) |>
  addSex() |>
  addInObservation() |>
  addPriorObservation() |>
  addFutureObservation()

cdm$cohort1 |>
  glimpse()

## Rows: ??
## Columns: 10
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 2, 2, 1, 3, 3, 1, 1, 1, 3, 3, 2, 2, 2, 2, 3, 2, 2…
## $ subject_id           <int> 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 1…
## $ cohort_start_date    <date> 1981-03-18, 1984-11-05, 1956-05-01, 1952-02-19, …
## $ cohort_end_date      <date> 1994-10-31, 1987-08-07, 1964-01-08, 1954-02-21, …
## $ age                  <int> 24, 18, 11, 5, 29, 32, 33, 32, 23, 16, 18, 25, 6,…
## $ age_group            <chr> "18 to 65", "18 to 65", "0 to 17", "0 to 17", "18…
## $ sex                  <chr> "Male", "Male", "Female", "Female", "Male", "Fema…
## $ in_observation       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ prior_observation    <int> 8842, 6883, 4138, 1875, 10652, 11882, 12262, 1174…
## $ future_observation   <int> 6093, 4665, 6683, 827, 5318, 4833, 4438, 223, 203…

1.4 Getting multiple characteristics at once

When chained together, the above functions each fetch their related information one by one. When we are interested in adding multiple characteristics, we can instead add them all at the same time using the more general addDemographics() function. This will be more efficient than adding the characteristics one at a time, as it requires fewer joins between our table of interest and the person and observation period tables.

cdm$cohort2 |>
  glimpse()

## Rows: ??
## Columns: 4
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 1, 2, 2, 2, 3, 2, 1, 1, 2, 2, 2, 3, 2, 1, 1, 2, 1…
## $ subject_id           <int> 151, 5067, 863, 6376, 2765, 6654, 1440, 5777, 220…
## $ cohort_start_date    <date> 1970-03-02, 1944-02-06, 1969-04-27, 1969-05-02, …
## $ cohort_end_date      <date> 1986-03-24, 1945-05-17, 1993-10-30, 1975-04-28, …

tictoc::tic()
cdm$cohort2 |>
  addAge(ageGroup = list(
    "0 to 17" = c(0, 17),
    "18 to 65" = c(18, 65),
    ">= 66" = c(66, Inf)
  )) |>
  addSex() |>
  addInObservation() |>
  addPriorObservation() |>
  addFutureObservation()

## # Source:   table<og_235_1741167140> [?? x 10]
## # Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
##    cohort_definition_id subject_id cohort_start_date cohort_end_date   age
##                   <int>      <int> <date>            <date>          <int>
##  1                    1          1 1973-06-15        1973-08-14         16
##  2                    1          2 1988-04-05        1997-06-27         22
##  3                    3          3 1974-04-08        1974-05-07         29
##  4                    2          4 1949-08-16        1950-11-27          2
##  5                    2          5 1972-09-08        1987-06-27         24
##  6                    3          6 1964-11-08        1977-05-31          8
##  7                    1          7 1949-01-17        1955-03-10         23
##  8                    3          8 1956-01-11        1958-08-23          0
##  9                    2          9 1993-12-22        2005-11-21         21
## 10                    2         10 1957-01-27        1958-01-19         28
## # ℹ more rows
## # ℹ 5 more variables: age_group <chr>, sex <chr>, in_observation <int>,
## #   prior_observation <int>, future_observation <int>

tictoc::toc()

## 0.446 sec elapsed

tictoc::tic()
cdm$cohort2 |>
  addDemographics(
    age = TRUE,
    ageName = "age",
    ageGroup = list(
      "0 to 17" = c(0, 17),
      "18 to 65" = c(18, 65),
      ">= 66" = c(66, Inf)
    ),
    sex = TRUE,
    sexName = "sex",
    priorObservation = TRUE,
    priorObservationName = "prior_observation",
    futureObservation = FALSE
  ) |>
  glimpse()

## Rows: ??
## Columns: 8
## Database: DuckDB v1.1.0 [root@Darwin 24.3.0:R 4.4.1/:memory:]
## $ cohort_definition_id <int> 1, 1, 3, 2, 2, 3, 1, 3, 2, 2, 1, 3, 1, 3, 3, 2, 2…
## $ subject_id           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 17, 20…
## $ cohort_start_date    <date> 1973-06-15, 1988-04-05, 1974-04-08, 1949-08-16, …
## $ cohort_end_date      <date> 1973-08-14, 1997-06-27, 1974-05-07, 1950-11-27, …
## $ age                  <int> 16, 22, 29, 2, 24, 8, 23, 0, 21, 28, 7, 24, 23, 1…
## $ age_group            <chr> "0 to 17", "18 to 65", "18 to 65", "0 to 17", "18…
## $ sex                  <chr> "Male", "Male", "Female", "Female", "Male", "Fema…
## $ prior_observation    <int> 6009, 8130, 10689, 958, 9017, 3234, 8417, 10, 802…

tictoc::toc()

## 0.17 sec elapsed

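In our small mock dataset the improvement is modest in absolute terms (0.17 versus 0.446 seconds). Note that the addDemographics() call above does not add the in-observation flag or future observation, so the comparison is not exactly like for like. The difference will become much more noticeable when working with real data, which will typically be far larger.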
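
The idea of joining once and deriving several characteristics can be sketched in base R. The data frames below are plain in-memory copies of the first two rows shown earlier for the person, observation period and cohort1 tables. This is only an illustration of the approach, not how addDemographics() is implemented, and the age shortcut assumes birth on 1 January (the mock person table holds only the year of birth).

```r
# First two people as displayed in the earlier glimpse() output
person <- data.frame(
  person_id = c(1L, 2L),
  gender_concept_id = c(8507L, 8507L),
  year_of_birth = c(1957L, 1966L)
)
observation_period <- data.frame(
  person_id = c(1L, 2L),
  observation_period_start_date = as.Date(c("1957-01-01", "1966-01-01")),
  observation_period_end_date = as.Date(c("1997-11-22", "1997-08-14"))
)
cohort <- data.frame(
  subject_id = c(1L, 2L),
  cohort_start_date = as.Date(c("1981-03-18", "1984-11-05"))
)

# Join each source table once, then derive every characteristic from the result
joined <- merge(cohort, person, by.x = "subject_id", by.y = "person_id")
joined <- merge(joined, observation_period, by.x = "subject_id", by.y = "person_id")

# Age in completed years, valid here because birth is taken as 1 January
joined$age <- as.integer(format(joined$cohort_start_date, "%Y")) - joined$year_of_birth

# 8507 and 8532 are the standard OMOP gender concepts for male and female
joined$sex <- ifelse(joined$gender_concept_id == 8507L, "Male",
  ifelse(joined$gender_concept_id == 8532L, "Female", "None")
)

# Days between the start of the observation period and the index date
joined$prior_observation <- as.integer(
  joined$cohort_start_date - joined$observation_period_start_date
)

joined[, c("subject_id", "age", "sex", "prior_observation")]
# subject 1: age 24, Male, 8842 days; subject 2: age 18, Male, 6883 days
```

These values match the first two rows of the cohort1 output shown in section 1.3.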