This vignette gives you the knowledge you need to create your own diseasystore.
Once you have familiarised yourself with the concepts, you can consult the vignette("extending-diseasystore-example"), where we go through how an individual-level diseasystore can be implemented.
To begin, we go through the data model used within the diseasystores. It is this data model that enables the automatic coupling of features and powers the package.
The data created by diseasystores are so-called “bitemporal” data. This means we have two temporal dimensions: one representing the validity of the record, and one representing the availability of the record.
valid_from and valid_until
The validity dimension indicates when a given data point is “valid”, e.g. a hospitalisation is valid between the admission and discharge dates. This temporal dimension should be familiar to you, as it is simply “regular” time.
We encode the validity information into the columns valid_from and valid_until such that a record is valid for any time t which satisfies valid_from <= t < valid_until. For many features, the validity is a single day (such as a test result) and the valid_until column will be the day after valid_from.
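The half-open interval can be checked directly on a small table. The following is a minimal sketch in base R; the table, the key_pnr key column and the helper is_valid_at() are hypothetical names chosen for illustration and are not part of diseasystore.

```r
# Toy records (hypothetical data): a 4-day hospitalisation, a 1-day
# hospitalisation and a single-day test result
records <- data.frame(
  key_pnr     = 1:3,
  valid_from  = as.Date(c("2020-03-01", "2020-03-03", "2020-03-10")),
  valid_until = as.Date(c("2020-03-05", "2020-03-04", "2020-03-11"))
)

# A record is valid at time t when valid_from <= t < valid_until
# (is_valid_at() is a hypothetical helper, not a diseasystore function)
is_valid_at <- function(data, t) {
  data$valid_from <= t & t < data$valid_until
}

# Keys of the records that are valid on 2020-03-04
valid_keys <- records$key_pnr[is_valid_at(records, as.Date("2020-03-04"))]
```

Note that the second record is not valid on 2020-03-04: its valid_until is exclusive, so it only covers 2020-03-03.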
By convention, we place these columns as the last columns of the table[1].
from_ts and until_ts
diseasystore uses {SCDB} in the background to store the computed features. {SCDB} implements the second temporal dimension, which indicates when a record was present in the data. This information is encoded in the columns from_ts and until_ts. Normally, you don’t see these columns when working with diseasystore since they are masked by {SCDB}. However, if you inspect the tables created in the database by diseasystore, you will find they are present.
For our purposes, it is sufficient to know that these columns give a time-versioned database from which we can extract previous versions through the slice_ts argument. By supplying any time τ as slice_ts, we get the data as they were available on that date. This allows us to build continuous integration of our features while preserving previously computed features.
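To make the idea of a time-versioned table concrete, the sketch below mimics a slice in base R. It is not how {SCDB} is implemented: the table, the helper slice_at() and the use of NA in until_ts to mark the currently available version are assumptions made for illustration.

```r
# Two versions of the same record (hypothetical data): the count was
# revised from 2 to 3 on 2020-03-08
versions <- data.frame(
  key_pnr  = c(1L, 1L),
  n_tests  = c(2L, 3L),
  from_ts  = as.POSIXct(c("2020-03-01 09:00:00", "2020-03-08 09:00:00"), tz = "UTC"),
  until_ts = as.POSIXct(c("2020-03-08 09:00:00", NA), tz = "UTC")
)

# Return the rows that were available at slice_ts and hide the
# versioning columns (slice_at() is a hypothetical helper)
slice_at <- function(data, slice_ts) {
  slice_ts <- as.POSIXct(slice_ts, tz = "UTC")
  # Assumption: an NA in until_ts marks a version that is still current
  keep <- data$from_ts <= slice_ts &
    (is.na(data$until_ts) | slice_ts < data$until_ts)
  data[keep, setdiff(names(data), c("from_ts", "until_ts")), drop = FALSE]
}

old_version <- slice_at(versions, "2020-03-05 00:00:00")
new_version <- slice_at(versions, "2020-03-10 00:00:00")
```

Here old_version holds the count as it was known on 5 March, while new_version holds the revised count.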
A primary feature of diseasystore is its ability to automatically couple and aggregate features. This coupling requires common “key_*” columns between the features. Any feature in a diseasystore must therefore have at least one “key_*” column. By convention, we place these columns as the first columns of the table.
Finally, we come to the main data of the diseasystore, namely the features. First, a reminder that “feature” here comes from machine learning and is any individual piece of information.
We subdivide features into two categories: “observables” and “stratifications”. On most levels, these are indistinguishable, but their purposes differ and hence we need to handle them individually.
To see the available features of a diseasystore, you can use the ?DiseasystoreBase$available_features() method.
In diseasystore, any feature whose name starts with “n_” is treated as an “observable” (by default). For specific diseasystores, the naming convention may differ. From a modelling perspective, these observables are typically the metrics you want to model or take as inputs to inform your model.
To see the available observables in a diseasystore, you can use the ?DiseasystoreBase$available_observables() method.
Conversely, any other feature is a “stratification” feature. These features are the variables used to subdivide your analysis to match the structure of your model (hence why they are called stratification features).
A prominent example for most disease models would be a stratification feature like “age_group”, since most diseases show a strong dependency on the age of the affected individuals.
To see the available stratifications in a diseasystore, you can use the ?DiseasystoreBase$available_stratifications() method.
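The default split between observables and stratifications can be sketched with a regular expression in base R. The feature names below are illustrative and the pattern "^n_" simply restates the default convention described above; this is not the package's internal code.

```r
# Illustrative feature names (assumed for this sketch)
feature_names <- c("n_population", "n_hospital", "age_group", "country_id")

# Default convention: names starting with "n_" are observables
observables_regex <- "^n_"
is_observable <- grepl(observables_regex, feature_names)

observables     <- feature_names[is_observable]
stratifications <- feature_names[!is_observable]
```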
While there is no formal requirement for the naming of the observables or stratifications, it is considered best practice to use the same names as other diseasystores for features where possible[2]. This simplifies the process of adapting analyses and disease models to new diseasystores.
To facilitate the automatic coupling and aggregation of features, we use the ?FeatureHandler class. Each feature[3] in the diseasystore has an associated ?FeatureHandler which implements the computation, retrieval and aggregation of the feature.
The ?FeatureHandler defines a ?FeatureHandler$compute() function which must be of the form:
compute = function(start_date, end_date, slice_ts, source_conn, ...)
The arguments start_date and end_date indicate the period for which features should be computed. The diseasystores are dynamically expanded, so feature computation is often restricted to limited time intervals as indicated by start_date and end_date.
As mentioned above, slice_ts specifies what date the features should be computed for.
E.g. if slice_ts is the current date, the current features should be computed. Conversely, if slice_ts is some past date, features corresponding to this date should be computed.
Lastly, the source_conn is a flexible argument passed to the FeatureHandler indicating where the source data needed to compute the features is stored (e.g. a database connection or directory).
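As an illustration, a compute function with this signature could look like the sketch below. Everything specific here is an assumption: source_conn is taken to be a plain data.frame, the columns test_date, test_result and available_from are hypothetical, and end_date is treated as inclusive.

```r
# Hypothetical source data: one row per test, with the date on which
# the row became available in the source
source_data <- data.frame(
  key_pnr        = c(1L, 2L, 3L),
  test_date      = as.Date(c("2020-03-01", "2020-03-05", "2020-04-01")),
  test_result    = c("positive", "negative", "negative"),
  available_from = as.Date(c("2020-03-02", "2020-03-06", "2020-04-02"))
)

compute_tests <- function(start_date, end_date, slice_ts, source_conn, ...) {
  # Keep what was known at slice_ts and falls within the requested period
  # (inclusive end_date is an assumption of this sketch)
  keep <- source_conn$available_from <= as.Date(slice_ts) &
    source_conn$test_date >= start_date &
    source_conn$test_date <= end_date
  out <- source_conn[keep, , drop = FALSE]

  # Key columns first, validity columns last; a test is valid for one day
  data.frame(
    key_pnr     = out$key_pnr,
    test_result = out$test_result,
    valid_from  = out$test_date,
    valid_until = out$test_date + 1
  )
}

tests_march <- compute_tests(
  start_date  = as.Date("2020-03-01"),
  end_date    = as.Date("2020-03-31"),
  slice_ts    = "2020-03-03",
  source_conn = source_data
)
```

With slice_ts set to 2020-03-03, only the first test is returned, since the second was not yet available on that date and the third falls outside the requested period.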
Note that multiple features can be computed by a single ?FeatureHandler. For example, you may decide that it is more convenient to compute multiple different features simultaneously (e.g. a hospitalisation and the classification of said hospitalisation or a test and the associated test result).
When ?FeatureHandler$compute() is called by the diseasystore, it also passes a reference to itself as ds via the ... argument. This means that if the implementation of ?FeatureHandler$compute() needs access to other features to compute the given feature, the compute function can pick up the ds reference by adding it to the function signature:
compute = function(start_date, end_date, slice_ts, source_conn, ds, ...)
And then use
ds$get_feature(<feature>, start_date = start_date, end_date = end_date, slice_ts = slice_ts)
to retrieve the necessary features for the computation.
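The pattern can be tried out without a database by passing a stand-in for ds. In the sketch below, mock_ds is a plain list that only imitates the get_feature() call shown above; the "birth" feature, the derived "birth_cohort" feature and the use of NA for open-ended validity are assumptions made for illustration.

```r
# Stand-in for the diseasystore reference (the real ds is an R6 object)
mock_ds <- list(
  get_feature = function(feature, start_date, end_date, slice_ts) {
    # Hypothetical "birth" feature; NA marks open-ended validity here
    data.frame(
      key_pnr     = c(1L, 2L),
      birth       = as.Date(c("1957-01-01", "2003-06-15")),
      valid_from  = as.Date(c("1957-01-01", "2003-06-15")),
      valid_until = as.Date(c(NA, NA))
    )
  }
)

# Derive a birth cohort (decade of birth) from the "birth" feature
compute_birth_cohort <- function(start_date, end_date, slice_ts, source_conn, ds, ...) {
  birth <- ds$get_feature(
    "birth",
    start_date = start_date, end_date = end_date, slice_ts = slice_ts
  )
  birth_year <- as.integer(format(birth$birth, "%Y"))
  data.frame(
    key_pnr      = birth$key_pnr,
    birth_cohort = (birth_year %/% 10L) * 10L,
    valid_from   = birth$valid_from,
    valid_until  = birth$valid_until
  )
}

cohorts <- compute_birth_cohort(
  start_date  = as.Date("2020-03-01"),
  end_date    = as.Date("2020-03-31"),
  slice_ts    = "2020-04-01",
  source_conn = NULL,
  ds          = mock_ds
)
```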
The ?FeatureHandler defines a ?FeatureHandler$get() function which must be of the form:
get = function(target_table, slice_ts, target_conn)
Typically, you do not need to specify this function since the default (a variant of SCDB::get_table()) works in most cases.
However, in the case that you do need to specify it, the target_table argument will be a DBI::Id specifying the location of the database table where the features are stored. target_conn is the connection to the database. And as above, slice_ts is the time-keeping variable.
The ?FeatureHandler defines a key_join function which must be of the form:
key_join = function(.data, feature)
In most cases, you should be able to use the bundled key_join_* functions (see ?aggregators for a full list).
In the event that you need to create your own aggregator, the arguments are as follows:
.data is a grouped data.frame whose groups are those specified by the stratification argument (see Automatic aggregation).
feature is the name of the feature(s) to aggregate.
Your aggregator should return a dplyr::summarise() call that operates on all columns specified in the feature argument.
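A custom aggregator following this description might be sketched as below. The name key_join_mean and the toy data are hypothetical, and the body is only one way to satisfy the description above; consult ?aggregators for how the bundled functions are written.

```r
library(dplyr)

# Hypothetical aggregator: average the feature column(s) within each group
key_join_mean <- function(.data, feature) {
  dplyr::summarise(
    .data,
    dplyr::across(dplyr::all_of(feature), ~ mean(.x, na.rm = TRUE)),
    .groups = "drop"
  )
}

# Toy data grouped by a stratification (names are assumptions)
toy <- data.frame(
  age_group = c("0-59", "0-59", "60+"),
  n_tests   = c(1, 3, 5)
)
aggregated <- key_join_mean(dplyr::group_by(toy, age_group), "n_tests")
```

The result has one row per group, with the feature column replaced by its aggregated value.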
By now, you should know the basics of creating your own ?FeatureHandlers.
For a detailed walkthrough on creating a diseasystore, see the vignette("extending-diseasystore-example").
To see some other ?FeatureHandlers in action, you can consult a few of those bundled with the diseasystore package.
For example:
diseasystore
With the knowledge of how to build custom ?FeatureHandlers, we turn our attention to the remaining parts of the diseasystore’s anatomy.
The diseasystores are R6 classes, an implementation of object-oriented (OO) programming. To those unfamiliar with OO programming, the diseasystores are single “objects” with a number of “public” and “private” functions and variables. The public functions and variables are visible to the user of the diseasystore, while the private functions and variables are visible only to us (the developers).
When extending diseasystore, we are only writing private functions and variables. The public functions and variables are handled elsewhere[4].
The ds_map field of the diseasystore tells the diseasystore which ?FeatureHandler is responsible for each feature, thus allowing the diseasystore to retrieve the features specified in the observable and stratification arguments of calls to ?DiseasystoreBase$get_feature().
In other words, it maps the names of features to their corresponding ?FeatureHandlers.
As we saw above, a ?FeatureHandler may compute more than a single feature. Each feature should be mapped to the ?FeatureHandler here or else the diseasystore will not be able to automatically interact with it.
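Conceptually, such a mapping is a named list from feature names to handler names. The sketch below uses a hypothetical "mydisease_" prefix and made-up features to show two features sharing one handler; it is an illustration of the idea rather than the field as it appears in any bundled diseasystore.

```r
# Hypothetical mapping: feature name -> name of the responsible handler
ds_map <- list(
  "n_hospital"    = "mydisease_hospital",
  "hospital_type" = "mydisease_hospital",   # computed by the same handler
  "age_group"     = "mydisease_age_group"
)

# Look up the handler responsible for a feature
handler_name <- ds_map[["hospital_type"]]

# List every feature computed by a given handler
hospital_features <- names(ds_map)[unlist(ds_map) == "mydisease_hospital"]
```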
By convention, the name of the ?FeatureHandler should be snake_case and contain a diseasystore-specific prefix (e.g. for ?DiseasystoreGoogleCovid19, all ?FeatureHandlers are named “google_covid_19_
These names are used as the table names when storing the features in the database, and the prefix helps structure the database accordingly.
This latter part becomes important when the database needs to be cleaned up.
By default, any feature whose name starts with “n_” is treated as an observable feature. To override this behaviour, you can specify the regex pattern $observables_regex to match the names of the observable features in your case.
The diseasystores are made to be as flexible as possible, which means that they can incorporate both individual-level data and semi-aggregated data. For semi-aggregated data, it is often the case that the data includes aggregations at different levels, nested within the data.
For example, the Google COVID-19 data repository contains information at both the country level and the region level in the same data files. When the user of ?DiseasystoreGoogleCovid19 asks to get a feature stratified by, for example, “country_id”, we need to filter out the data aggregated at the region level.
This is the purpose of ?DiseasystoreBase$key_join_filter(). It takes as input the requested stratifications and filters the data accordingly after the features have been joined inside the diseasystore.
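The effect of such a filter can be sketched on a toy table in base R. The function name, its signature, the column names and the data below are all assumptions for illustration; the actual implementation in ?DiseasystoreGoogleCovid19 may differ.

```r
# Toy joined data with nested aggregation levels (hypothetical values):
# the country-level row for "DK" already contains both regions
joined <- data.frame(
  key_location = c("DK", "DK-81", "DK-82", "SE"),
  country_id   = c("DK", "DK", "DK", "SE"),
  region_id    = c(NA, "DK-81", "DK-82", NA),
  n_hospital   = c(10, 4, 6, 7)
)

# Hypothetical filter: keep only the aggregation level that matches
# the requested stratifications
key_join_filter_sketch <- function(.data, stratification_features) {
  if ("region_id" %in% stratification_features) {
    .data[!is.na(.data$region_id), , drop = FALSE]  # region-level rows
  } else {
    .data[is.na(.data$region_id), , drop = FALSE]   # country-level rows
  }
}

by_country <- key_join_filter_sketch(joined, "country_id")
by_region  <- key_join_filter_sketch(joined, c("country_id", "region_id"))
```

Without the filter, summing n_hospital for "DK" would count the regional rows on top of the country total.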
For an example, you can consult DiseasystoreGoogleCovid19: key_join_filter
diseasystore
The diseasystore package includes the function test_diseasystore() to test the diseasystores.
You can see how to call the testing suite in action with ?DiseasystoreGoogleCovid19 as an example here.
To allow the diseasystores to be used programmatically, we expose the period of data availability for each diseasystore. These are defined in the $.min_start_date and $.max_end_date private fields of the diseasystore.
In some cases, the diseasystores may not be compatible with all database backends. For example, the bundled DiseasystoreSimulist (see vignette("extending-diseasystore-example")) is not compatible with SQLite due to lack of date support.
In this case, we add a check to the initialize method of the diseasystore to ensure that the database backend is compatible with the diseasystore.
initialize = function(...) {
  super$initialize(...)

  # We do not support SQLite for this diseasystore since it has poor support
  # for date operations
  checkmate::assert_disjunct(class(self$target_conn), "SQLiteConnection")
  ...
}
[1] The {SCDB} package places checksum, from_ts, and until_ts as the last columns. But valid_from and valid_until should be the last columns in the output passed to SCDB.
[2] In practice, this means that the names of features should be in snake_case.
[3] Or “coupled” set of features as we will soon see.
[4] By the ?DiseasystoreBase class.