See https://www.mdpi.com/2227-7390/9/23/3074 for more details about the mathematical background and an overview of the package.
We simulate the Jukes Cantor model of nucleotide replacement.
Denote the set of nucleotides: \(E = \{A,
T, G, C\}\). A DNA sequence is seen as a string of nucleotides:
ATGCATTAC
.
Each in the sequence is subject to mutation over time.
In the Jukes-Cantor model (1969), each site is a Markovian jump process with continuous time having as generator:
\[ Q = \ \ \ \begin{array}{c|cccc} &A&T&G&C \\ \hline A&-3\alpha&\alpha&\alpha&\alpha\\ T&\alpha&-3\alpha&\alpha&\alpha\\ G&\alpha&\alpha&-3\alpha&\alpha\\ C&\alpha&\alpha&\alpha&-3\alpha\\ \hline \end{array}\]
for some \(\alpha >0\).
Let assume that the process is observed over the period \([0,T_{\max}]\) in \(n\) sites with the nucleotide \(A\) at time \(t=0\). In the following, nucleotides are recoded: \(A\leftrightarrow 1\), \(T\leftrightarrow 2\), \(G\leftrightarrow 3\), \(C\leftrightarrow 4\).
K <- 4
PJK <- matrix(1 / 3, ncol = K, nrow = K) - diag(rep(1 / 3, K))
lambda_PJK <- c(1, 1, 1, 1)
Tmax <- 10
n <- 100
d_JK <- generate_Markov(n = n, K = 4, P = PJK, lambda = lambda_PJK, Tmax = Tmax, labels = as.factor(c("A", "C", "G", "T")))
head(d_JK, 10)
## id time state
## 1 1 0.0000000 A
## 2 1 0.1983368 G
## 3 1 0.2365287 T
## 4 1 1.7001558 G
## 5 1 2.1102854 A
## 6 1 2.5485099 T
## 7 1 3.2633724 G
## 8 1 3.3448084 A
## 9 1 3.4654739 T
## 10 1 4.9614496 C
The dataset is a data.frame with 3 columns: id
,
time
, state
. id
contains the id
of the different individuals (usually integers). time
contains the time values at which a change occurs, note that for each
individual, time values are ordered and start at 0. state
contains the state that appears at the given time, state must be
characters, factors, integers. This data format is used by the most part
of cfda’s functions.
## Number of rows: 1082
## Number of individuals: 100
## Time Range: 0 - 9.993531
## Same time start value for all ids: TRUE
## Same time end value for all ids: FALSE
## Number of states: 4
## States:
## A, C, G, T
## Number of individuals visiting each state:
## A C G T
## 100 98 97 92
We can compute the duration of each trajectory.
## 1 2 3 4 5 6
## 9.976904 7.812600 8.871661 8.357921 9.650790 9.993531
All individuals has a different length. The computation of an optimal encoding requires that the end time of each individual is the same (\(T_{\max}\)), this can be done with the following function:
Individuals can be plotted:
Generally, in categorical functional data analysis, the following basic statistics are computed.
For each individual, the time spent in each of the \(K\) states is computed.
## A C G T
## 1 2.827888 0.7229735 1.929005 4.520133
## 2 3.951755 0.3545537 1.769637 3.924055
## 3 1.958321 1.1951070 6.846572 0.000000
## 4 4.718945 2.0622661 2.005165 1.213624
## 5 3.200914 0.1618662 2.296325 4.340895
## 6 4.214955 1.2124305 2.295326 2.277289
## 7 2.757877 4.3517045 1.038824 1.851595
## 8 3.826478 2.1324087 4.041113 0.000000
## 9 1.687807 2.7519813 1.208147 4.352065
## 10 4.788120 1.2389632 3.972917 0.000000
The results can be plotted:
For each individual, the number of jumps occurring in \([0,T_{\max}]\) is computed.
## 1 2 3 4 5 6
## 17 9 3 13 10 13
The results can be plotted:
An other interesting statistic is the evolution of the probability to be in each state.
## 0 0.008 0.038 0.039 0.045 0.058 0.058 0.062 0.07 0.078
## A 1 0.99 0.98 0.97 0.96 0.95 0.94 0.93 0.92 0.91
## C 0 0.01 0.01 0.01 0.01 0.01 0.02 0.02 0.03 0.04
## G 0 0.00 0.00 0.01 0.01 0.01 0.01 0.02 0.02 0.02
## T 0 0.00 0.01 0.01 0.02 0.03 0.03 0.03 0.03 0.03
## [1] 0.000000000 0.007897421 0.038151747 0.038873269 0.044793148 0.058217068
The output is a list with two elements: t
the different
time values and pt
a matrix where each row contains the
probability to be in a given state for all the time values. The result
can be plotted:
The transitions between states can been studied by computing a frequency table counting the number of times each pair of states were observed in successive observation times.
## A C G T
## A 0 110 85 105
## C 71 0 84 78
## G 72 65 0 78
## T 86 81 67 0
Assuming that the data follows a Markov process, parameters \(P\) (transition probability matrix) and \(\lambda\) of the process can be estimated.
## $P
## A C G T
## A 0.0000000 0.3666667 0.2833333 0.3500000
## C 0.3047210 0.0000000 0.3605150 0.3347639
## G 0.3348837 0.3023256 0.0000000 0.3627907
## T 0.3675214 0.3461538 0.2863248 0.0000000
##
## $lambda
## A C G T
## 1.062606 1.086185 1.079873 1.123472
##
## attr(,"class")
## [1] "Markov"
The estimated parameters are closed to the ones used. The results can be plotted through a transition graph where each node corresponds to a state with the mean time spent within (corresponding to \(1/\lambda\)) and arrows correspond to transition probabilities between states.
\(X\) is a continuous stochastic process with jumps. We define categorical functional data as a set of sample paths of \(X\).
cfda is seen as an extension of the multiple correspondence analysis to a stochastic process.
The idea is to find a scalar real random variable \(z \in L_2(\Omega)\) that is the most correlated to \({X} = \{X_{t}\ : \ t\in [0,T_{\max}]\}\).
For a fixed \(t\in [0,T_{\max}]\), let \(E^{t}\) the projection operator, \[E^{t}(z) = \mathbb{E}(z|X_{t}) = \sum_{x\in S}{\mathbb{E}(z|X_t=x) \mathbf{1}_{X_t=x}}\] Then, the correlation coefficient between \(z\) and \(X_{t}\) is: \[\eta^{2}(z, X_{t}) = \frac{var(\mathbb{E}(z|X_{t}))}{var(z)}\]
For \(t, s \in [0,T_{\max}]\), the (simple) correspondence analysis looks for \(z\) such that it maximizes \[\displaystyle\frac{1}{2}\left(\eta^{2}(z, X_{t})+\eta^{2}(z, X_{s})\right)\] and the solution is:
\[(E^{t}+E^{s})z = \lambda z\] \(z\) is called a principal component.
As an extension, the functional correspondence analysis is defined as the solution of the optimization problem:
\[ \arg\max_{z}\displaystyle\frac{1}{T_{max}}\int_{0}^{T_{max}}\eta^{2}(z, X_{t})dt\] Solution: \[\int_{0}^{T_{max}}E^{s}z\ ds = \lambda z\]
Denote \(\xi_t = E^{t}z\),
\[\int_0^{T_{\max}}K(t,s)\xi_s\ ds = \lambda \xi_t, \ \ \ \ \mbox{with}\ K(s,t) = E^tE^s. \]
It follows that, \[\xi_{t} = \sum_{x\in S}a_{x}(t)\mathbf{1}_{X_t=x}\] with \(a_x(t) =\mathbb{E}(z|X_t= x), \ \ \forall x \in S\).
For each \(x\in S\), the functions \(a_x: [0,T_{\max}]\rightarrow \mathbb{R}\) are called optimal encoding of the state \(x\).
The eigenvalue problem becomes in terms of encodings:
\[\lambda a_x(t) = \sum_{y\in S}\displaystyle\int_{0}^{T_{\max}}\frac{p_{x,y}(t,s)}{p_x(t)}a_y(s)ds, \ \ \ \ \ \forall x \in S, \forall t\in [0,T_{\max}], \] where \(p_{x,y}(t,s) = \mathbb{P}(X_t=x, X_s=y)\) and \(p_x(t)= \mathbb{P}(X_t=x)\).
Under general conditions (continuity in probability of \(X\)), there exists \(\{\lambda_i\}_{i\geq1}\) positive eigen-values and \(\{a_x^i\}_{i\geq 1}\) eigen-functions.
The following expansion formula holds (Mercer thm.):
\[p_{x,y}(t,s) = p_x(t)p_y(s)\sum_{i\geq 1}\lambda_{i}a^i_x(t)a^i_y(s)\] For \(x=y\), one obtains: \[p_x(t) =\displaystyle\frac{1}{\displaystyle\sum_{i\geq 1}\lambda_{i}a^i_x(t)a^i_y(s)}\]
For the process \(X = \{X_t, t\in[0, T]\}\) throughout the indicators \(1_x = \{1_x(t), t\in [0, T]\}\): \[1_x(t)=\sum_{i\geq 1} z_i a_i^x(t)\frac{1}{p_x(t)}\]
We are interested in approximating the encoding functions, \(a_x\), by
\[a_{x} \approx \sum_{i=1}^m \alpha_{x,i}\phi_i, \] where \(\mathbf{\alpha_x} = (\alpha_{x,1}, \ldots, \alpha_{x,m}) \in \mathbb{R}^m\) are the expansion coefficients onto a basis of functions defined on \([0,T]\), \(\{\phi_1, \ldots, \phi_m\}\), \(m \geq 1\).
The main result (Deville, 1982) is that \(\alpha = (\alpha_{1}, \ldots, \alpha_{K}) \in \mathbb{R}^{Km}\) is the solution of the following eigen-value problem:
\[F^{-1}G\alpha = \lambda\alpha, \] where \(G\) and \(F\) the matrix defined by: \[G = cov(\{V_{(x,i)}\}_{x\in S; i=1:m }),\] \[F = \mathbb{E}\left(\{F_{(x,i),(y,j)}\}_{x,y\in S ; i,j=1:m}\right)\] with the random variables \(V_{(x,i)}\) and \(F_{(x,i),(x,j)}\) defined by: \[V_{(x,i)} = \int_{0}^{T_{\max}}\phi_i(t)\mathbf{1}_xdt \ \ \ \ \mbox{and}\ \ \ \ F_{(x,i), (y,j)} = \int_{0}^{T_{\max}}\phi_i(t)\phi_{j}(t)\mathbf{1}_x\mathbf{1}_y dt\]
Firstly, a basis of functions must be defined. We choose a B-splines basis of order 4.
The optimal encoding is computed using:
## #### FMCA
##
## ## Data
## Number of individuals: 100
## Number of states: 4
## Time Range: 0 to 10
## States: 1 2 3 4
##
## ## Basis
## Type: bspline
## Number of basis functions: 20
##
## ## Outputs
## Eigenvalues:
## 1.979971 1.87976 1.645708 1.497216 1.370031 1.316879
##
## Explained variance:
## 0.078 0.152 0.217 0.276 0.33 0.382
##
## Optimal encoding:
## A C G T
## [1,] 0.008180893 -0.58909343 0.4766992 -0.40020013
## [2,] 0.014515597 0.52566581 -0.6950777 0.17637761
## [3,] -0.009084061 -0.37942555 0.3985981 -0.02443008
## [4,] -0.069652888 -0.67642599 0.3465354 0.57931109
## [5,] -0.137618298 -0.14330188 0.2425508 0.40511150
## [6,] -0.309208593 0.05087257 -0.1190275 0.32635795
##
## Principal components:
## [,1] [,2] [,3] [,4] [,5] [,6]
## 1 0.3704565 -0.7461855 -2.3826967 -0.817332429 0.06146912 -0.04419096
## 2 0.6649154 -1.0872446 -1.8537400 0.007899832 -0.12504207 0.22016509
## 3 -3.8819516 1.2971014 0.4180643 -0.183369945 0.75743237 0.31311758
## 4 1.0941084 0.2797036 -0.6957583 1.762177272 0.54104766 2.23782009
## 5 0.1747660 -2.4542637 -1.5757551 -0.288176353 -0.70892187 0.30844587
## 6 0.1038436 1.1451010 -1.5909369 0.362205655 -0.75526301 -0.73604342
##
## Total elapsed time: 42.303 s
By default this function computes bootstrap estimates of the encoding
functions in order to have a confidence interval. This is controlled by
the computeCI
arguments. The output is a list containing
the different elements computed during the process:
eigenvalues
, pc
, alpha
,
F
, G
, V
and
basisobj
. Note that alpha
contains the
coefficients of the different encoding functions and pc
the
principal components. These components can be used with classical
statistic methods (k-means, regression…).
The eigenvalues can be computed using:
The first encoding function coefficients \(\alpha\)’s:
## A C G T
## [1,] 0.008180893 -0.58909343 0.47669920 -0.400200126
## [2,] 0.014515597 0.52566581 -0.69507769 0.176377614
## [3,] -0.009084061 -0.37942555 0.39859810 -0.024430084
## [4,] -0.069652888 -0.67642599 0.34653541 0.579311087
## [5,] -0.137618298 -0.14330188 0.24255075 0.405111498
## [6,] -0.309208593 0.05087257 -0.11902753 0.326357951
## [7,] -0.345337542 0.34510999 -0.01194049 0.059414703
## [8,] 0.411515257 0.18633306 -1.01542291 -0.004035332
## [9,] 0.066499135 0.38956072 -0.75060425 0.169082665
## [10,] 0.442805356 0.32496793 -0.64164483 0.058034297
## [11,] 0.137025350 0.66080911 -0.88516584 -0.033464535
## [12,] 0.221128180 0.59734105 -0.61281881 -0.226002336
## [13,] -0.023061199 0.41607362 -0.56939698 -0.137096230
## [14,] -0.136462856 0.75635141 -0.34679873 0.019760886
## [15,] -0.004898823 -0.45158932 -0.17783513 0.687183675
## [16,] 0.211932955 0.14230107 -0.31096835 0.028668625
## [17,] -0.164553063 -0.25713493 -0.23598583 0.760837273
## [18,] -0.208792890 0.08073523 0.02635330 0.104904414
## [19,] -0.166480097 -0.27899968 -0.25899954 0.692396872
## [20,] -0.236574942 -0.06124286 -0.02169858 0.379020810
The resulting encoding can be plotted:
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).
or extracted as a fd object (or matrix):
Note that the first encoding mainly oppose the G state (in negative values) to other states. So, on the first component, individuals with a large negative values will tend to spend time in the G state after time 2.5.
By default, it plots (and returns) the first encoding functions.
Other encoding can be accessed with the harm
argument.
A confidence interval can be plotted using the addCI
argument:
Plot the two first components:
We can see that the individuals 3 and 14 have extreme ngative values for the first component and are opposed to 67 and 84. By plotting them, we can check the statement made on the first component meaning.
The components can be used for some other tasks such as clustering, predicting,…
With the encodings, the indicators can be reconstructed as follows:
## [1] "Reconstruct data using 60 components (out of 80)"
## time id stateA stateC stateG stateT state
## 1 0.000000000 1 1.231208 5.258831e-05 0.06839379 -0.2284427 G
## 2 0.007897421 1 1.216876 5.248673e-05 0.07177822 -0.2270982 G
## 3 0.038151747 1 1.123116 4.903944e-05 0.08600745 -0.2066952 G
## 4 0.038873269 1 1.120306 4.888438e-05 0.08636713 -0.2058505 G
## 5 0.044793148 1 1.096537 4.746114e-05 0.08934315 -0.1981844 G
## 6 0.058217068 1 1.038911 4.316210e-05 0.09619778 -0.1755988 G
We see that the reconstructed indicators gives the same inforamtion as the true ones. Using more basis functions for computing encoding or other basis can improve the approximation.