Introduction

The traj package implements a clustering algorithm for functional data (henceforth referred to as trajectories) that extends previous work from Leffondré et al [1]. This algorithm is comprised of three steps. The first step sumarizes the main features of the trajectories by computing the 18 measures listed below and detailed in Appendix A.

  1. Maximum
  2. Range (max - min)
  3. Mean value
  4. Standard deviation
  5. Slope of the linear model
  6. \(R^2\): Proportion of variance explained by the linear model
  7. Curve length (total variation)
  8. Rate of intersection with the mean
  9. Proportion of time spent above the mean
  10. Minimum of the first derivative
  11. Maximum of the first derivative
  12. Mean of the first derivative
  13. Standard deviation of the first derivative
  14. Minimum of the second derivative
  15. Maximum of the second derivative
  16. Mean of the second derivative
  17. Standard deviation of the second derivative
  18. Later change/Early change

The second step performs dimensionality reduction on the 18 measures to extract the main features of the trajectories. Specifically,

  1. Measures that are constant across trajectories are discarded because they do not provide any discriminating information.
  2. A principal component analysis is conducted on the measures. The main principal components (defined as those principal components contributing more to the total variance than any of the individual standardized measures) are selected.
  3. A varimax rotation is applied to the main principal components.
  4. The measure which is most strongly correlated with each rotated component is selected, starting from the component that explains the most variance.

In the third step, the k-means algorithm is used to cluster the trajectories based on the measures selected in step 2.

An example

Let us illustrate how to use the traj package on an artificially created dataset (trajdata) comprised of 130 trajectories following four distinct patterns (A, B, C, D).

library(traj)
data(trajdata) 
head(trajdata)
dat <- trajdata[, -c(1:2)]

Each trajectory is made up of six observations and there are no missing values. The function Step1Measures computes the measures. By default, measure 18 (Early change/Later change) is not included in the analysis. This is because, depending on the situation (uneven observation times, attrition, missing values, etc.), there might not be, for each trajectory, a natural midpoint. In the present data set, we include measure 18. By leaving the ‘midpoint’ argument at its default of NULL, the third observation will be taken as the midpoint.

step1 <- Step1Measures(Data = dat, 
                       ID = FALSE, 
                       measures = 1:18, 
                       midpoint = NULL) 

summary(step1)
## Description of the measures:
## m1: Maximum
## m2: Range
## m3: Mean value
## m4: Standard deviation
## m5: Slope of the linear model
## m6: Proportion of variance explained by the linear model (R squared)
## m7: Curve length (total variation)
## m8: Number of times crossing the mean per unit time
## m9: Proportion of time spent under the mean
## m10: Minimum of the first derivative
## m11: Maximum of the first derivative
## m12: Mean of the first derivative
## m13: Standard deviation of the first derivative
## m14: Minimum of the second derivative
## m15: Maximum of the second derivative
## m16: Mean of the second derivative
## m17: Standard deviation of the second derivative
## m18: Early change/Later change
## 
## Summary of measures:
##                m1         m2        m3        m4          m5           m6
## Min.     11.16944   4.974832  6.734208  1.655156 -24.7364594 6.548737e-05
## 1st Qu.  17.44338  13.596531 10.683520  4.608162 -18.0088233 2.119554e-01
## Median   86.89627  80.567344 49.521374 24.392691 -14.3756056 8.825381e-01
## Mean     63.71865  61.281725 37.032177 18.761021  -8.8673769 6.408343e-01
## 3rd Qu.  95.64244  94.429695 55.289514 29.552200   0.1682977 9.319712e-01
## Max.    113.38473 128.477551 61.923002 37.161789  16.2687616 9.903453e-01
##                m7        m8        m9        m10        m11          m12
## Min.     11.01804 0.2000000 0.2000000 -63.352715 -14.935763 -25.69551019
## 1st Qu.  33.25262 0.2000000 0.5000000 -33.047714  -5.681113 -18.46646486
## Median   81.92250 0.2000000 0.5000000 -21.252427   2.160407 -14.91259635
## Mean     70.27963 0.3507692 0.5261538 -20.647836   2.202651  -8.98945884
## 3rd Qu. 102.00426 0.6000000 0.6000000  -5.955160   6.702146   0.04639682
## Max.    137.85654 1.0000000 0.9000000   9.971951  44.466830  17.57402397
##               m13        m14       m15         m16        m17          m18
## Min.     1.243051 -27.917171 -7.326749 -12.5077887  0.6418593 -46.28134932
## 1st Qu.  4.101053 -12.992123  1.919339  -4.0529737  3.1900188  -0.01697665
## Median   6.502801  -9.027665  4.740119  -1.5795261  4.8789085   1.12915597
## Mean     7.207440  -9.280153  6.277813  -1.6084247  5.2729286   1.77652097
## 3rd Qu.  9.685304  -4.659910 10.095925   0.9862174  6.8043731   2.81539207
## Max.    18.868623   2.347271 27.813294  11.9407699 15.3664412  49.83439126
## 
## Outliers before capping:
##   ID    m18
##   23 -177.0
##  110   95.2
## Outliers after capping:
##   ID   m18
##   23 -46.3
##  110  49.8

Two outlying values were detected in measure 18 of trajectories ID23 and ID110 and were capped to more likely values according to the distribution of m18 (see Appendix B). Once the measures are computed, we use the Step2Selection function to extract the measures that best characterize the trajectories.

step2 <- Step2Selection(trajMeasures = step1) 
summary(step2)
## The measures m4, m12 were discarded because they were perfectly or almost perfectly correlated with another measure. Upon forming the principal components from the remaining measures, 4 of them had a variance greater than any one of the normalized measures. Together, they explained 83% of the total variance. A varimax rotation was performed to maximize the correlation with the original measures without affecting the proportion of explained variance. For each rotated factor, the measure that had the highest correlation (loading) with it was selected. As a result of this procedure, the selected measures are, in decreasing order of variance explained, m6, m14, m11, m15. Use print() to see more detailed informations.

Two measures are defined as “perfectly or almost perfectly correlated” if the absolute value of their Pearson correlation coefficient is greater than 0.98. The print function provides more detailed information:

print(step2)
## m4 has been discarded due to being perfectly or almost perfectly correlated with m2.
## m12 has been discarded due to being perfectly or almost perfectly correlated with m5.
## 
## In decreasing order of variance explained, the selected measures are m6, m14, m11, m15.
## 
## Loadings:
##     RC1    RC4    RC3    RC2   
## m1   0.807  0.411 -0.348       
## m2   0.738  0.520 -0.393       
## m3   0.850  0.390 -0.261       
## m5  -0.295 -0.391  0.826       
## m6   0.863  0.248 -0.317       
## m7   0.646  0.628 -0.366       
## m8  -0.831 -0.131  0.216       
## m9   0.446         0.186  0.149
## m10 -0.272 -0.603  0.699       
## m11                0.952 -0.118
## m13  0.416  0.811 -0.100       
## m14 -0.213 -0.871         0.314
## m15  0.106  0.175         0.959
## m16 -0.135 -0.470         0.846
## m17  0.218  0.809         0.488
## m18  0.184        -0.316       
## 
##                  RC1   RC4   RC3   RC2
## SS loadings    4.462 3.908 2.860 2.045
## Proportion Var 0.279 0.244 0.179 0.128
## Cumulative Var 0.279 0.523 0.702 0.830

Measure 4 (Standard deviation) was dropped because it is perfectly or almost perfectly correlated with measure 2 (Range) and measure 12 (Mean of the first derivative) was dropped because it is perfectly or almost perfectly correlated with measure 5 (Slope of the linear model). The Step3Clusters function uses the k-means algorithm (function stats:::kmeans) on the measures selected in step 2 to cluster the trajectories.

library(cluster)
set.seed(141114)
step3 <- Step3Clusters(trajSelection = step2, B = 20, nclusters = NULL) # B = 20 is used in the interest of time; the default is B = 500.

By setting nclusters to NULL, Step3Clusters uses the criterion proposed by Tibshirani et al. [2] to select the optimal number of clusters. In our case, this gives \(k=4\) as can be inspected visually with

par(mfrow = c(1, 1))
plot(step3, which.plots = 1, ask = FALSE)

To visually inspect the classification, we write plot(step3, ask = TRUE).

The “Sample trajectories” plot tends to get cluttered when there are too many clusters. In any case, it is always a good idea to plot the whole clusters:

color.pal <- palette.colors(palette = "Okabe-Ito", alpha = 1)
par(mfrow = c(1, 1))
for(k in 1:4){
  w <- which(step3$partition$Cluster == k)
  dat.w <- dat[w, ]
  plot(y = 0, x = 0, ylim = c(floor(min(dat)), ceiling(max(dat))), xlim = c(1,6), xlab="", ylab="", type="n", main = paste("Cluster ", k, " (n = ", step3$partition.summary[k], ")", sep = ""))
  for(i in 1:length(w)){
    lines(y = dat.w[i, ], x = 1:6, col = color.pal[k])
  }
}


Appendix A: The measures

In this section, we expand on how the eighteen measures are computed. Let \(y=y(t)\) denote a continuous function \([a,b]\rightarrow \mathbb{R}\) and let \(y(t_i)\) denote the trajectory obtained by measuring \(y(t)\) at times \(a\leq t_1<\ldots< t_N\leq b\), where \(N\geq 3\). We do not assume that the times \(t_i\) are equidistant from one another.

Appendix B: The capping procedure

In a recent paper published on the arXiv [3], the author proves that for any continuous random variable \(X\) with finite first two moments there holds \[ P[|X-\mu|>k\sigma]<\sigma M(k)\alpha(k), \]where \(M(k)\) is the least upper bound of the probability density function on \(\{x \ | \ |x-\mu|>k\sigma\}\) and where \(\alpha(k)\) is the unique real root of the cubic polynomial\[t^3 + 2\pi kt^2 + 2\pi e k^2t - \frac{2\pi e}{\sigma M(k)}.\]Suppose that our data set contains \(n\) trajectories. This gives us a sample \(X_1,\ldots,X_n\) from the distribution of \(X\). Our capping procedure consists of the following steps.

  1. After relabeling the observed values of \(X\) so that \(|X_1|\leq |X_2|\leq\ldots \leq |X_n|\), remove the last \(r=\min(1,n/100)\) observations.

  2. From the remaining values \(X_1,\ldots, X_{n-r}\), we compute estimates \(\hat{\mu}\) and \(\hat{\sigma}\) of the mean and standard deviation as usual.

  3. Still using only \(X_1,\ldots, X_{n-r}\), approximate the probability density function of \(X\) using a kernel density estimator (function stats:::density), find the value of \(M(k)\) for this PDF and compute \(\alpha(k)\).

  4. Using these, identify the smallest value of \(k\) for which \(\hat{\sigma}M(k)\alpha(k)<0.003\). If \(k^*\) denotes this smallest value of \(k\), replace the value of any \(X_i\) with \(X_i<\hat{\mu}-k^*\hat{\sigma}\) by \(\hat{\mu}-k^*\hat{\sigma}\) and we replace the value of any \(X_i\) with \(X_i>\hat{\mu}+k^*\hat{\sigma}\) by \(\hat{\mu}+k^*\hat{\sigma}\) .

Bibliography

[1] Leffondré K, Abrahamowicz M, Regeasse A, Hawker GA, Badley EM, McCusker J, Belzile E. Statistical measures were proposed for identifying longitudinal patterns of change in quantitative health indicators. J Clin Epidemiol. 2004 Oct; 57(10):1049-62. doi: 10.1016/j.jclinepi.2004.02.012. PMID: 15528056.

[2] Tibshirani R, Walther G, Hastie T. Estimating the Number of Clusters in a Data Set Via the Gap Statistic, Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 63, Issue 2, July 2001, Pages 411–423

[3] Nishiyama T. Improved Chebyshev inequality: new probability bounds with known supremum of PDF arXiv:1808.10770v2