A guide on replacing stats::hclust with hclust1d

This document

The purpose of this vignette is to provide guidelines on replacing stats::hclust calls with hclust1d calls for univariate (1D) data, in a plug-and-play manner, i.e. without changing any surrounding code (or little of the surrounding code, as an option to the programmer).

To enable use of hclust1d you need to include this line in your script or markdown notebook:

library(hclust1d)

In case of packages you need to import hclust1d in your DESCRIPTION file.

Why bother?

The main reason a programmer might want to replace stats:hclust calls with hclust1d in case of clustering 1D points is that the computational complexity of hclust1d is \(\mathcal{O}(n\log n)\), while it is at least quadratic (and \(\mathcal{O}(n^2\log n)\) for case of general linkage) for multidimensional algorithms.

The better algorithmic time complexity (compared to multidimensional hierarchical clustering) paired with its efficient C++ implementation make hclust1d very fast. The computational time beats stats::hclust on all sizes of data and is en par with fastcluster::hclust with small data sizes. However, it is of orders of magnitude faster than both multivariate clustering routines on larger data sizes.

Introduction

Compatibility of hclust1d with stats::hclust

Maintaining compatibility of hclust with stats::hclustwas high on a list of design priorities for hclust1d.

  1. All linkage functions of stats::hclust are supported in hclust1d, too.

  2. Input to stats::hclust should be dist S3 class structure as produced by stats::dist function and the same input is accepted in hclust1d (with a distance argument explicitly set to TRUE).

  3. There are three atypical linkages in stats::hclust. Namely, stats::hclust requires that the squared dist structure is provided for ward.D, centroid and median linkage functions. This is implicit. The same input is accepted in hclust1d (with distance and squared arguments explicitly set to TRUE).

  4. The object returned from hclust1d call is the same S3 class as the result of stats::hclust, namely hclust S3 class.

  5. The heights returned from hclust1d are calculated the same, as in stats::hclust:

    1. squared respective distances are returned for ward.D, centroid and median linkage functions,
    2. unsquared respective distances are returned for the remaining linkage functions.

Choosing a linkage function in hclust1d

The list of all linkage functions supported in hclust1d is available by calling:

supported_methods()
#> [1] "complete"    "average"     "centroid"    "true_median" "median"     
#> [6] "mcquitty"    "ward.D"      "ward.D2"     "single"

The in-depth description of the linkage functions in hclust1d, together with the inter-cluster distance metric definition used in case of each linkage function (and returned as the merging height) can be found in our getting started vignette.

The choice of a linkage function is the same in hclust1d as in stats::hclust, i.e. by specifying a method argument and passing the name of a linkage function into hclust1d as a character string.

To provide an example, the following two calls execute average linkage hierarchical clustering on distances computed for a set of 1D points, by passing method = "average" argument to relevant calls:

points <- rnorm(10)

res <- stats::hclust(stats::dist(points), method = "average")
res <- hclust1d(stats::dist(points), method = "average", distance = TRUE)

Choosing a distance metric in hclust1d

The user of stats::hclust and of hclust1d can select a number of distance metrics when building distance-based input with stats::dist, by selecting an appropriate name of a metric and passing it as a method argument to stats::dist as a character string. Not all of them are supported in hclust1d. The list of distance metrics supported in hclust1d is available by calling:

supported_dist.methods()
#> [1] "euclidean" "maximum"   "manhattan" "minkowski"

The trick here is that for 1D points euclidean, maximum, manhattan and minkowski distances are equivalent.

To provide an example, the following two calls execute average linkage hierarchical clustering on distances computed by minkowski \(L_3\) norm for a set of 1D points, by passing method = "minkowski" and p = 3 arguments to relevant stats::dist calls:

points <- rnorm(10)

res <- stats::hclust(stats::dist(points, method = "minkowski", p=3), method="average")
res <- hclust1d(stats::dist(points, method = "minkowski", p=3), method="average", distance = TRUE)

Exceptions

We don’t support members argument in hclust1d.

Replacing stats::hclust in case of ward.D, centroid or median linkage functions

This section DOES NOT apply to ward.D2 linkage function, despite the similarity in its name.

For those three linkage functions (ward.D, centroid or median) the default hierarchical clustering routine stats::hclust requires squared distance structure calculated on original points. Consequently, as can be seen from the sections above, to replace a stats::hclust call with hclust1d for 1D data, one needs to replace any call to

res <- stats::hclust(squared_d, method = linkage_function_name, members = NULL)

by a call to

res <- hclust1d(squared_d, method = linkage_function_name, distance = TRUE, square = TRUE)

Somewhere in the code above this line, squared_d has been computed by a call to stats::dist from a vector of 1D points and subsequently squaring the stats::dist result.

Somewhere below in the code res gets analyzed, but it is OK, because the results of both calls are compatible.

Optional changes to the surrounding code

Replacing stats::hclust in case of all other linkage functions, besides ward.D, centroid or median

This section applies to, among others, ward.D2 linkage function.

For those remaining linkage functions stats::hclust accepts regular (unsquared) distance structure calculated on original points. Consequently, as can be seen from the sections above, to replace a stats::hclust call with hclust1d for 1D data, one needs to replace any call to

res <- stats::hclust(d, method = linkage_function_name, members = NULL)

by a call to

res <- hclust1d(d, method = linkage_function_name, distance = TRUE)

Somewhere in the code above this line, d has been computed by a call to stats::dist from a vector of 1D points.

Somewhere below in the code res gets analyzed, but it is OK, because the results of both calls are compatible.

Optional changes to the surrounding code

If the programmer has access to the original points (let’s denote this variable points), the computation of d can be removed (provided it is not used for other purpose) and a call to

res <- stats::hclust(d, method = linkage_function_name, members = NULL)

can be replaced by a call to

res <- hclust1d(points, method = linkage_function_name)