This tutorial shows how euclidean nearest neighbor distances in the geographic space or feature space can be calculated and visualized using CAST. This type of visualization allows to assess whether training data feature a representative coverage of the prediction area and if cross-validation (CV) folds (or independent test data) are adequately chosen to be representative for the prediction locations.
See e.g. Meyer and Pebesma (2022) and Milà et al. (2022) for further discussion on this topic.
As example data, we use two different sets of global virtual reference data: One is a spatial random sample and in the second example, reference data are clustered in geographic space (see Meyer and Pebesma (2022) for more discussions on this).
library(CAST) library(caret) library(terra) library(sf) library(rnaturalearth) library(ggplot2)
Here we can define some parameters to run the example with different settings
<- 10 # random realization seed <- 300 # how many samples will be used? samplesize <- 20 #For clustered samples: How many clusters? nparents <- 500000 # For clustered samples: What is the radius of a cluster?radius
The prediction area is the entire global land area, i.e. we could imagine a prediction task where we aim at making global predictions based on the set of reference data.
<- st_crs("+proj=eqearth") ee <- ne_countries(returnclass = "sf") co <- st_transform(co, ee)co.ee
Then, we simulate the random sample and visualize the data on the entire global prediction area.
sf_use_s2(FALSE) set.seed(seed) <- st_sample(co.ee, samplesize) pts_random ### See points on the map: ggplot() + geom_sf(data = co.ee, fill="#00BFC4",col="#00BFC4") + geom_sf(data = pts_random, color = "#F8766D",size=0.5, shape=3) + guides(fill = FALSE, col = FALSE) + labs(x = NULL, y = NULL)
As second data set we use a clustered design of the same size.
set.seed(seed) sf_use_s2(FALSE) <- clustered_sample(co.ee, samplesize, nparents, radius) pts_clustered ggplot() + geom_sf(data = co.ee, fill="#00BFC4",col="#00BFC4") + geom_sf(data = pts_clustered, color = "#F8766D",size=0.5, shape=3) + guides(fill = FALSE, col = FALSE) + labs(x = NULL, y = NULL)
Then we can plot the distributions of the spatial distances of reference data to their nearest neighbor (“sample-to-sample”) with the distribution of distances from all points of the global land surface to the nearest reference data point (“sample-to-prediction”). Note that samples of prediction locations are used to calculate the sample-to-prediction nearest neighbor distances. Since we’re using a global case study here, throughout this tutorial we use sampling=Fibonacci to draw prediction locations with constant point density on the sphere.
<- plot_geodist(pts_random,co.ee, dist_random sampling="Fibonacci", unit="km", showPlot = FALSE) <- plot_geodist(pts_clustered,co.ee, dist_clstr sampling="Fibonacci", unit="km", showPlot = FALSE) $plot+scale_x_log10(labels=round)+ggtitle("Randomly distributed reference data")dist_random
$plot+scale_x_log10(labels=round)+ggtitle("Clustered reference data")dist_clstr
Note that for the random data set the nearest neighbor distance distribution of the training data is quasi identical to the nearest neighbor distance distribution of the prediction area. In comparison, the second data set has the same number of training data but these are heavily clustered in geographic space. We therefore see that the nearest neighbor distances within the reference data is rather small. Prediction locations, however, are on average much further away.
Let’s use the clustered data set to show how the distribution of spatial nearest neighbor distances during cross-validation can be visualized as well. Therefore, we first use the “default” way of a random 10-fold cross validation where we randomly split the reference data into training and test (see Meyer et al., 2018 and 2019 to see why this might not be a good idea).
<- plot_geodist(pts_clustered,co.ee, dist_clstr sampling="Fibonacci", cvfolds= randomfolds, unit="km", showPlot=FALSE) $plot+scale_x_log10(labels=round)dist_clstr
Obviously the CV folds are not representative for the prediction locations (at least not in terms of distance to a nearest training data point). I.e. when these folds are used for performance assessment of a model, we can expect overly optimistic estimates because we only validate predictions in close proximity to the reference data.
This, however, should not be the case but the CV performance should be regarded as representative for the prediction task. Therefore, we use a spatial CV instead. Here, we use a leave-cluster-out CV, which means that in each iteration, one of the spatial clusters is held back.
<- plot_geodist(pts_clustered,co.ee, dist_clstr sampling="Fibonacci", cvfolds= spatialfolds$indexOut, unit="km", showPlot=FALSE) $plot+scale_x_log10(labels=round)dist_clstr
See that this fits the nearest neighbor distribution of the prediction area much better. Note that plot_geodist also allows inspecting independent test data instead of cross validation folds. See ?plot_geodist.
Recently, Wadoux et al. (2021) published a paper with the title “Spatial cross-validation is not the right way to evaluate map accuracy” where they state that “spatial cross-validation strategies resulted in a grossly pessimistic map accuracy assessment”. Why do they come to this conclusion? The reference data they used in their study where either regularly, random or comparably mildly clustered in geographic space, but they applied spatial CV strategies that held large spatial units back during CV. Here we can see what happens when we apply spatial CV to randomly distributed reference data.
# create a spatial CV for the randomly distributed data. Here: # "leave region-out-CV" sf_use_s2(FALSE) <- st_join(st_as_sf(pts_random),co.ee) pts_random_co ggplot() + geom_sf(data = co.ee, fill="#00BFC4",col="#00BFC4") + geom_sf(data = pts_random_co, aes(color=subregion),size=0.5, shape=3) + scale_color_manual(values=rainbow(length(unique(pts_random_co$subregion))))+ guides(fill = FALSE, col = FALSE) + labs(x = NULL, y = NULL)+ ggtitle("spatial fold membership by color")
<- CreateSpacetimeFolds(pts_random_co,spacevar = "subregion", spfolds_rand k=length(unique(pts_random_co$subregion))) <- plot_geodist(pts_random_co,co.ee, dist_rand_sp sampling="Fibonacci", cvfolds= spfolds_rand$indexOut, unit="km", showPlot=FALSE) $plot+scale_x_log10(labels=round)dist_rand_sp
We see that the nearest neighbor distances during cross-validation don’t match the nearest neighbor distances during prediction. But compared to the section above, this time the cross-validation folds are too far away from reference data. Naturally we would end up with overly pessimistic performance estimates because we make prediction situations during cross-validation harder, compared to what is required during model application to the entire area of interest (here global). The spatial CV chosen here is therefore not suitable for this prediction task, because prediction situations created during CV do not resemble what is encountered during prediction.
A good way to approximate the geographical prediction distances during the CV is to use Nearest Neighbour Distance Matching (NNDM) CV (see Milà et al., 2022 for more details). NNDM CV is a variation of LOO CV in which the empirical distribution function of nearest neighbour distances found during prediction is matched during the CV process.
<- nndm(pts_clustered, modeldomain=co.ee, samplesize = 2000) nndmfolds_clstr <- plot_geodist(pts_clustered,co.ee, dist_clstr sampling = "Fibonacci", cvfolds = nndmfolds_clstr$indx_test, cvtrain = nndmfolds_clstr$indx_train, unit="km", showPlot = FALSE) $plot+scale_x_log10(labels=round)dist_clstr
The NNDM CV-distance distribution matches the sample-to-prediction distribution very well. What happens if we use NNDM CV for the randomly-distributed sampling points instead?
<- nndm(pts_random_co, modeldomain=co.ee, samplesize = 2000) nndmfolds_rand <- plot_geodist(pts_random_co,co.ee, dist_rand sampling = "Fibonacci", cvfolds = nndmfolds_rand$indx_test, cvtrain = nndmfolds_rand$indx_train, unit="km", showPlot = FALSE) $plot+scale_x_log10(labels=round)dist_rand
The NNDM CV-distance still matches the sample-to-prediction distance function.
So far we compared nearest neighbor distances in geographic space. We can also do so in feature space. Therefore, a set of bioclimatic variables are used (https://www.worldclim.org) as features (i.e. predictors) in this virtual prediction task.
<- rast(system.file("extdata","bioclim_global.grd",package="CAST")) predictors_global plot(predictors_global)
Then we visualize nearest neighbor feature space distances under consideration of cross-validation.
# use random CV: <- plot_geodist(pts_clustered,predictors_global, dist_clstr_rCV type = "feature", sampling="Fibonacci", cvfolds = randomfolds, showPlot=FALSE) # use spatial CV: <- plot_geodist(pts_clustered,predictors_global, dist_clstr_sCV type = "feature", sampling="Fibonacci", cvfolds = spatialfolds$indexOut, showPlot=FALSE) # Plot results: $plot+scale_x_log10()+ggtitle("Clustered reference data and random CV")dist_clstr_rCV
$plot+scale_x_log10()+ggtitle("Clustered reference data and spatial CV")dist_clstr_sCV
With regard to the chosen predictor variables we see that again the nearest neighbor distance of the clustered training data is rather small, compared to what is required during prediction. Again the random CV is not representative for the prediction locations while spatial CV is doing a better job.