Bayesian Poisson Mixtures with the Telescoping Sampler

Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, Bettina Grün

Introduction

In this vignette we fit a Bayesian Poisson mixture with a prior on the number of components \(K\) to a univariate data set of counts, the eye data. We use the prior specification and the telescoping sampler for performing MCMC sampling as proposed in Frühwirth-Schnatter, Malsiner-Walli, and Grün (2021).

First, we load the package.

library("telescope")

The eye data set

The eye data set was previously analyzed in Escobar and West (1998) and Frühwirth-Schnatter and Malsiner-Walli (2019). We directly insert the values into R.

y <- c(34, 24, 22, 17, 17, 15, 15, 14, 12, 12, 11, 11, 10, 10, 9, 9,
       8, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0)
N <- length(y)

The data set is visualized using a barplot.

barplot(table(y), col = "lightgray", xlab = "count", ylab = "Frequency")   

Model specification

For univariate observations \(y_1,\ldots,y_N\) the following model with hierarchical prior structure is assumed:

\[\begin{aligned} y_i \sim \sum_{k=1}^K \eta_k Pois(\mu_k)&\\ K \sim p(K)&\\ \boldsymbol{\eta} \sim Dir(e_0)& \qquad \text{with } e_0 \text{ fixed or } e_0\sim p(e_0) \text { or } e_0=\frac{\alpha}{K}, \text{ with } \alpha \text{ fixed or } \alpha \sim p(\alpha)\\ \mu_k \sim \mathcal{G}(a_0,b_0)& \qquad \text{with }E(\mu_k) = a_0/b_0,\\ b_0 \sim \mathcal{G}(h_0,H_0)& \qquad \text{with }E(b_0) = h_0/H_0. \end{aligned}\]

Specification of the simulation and prior parameters

For MCMC sampling we need to specify Mmax, the maximum number of iterations, thin, the thinning imposed to reduce auto-correlation in the chain by only recording every thined observation, and burnin, the number of burn-in iterations not recorded.

Mmax <- 20000
thin <- 1
burnin <- 1000

The specifications of Mmax and thin imply M, the number of recorded observations.

M <- Mmax/thin

We specify with Kmax the maximum number of components possible during sampling. Kinit denotes the initial number of filled components.

Kmax <- 50  
Kinit <- 10

We fit a dynamic specification on the weights.

G <- "MixDynamic"

For a static specific one would need to use "MixStatic".

For the dynamic setup, we specify the the gamma prior \(\mathcal{G}(1,2)\) on alpha, with \(E(\alpha)=1/2\).

priorOnAlpha <- priorOnAlpha_spec("gam_1_2")

We need to select the prior on K. We use the prior \(K-1 \sim BNB(1, 4, 3)\) as suggested in Frühwirth-Schnatter, Malsiner-Walli, and Grün (2021) for a model-based clustering context.

priorOnK <- priorOnK_spec("BNB_143")

Now, we specify the hyperparameters of the prior on the component-specific rate \(\mu_k\).

a0 <- 0.1 
h0 <- 0.5 
b0 <- a0/mean(y) 
H0 <- h0/b0

We need initial parameters and an initial classification \(S_0\) of the observations.

set.seed(1234) 
cl_y <- kmeans(y, centers = Kinit, nstart = 30)
S_0 <- cl_y$cluster
mu_0 <- t(cl_y$centers)
eta_0 <- rep(1/Kinit, Kinit)

MCMC sampling

Using this prior specification as well as initialization and MCMC settings, we draw samples from the posterior using the telescoping sampler.

The first argument of the sampling function is the data followed by the initial partition and the initial parameter values for component-specific rate and weights. The next argument corresponds to the hyperparameter of the prior setup (a0, b0, h0, H0). Then the setting for the MCMC sampling are specified using M, burnin, thin and Kmax. Finally the prior specification for the weights and the prior on the number of components are given (G, priorOnK, priorOnAlpha).

estGibbs <- samplePoisMixture(
    y, S_0, mu_0, eta_0, 
    a0, b0, h0, H0,
    M, burnin, thin, Kmax, 
    G, priorOnK, priorOnAlpha)

The sampling function returns a named list where the sampled parameters and latent variables are contained. The list includes the component rates Mu, the weights Eta, the assignments S, the number of observations Nk assigned to components, the number of components K, the number of filled components Kplus, parameter values corresponding to the mode of the nonnormalized posterior nonnormpost_mode, the acceptance rate in the MH step when sampling \(\alpha\) or \(e_0\), \(\alpha\) and \(e_0\). These values can be extracted for post-processing.

Mu <- estGibbs$Mu
Eta <- estGibbs$Eta
S <- estGibbs$S
Nk <- estGibbs$Nk
K <- estGibbs$K
Kplus <- estGibbs$Kplus
nonnormpost_mode <- estGibbs$nonnormpost_mode
acc <- estGibbs$acc
e0 <- estGibbs$e0
alpha <- estGibbs$alpha 

Convergence diagnostics of the run

We inspect the acceptance rate when sampling \(\alpha\) and the trace plot of the sampled \(\alpha\):

sum(acc)/M
## [1] 0.35265
plot(1:length(alpha), alpha,
     type = "l", ylim = c(0, max(alpha)),
     xlab = "iterations", ylab = expression(alpha))

In addition a histogram is also created.

hist(alpha, freq = FALSE, breaks = 50)

We also characterize the distribution of \(\alpha\) using location metrics and quantiles:

mean(alpha)
median(alpha)
quantile(alpha, probs = c(0.25, 0.5, 0.75))

The trace plot of the induced hyperparameter \(e_0\) is also visualized:

plot(1:length(e0), e0,
     type = "l", ylim = c(0, max(e0)),
     xlab = "iterations", ylab = expression(e["0"]))

hist(e0, freq = FALSE, breaks = 50)

mean(e0)
## [1] 0.2025446
quantile(e0, probs = c(0.25, 0.5, 0.75))
##       25%       50%       75% 
## 0.1044981 0.1703785 0.2650444

To further assess convergence, we also inspect the trace plots of the number of components \(K\) and the number of filled components \(K_+\).

plot(1:length(K), K, type = "l", ylim = c(0, 30), 
     xlab = "iteration", main = "", ylab = "number",
     lwd = 0.5, col = "grey")
points(1:length(K), Kplus, type = "l", col = "red3",
       lwd = 2, lty = 1)
legend("topright", legend = c("K", expression(K["+"])),
       col = c("grey", "red3"), lwd = 2)

Identification of the mixture model

Clustering the data is not the main purpose for the eye data set, but rather capturing the heterogeneity in the rate parameter through a mixture. Aiming to identify the mixture model by determining a suitable number of data clusters and obtaining an identified model for this number using the clustering procedure in the point process representation fails indicating that the fitted model does not represent a suitable model for clustering.

Step 1: Estimate \(K_+\) and \(K\)

We determine the posterior distribution of the number of filled components \(K_+\) approximated using the telescoping sampler. We visualize the distribution using a barplot.

Kplus <- rowSums(Nk != 0, na.rm = TRUE)  
p_Kplus <- tabulate(Kplus, nbins = max(Kplus))/M  
barplot(p_Kplus/sum(p_Kplus), names = 1:length(p_Kplus), 
        col = "red3", xlab = expression(K["+"]),
        ylab = expression("p(" ~ K["+"] ~ "|" ~ bold(y) ~ ")"))

The distribution of \(K_+\) is also characterized using the 1st and 3rd quartile as well as the median.

quantile(Kplus, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75% 
##   5   6   7

We obtain a point estimate for \(K_+\) by taking the mode and determine the number of MCMC draws where exactly \(K_+\) components were filled.

Kplus_hat <- which.max(p_Kplus)
Kplus_hat
## [1] 5
M0 <- sum(Kplus == Kplus_hat)
M0          
## [1] 5229

We also determine the posterior distribution of the number of components \(K\) directly drawn using the telescoping sampler.

p_K <- tabulate(K, nbins = max(K, na.rm = TRUE))/M
round(p_K[1:20], digits = 2)
##  [1] 0.00 0.00 0.01 0.08 0.12 0.13 0.12 0.10 0.08 0.07 0.05 0.04 0.03 0.03 0.02
## [16] 0.02 0.01 0.01 0.01 0.01
barplot(p_K/sum(p_K), names = 1:length(p_K), xlab = "K", 
        ylab = expression("p(" ~ K ~ "|" ~ bold(y) ~ ")"))

Again the posterior mode can be determined as well as the posterior mean and quantiles of the posterior.

which.max(tabulate(K, nbins = max(K)))
## [1] 6
mean(K)
## [1] 9.52205
quantile(K, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75% 
##   6   8  11

Step 2: Extract those draws which correspond to samples with exactly \(\hat{K}_+\) filled components

First, we select those draws where the number of non-empty groups was exactly \(\hat{K}_+\).

index <- Kplus == Kplus_hat
Nk[is.na(Nk)] <- 0
Nk_Kplus <- (Nk[index, ] > 0)

In the following we extract the cluster-specific rates, the data cluster sizes and the cluster assignments for the draws where exactly \(\hat{K}_+\) components were filled.

Mu_inter <- Mu[index, , , drop = FALSE]
Mu_Kplus <- array(0, dim = c(M0, 1, Kplus_hat)) 
Mu_Kplus[, 1, ] <- Mu_inter[, 1, ][Nk_Kplus]

Eta_inter <- Eta[index, ]
Eta_Kplus <- matrix(Eta_inter[Nk_Kplus], ncol = Kplus_hat)
Eta_Kplus <- sweep(Eta_Kplus, 1, rowSums(Eta_Kplus), "/")

w <- which(index)
S_Kplus <- matrix(0, M0, length(y))
for (i in seq_along(w)) {
    m <- w[i]
    perm_S <- rep(0, Kmax)
    perm_S[Nk[m, ] != 0] <- 1:Kplus_hat
    S_Kplus[i, ] <- perm_S[S[m, ]]
}

Step 3: Clustering and relabeling of the MCMC draws, with \(\hat{K}_+\) being the estimated number of data clusters

For model identification, we cluster the draws of the rates where exactly \(\hat{K}_+\) components were filled in the point process representation using \(k\)-means clustering.

Func_init <- matrix(nonnormpost_mode[[Kplus_hat]]$mean_muk,
                    nrow = Kplus_hat)
identified_Kplus <- identifyMixture(
    Mu_Kplus, Mu_Kplus, Eta_Kplus, S_Kplus, Func_init)
identified_Kplus$non_perm_rate
## [1] 1

The non-permutation rate is 1. This means that in no iteration a unique assignment of draws to components could be made, indicating a strong overfit of the number of clusters.

References

Escobar, Michael D, and Mike West. 1998. “Computing Nonparametric Hierarchical Models.” In Practical Nonparametric and Semiparametric Bayesian Statistics, 1–22. Springer.
Frühwirth-Schnatter, Sylvia, and Gertraud Malsiner-Walli. 2019. “From Here to Infinity: Sparse Finite Versus Dirichlet Process Mixtures in Model-Based Clustering.” Advances in Data Analysis and Classification 13: 33–64.
Frühwirth-Schnatter, Sylvia, Gertraud Malsiner-Walli, and Bettina Grün. 2021. “Generalized Mixtures of Finite Mixtures and Telescoping Sampling.” Bayesian Analysis 16 (4): 1279–1307. https://doi.org/10.1214/21-BA1294.