Computationally efficient joint species distribution modeling of big spatial data

Abstract The ongoing global change and the increased interest in macroecological processes call for the analysis of spatially extensive data on species communities to understand and forecast distributional changes of biodiversity. Recently developed joint species distribution models can deal with numerous species efficiently, while explicitly accounting for spatial structure in the data. However, their applicability is generally limited to relatively small spatial data sets because of their severe computational scaling as the number of spatial locations increases. In this work, we propose a practical alleviation of this scalability constraint for joint species modeling by exploiting two spatial‐statistics techniques that facilitate the analysis of large spatial data sets: Gaussian predictive process and nearest‐neighbor Gaussian process. We devised an efficient Gibbs posterior sampling algorithm for Bayesian model fitting that allows us to analyze community data sets consisting of hundreds of species sampled from up to hundreds of thousands of spatial units. The performance of these methods is demonstrated using an extensive plant data set of 30,955 spatial units as a case study. We provide an implementation of the presented methods as an extension to the hierarchical modeling of species communities framework.


HMSC specification in matrix notation
The HMSC specification, provided in Materials & Methods section of the main text also could be elegantly and rigorously formally written in matrix form using following notation:
Data objects in the model • The ' ( × ' * matrix Y of the recorded species abundances/occurrences 5 67 • The ' ( × ' , matrix X of the covariates : 6; • The ' * × ' . matrix T of the species traits = 7> • The ' * × ' * symmetric positive definite matrix C of the species phylogenetic similarities • The ' ( × ' 2 matrix S of the sites' coordinates A 62

Model parameters
• The ' ( × ' * matrix L of latent variables -67 standing for location parameters of the data distribution • The ' , × ' * matrix Β of the D ;7 of the species responses to the covariates • The ' , × ' * matrix Μ of the F ;7 of the trait-expected species responses to the covariates • The ' , × ' . matrix Γ of the H ;> the impacts of trait values on the expected species response to the covariates • The ' , × ' , matrix V standing for the covariance of responses to covariates across species that could not be attributed to available traits.
• Scalar J standing for the strength of the impact of phytogenic similarity to similarity in responses to covariates. • The ' 0 × ' * matrix Λ is the matrix of latent factor loadings L M7 • The ' ( × ' 0 matrix Η of the latent factors N 6M • The ' 0 × 1 vector O of the spatial ranges of latent factors. • The ' ( × ' * matrix Z of latent liabilities Q 67 = -67 + S 67 used for implementation of various types of observational data through data augmentation. For theoretical ground see e.g. Albert and Chib (1993) or Zhou et al. (2012), for HMSC-contexed usage see e.g. Ovaskainen et al (2016a) • The ' * × ' * diagonal matrix Σ of residual variances, with diagonal elements U 7 V

Matrix-vector notation
We denote by vec(•) the operator which stacks consecutive columns of a matrix on top of each other. We denote by small letters in bold font the vectors that are obtained by applying vec(•) to corresponding matrices, so that e.g. ] = vec(Γ), ^= vec(Β), _ = vec(Λ), `= vec(Η), a = vec(L), b = vec(Z). A star in the upperscript indicates that the transpose was applied first to the matrix, so e.g. b * = vec(Z d ) and ` * = vec(Η d ). We denote by 0 f×g the ' × h matrix of ones, by I f the ' × ' identity matrix, by tr(A) the trace of the matrix A, by ⨂ the Kronecker product, and by ∘ the Hadamard product (the entry-wise product).

Distributions
• We denote by o(p, Σ) the multivariate normal (Gaussain) distribution with mean p and covariance matrix Σ.
• We denote by W(V, s) the Wishart distribution with scale matrix V and degrees of freedom parameter s, and by W tu (V, s) the inverse Wishart distribution with scale matrix V and degrees of freedom parameter s. Thus if V~W tu (V w , s), then V tu~W (V w tu , s).
• We denote by Γ(x, y) the Gamma distribution with shape x and rate y, which parametrization is common to Bayesian statistics, so that the distribution's mean is x/y.

The Hierarchical Model of Species Communities
We follow the generalized linear modelling paradigm and model that the observations for the )-th species with a statistical distribution { 7 and link function | 7 that are compatible with the type of observed data for this species. Then }~5 67 = | 7 tu~-67 , ÄxÅ~5 67 = ÄxÅ Ç É~-67 , U 7 , L = XB + ΗΛ, We would like to mention, that while in this study we consider only one level of random factors in the model, which corresponds to level of single observational sites, the proposed model is trivially generalizable to more complex sampling designs with several hierarchical or crossed levels, see e.g. Ovaskainen et al. (2016a).

Role of traits and phylogeny
To improve the performance of the model with sparse data or rare species, as well as to exploit potentially available information on species-specific traits and phylogenetic relationships, we follow Ovaskainen et al. (2017) and impose a multivariate Gaussian conditional prior for the regression coefficients as ^= áD u⋅ , … , D f Ü ⋅ ä d~o (vec(Μ d ), Θ). The matrix Μ ∈ ℝ f Ü ×f â consists of the elements F ;7 that describe the expected response of species ) to covariate +. The expected response of species ) to covariate + is modeled based on this species' traits as F ;7 = ∑ H ;> = 7> f ê >ëu , where = 7> is the value of traitfor species ) (with = 7u = 1 modeling the intercept), and the parameter H ;> measures the effect of traiton the expected response to covariate +. Matrix Θ models the variation of responses among individual species around the traitbased expectation as Θ = áJC + (1 − J)I f â ä ⊗ V, where ⊗ denotes Kronecker's product, the positive-definite matrix V ∈ ℝ f Ü ×f Ü models the dependency between responses to different covariates, the parameter 0 ≤ J ≤ 1 determines the impact of phylogenetic relationships on species responses to the covariates, and the matrix C ∈ ℝ f â ×f â is a phylogenetic correlation matrix, which is assumed to be known prior to the analysis. The model can be applied without trait data by including the intercept as the only species trait, and it can be applied without phylogenetic data by fixing J = 0.

Priors
Here we list the families of the priors that we assign, and which are essential for our sampling algorithm. We also provide the "default" values for those, which correspond to weakly informative prior that is recommended for practical use in case when no or very few additional information is available a priori (applied with normalized input data).
• The prior for those diagonal elements of Σ that are not fixed due to the selected data distribution { 7 is U 7 tV~G a(x 7 , y 7 ). The prior parameters are the scalars x 7 and y 7 . Their default values are x 7 = 1 and y 7 = 0.3. • For Λ ; Λ ; û , we assume the multiplicative gamma process shrinkage prior suggested by Bhattacharya and Dunson (2011), in which the degree of shrinkage increases with the factor number. Thus, , ¶ u~G a(x u , y u ), ¶ >~G a(x V , y V ) for -≥ 2.
• The authors of this method proposed that the parameter s is fixed to £ = 3 and the parameters y u and y V are fixed to y u = y V = 1. The parameters x u and x V are selected by the user to define the level of shrinkage , with x u tuning the basic level and x V > 1 the increase in shrinkage with increasing number of the factor. The default values are x u = x V = 5. In case of many species with sparse data, it may be useful to increase these parameters to increase shrinkage (and thus decrease the estimated number of latent factors), although the original study claims that the method is quite robust to the choice of these hyperparameters.
• If the latent factors are not spatially structured, the prior for Η is `~o ¢0, I f´⨂ I f¨ § . If the latent factors are spatially structured, the prior for Η is `~o(0, K AE ), where K AE is the block-diagonal matrix K AE = diag (K u

Selecting the number of latent factors
Determining the appropriate number of latent factors is fundamental for the HMSC model specified above. While in certain cases, this number can be obtained based on an informed expert opinion guess, generally such information is unavailable before the analysis. Some previous works have suggested methods for a proper Bayesian treatment of ' 0 , and estimating it during the MCMC sampling (Lopes and West 2004). However, such scheme requires changing the domain of the parameter space during sampling and the proposed reversible-jump MCMC does not scale well. Instead, Bhattacharya and Dunson (2011) devised a formulation of infinite factor model, where the number of latent factors is assumed to be infinite, but the latent loadings of higher factors are shrink by assigning multiplicative Gamma process prior. In practice, the authors proposed to use adaptive tuning of ' 0 during the warm-up phase of MCMC scheme, based on discarding the latent factors, which latent loadings do not exceed certain pre-defined threshold. This method was implemented in the HMSC implementation presented in Ovaskainen et al. (2017) and consequently is used in this work.
However, our experience with HMSC model indicates that with insufficient adaptation period or too severe shrinkage the estimated number of latent factors can be suboptimal, while with too mild shrinkage, the estimated number of factors can be unnecessarily high. The later generally does not affect the quality of the model fit due to originally infinite number of factors assumed, but is undesirable in spatial models, since it dramatically increases the computational load. An even more robust, although numerically very costly scheme to estimate the proper number of latent factors is to iteratively run several instances of HMSC model in cross-validation manner, while varying the number of latent factors. Then, the model with best cross-validation predictive performance is likely to contain the number of latent factors that is close to the truly optimal (may depend on the of CV fold splitting strategy).
Approximate Gaussian process priors for latent factors for big spatial data Gaussian predictive process The Gaussian predictive process (GPP) denoted by ó ∫(A), is constructed from the values of the original GP ó(A) defined at h 'knot' locations ª * = {A u * , … , A g * }. Therefore, the value of the GPP at any site A w is given by ó ∫ (A w ) = }(ó(A w )|ó * ) = ae * ø ¿ * ae ¿ * ¿ * tu ó * , where ó * = [ó(A u * ), … , ó(A g * ) ] d denotes the vector of the original GP values at the knot locations ª * , ae ¿ * ¿ * = á+~A 6 √ * , A 6 ƒ * ä 6 ƒ ëu…g 6 √ ëu…g and ae * ø ¿ * = [+(A w , A u * ), … , +(A w , A g * )]. With this definition, it follows that ó ∫ is itself a GP: is non-stationary but factorizable (Banerjee et al. 2008). As mentioned in the main text, our interpretation of the covariance matrix Ω assumes that the marginal prior distributions of the latent factors is standard normal. However, the GPP fails to fulfill that the marginal distribution of latent factors is standard normal since its marginal variance generally decreases with increasing distance from the knot set ª * . To circumvent this misbehavior, we followed Finley et al. (2009) and applied a correction to the marginal prior variance of the GPP, so that it always equals that of the original GP: , and therefore, K … M is a valid correlation matrix.
As far as we are aware, the most similar model to GPP-augmented HMSC was proposed by Ren and Banerjee (2013), where the authors also coupled GPP with factor modelling for analysis of multivariate environmental data under the assumption of Gaussian noise.

Nearest Neighbor Gaussian process
Nearest Neighbor Gaussian Process (NNGP) builds upon a special sparse approximation of the GP precision matrix that is related to the conditional representation of the original GP (Datta et al. 2016b). Given a specified ordering over the set of sites ª = ÀA u , … , A f¨Ã the process ó(A)~GP~0, +(A u , A V ) over this set corresponds to multivariate Gaussian distribution Õ = Àó u , … , ó f¨Ã d = Àó(A u ), … , ó ¢A f¨ § Ã d~o (0, ae ¿¿ ) that can be specified in conditional manner: ó u~o (0, ae uu ),~ó 6 |ó 7 , ) < #~o(F 6 , 1 6 ) ∀# ∈ 2 … ' ( , where A is the strictly lower triangular matrix with elements x 67 and D is the diagonal matrix with elements 1 6 . The Nearest Neighbor approach approximates the conditional distribution ~ó 6 |ó 7 , ) < #~o(F 6 , 1 6 ) by conditioning only on the h preceding closest neighbors of A 6 : ~ó 6 |ó 7 , ) < # ≈ ¢ó 6 |ó 7 , ) ∈ o g (#) §, where o g (#) = á' u 6 , … , ' g ∫ 6 ä is the subset of {1, … , # − 1 } of size h ∫ = min(h, # − 1) that contains the indices of at most h closest neighbors of A 6 . This results in the following adjusted formulas: , Àx with sparse matrix A … , which non-zero elements are obtained via the expressions above. Hence the precision matrix K … tu =~I − A … d D … tu~I − A … is also sparse with ‹~' ( h V non-zero entries. Another crucial property of this matrix is that all its non-zero elements tend to be close to the diagonal (the exact measure of how far the non-zero elements could be away from the diagonal depends on the coordinates of the sites and selected ordering; the practical advice is that the ordering should be selected to minimize it in order to enhance performance). This imposes that the Cholesky decomposition ›› d = K … tu + { fi , where { fi is a diagonal matrix with non-negative elements is also sparse. The enhanced computational efficiency of the NNGP method is achieved due to the decreased cost of sparse matrix operations compared to their dense counterparts. A detailed review of how the sparsity of NNGP can be harnessed for numerical speed-up of Bayesian inference we is given in Finley et al. (2019).
Recently, Taylor-Rodriguez et al. (2018) proposed a similar blend of NNGP and latent factors to build a 2-stage probabilistic model linking together areal LiDAR data and forest inventory observations. However, the sequential Gibbs updater for latent factors implemented in that work, is principally different from our block implementation that follows the original note on using sparse Cholesky by Datta et al. (2016a).
One practical challenge related to NNGP is that the approximation is non-invariant w.r.t. the selected ordering of the set of locations. Vecchia (1992) and Stein et al. (2004) asserted that similar conditional approximations are non-sensitive to the ordering. (Datta et al. 2016b) conducted a numerical experiment that demonstrated that results are practically invariant to the ordering choice in terms of root mean square predictive errors this choice when the ordering is selected along any direction in the spatial coordinate space. However, the maximum-minimum distance ordering, recently proposed by Guinness (2018) resulted in substantially lower Kullback-Leibler (KL) divergence of approximation, compared to the geographical gradient ordering. Further, in multivariate case there is a possibility that the minimum posterior KL divergence between the original GP-based HMSC model and NNGP-based approximation could be achieved using different orderings for different latent factors. Therefore, to keep the focus on practical multivariate ecological spatial data analysis specifics of our study, we left the ordering choice comparisons to other studies and in the presented experiments always ordered the sites according to their longitude from West to East.

Gibbs MCMC sampling algorithm
We extended Gibbs posterior sampling algorithm for standard HMSC's parameters for the case when latent factors N 6M are assigned with Gaussian predictive process or Nearest Neighbor Gaussian process approximations of their original full Gaussian process prior. We implemented the extended sampling algorithm in the Matlab version of the HMSC package ). Here we only present those steps from the overall sampling scheme that differ from previously published works (Bhattacharya and Dunson 2011, Ovaskainen et al. 2016a, Ovaskainen et al. 2016b, namely the full-conditional updaters for Η and O.

Gaussian predictive process
Full-conditional updater for Η If the latent factors are assigned GPP prior, then the full-conditional distribution for ` follows: However, the direct calculation does not bring any computational advantage compared to using latent factors assigned with GP prior. Instead, ` could be drawn from the following expressions (see Lemma 1 for details on computations)

Nearest Neighbor Gaussian process
Full-conditional updater for Η If the latent factors were assigned NNGP prior, then the full-conditional distribution for ` follows: (`| −)~o~F fl , U fl , U fl tu = K … AE tu + Λ d Σ tu Λ ⊗ ‡ f¨, p fl = U fl vec(·Σ tu Λ) Despite of the precision matrix U fl * tu being sparse, neither its inverse, nor Cholesky decomposition are necessarily sparse.
This effective negates all potential computational benefits due to sparsity in the K … AE tu . So, instead of sampling ` directly, we obtain it as the permutation of ` * , which could be obtained through following formulas (` * | −)~o~p fl * , U fl * , U fl * tu = Ô fl * fl K … AE tu Ô fl * fl d + ‡ f¨⊗ Λ d Σ tu Λ, p fl * = U fl * vec((·Σ tu Λ) d ) Where the Ô fl * fl is the transposition matrix that transforms ` to ` * : ` * = Ô fl * fl`. Now, the U fl * tu matrix has a special structure -if considering it as a block matrix with ' ( × ' ( blocks of size ' 0 × ' 0 , all of its non-zero blocks are located at the same places as the non-zero elements of ~K … M . So, the non-zero elements of U fl * tu are in the proximity of the diagonal, which allows for a sparse Cholesky factorization U fl * tu = › fl * › fl * d . Exact number of non-zero elements depends on configuration of sites and ordering, but it can be shown that for sites located at the vertices of a uniform square grid the number of non-zero elements would be of order ‹ h' 0 V ' ( The associated computational cost is mainly due to the sparse Cholesky decomposition and double left division of a ' ( ' 0length vector to a ' ( ' 0 × ' ( ' 0 sparse triangular matrix. In our Matlab implementation of this updater we use the Matlab's implementation of Cholesky factorization -function chol(). This function also provides an option to internally perform approximate minimum degree (AMD) permutation in order to get a sparser Cholesky factor. Unfortunately, we cannot report the exact algorithm that is used there, since is not publicly unclosed. We propose to keep this option on, since in our numerical experiments it allowed for an approximately 10-15% additional speed-up for the updater, although it is likely that for certain site configurations it would only generate unnecessary minor overheat (e.g. when all sites are on a single straight line). As a side remark, we would like to mention that applying this function directly to U fl tu does not produce any reasonably sparse Cholesky factor due to heuristic nature of AMD algorithm, hence the block-ordered permutation to U fl * tu is indeed essential.

Full-conditional updater for O
Conditioning on Η, spatial range parameters O M are independent for ℎ = 1 … ' 0 and can be sampled one by one from the prior values O ñ proportional to their conditional posterior probabilities. We follow the sampling scheme presented in Ovaskainen et al. (2016b), but exploit the special structure of NNGP-induced covariance matrices K … M .
For each latent factor this scheme requires ' ¥ µ calculations of the quadratic form N ⋅M d~K … M tu N ⋅M , and additionally ' ¥ µ calculations of ÍK … M Í that are shared among the latent factors. However, in case of large ' ( direct sampling from this distribution is problematic due to the need to invert and decompose the large dense matrix U fl tu . Instead the special form of " can be exploited for enhanced performance. First using the Woodbury identity " tu = (" uV " VV " Vu + {) tu = { tu − { tu " uV (" VV + " Vu { tu " uV ) tu " Vu { tu Then denoting by Ë = " VV + " Vu { tu " uV , and by ‚ = Λ d Σ tu Λ ⊗ ‡ f¨+ { tu , Which can be further expanded with Woodbury identity Matrix Á = Ë − " Vu { tu ‚ tu { tu " uV has the size ' ; ' 0 × ' ; ' 0 , and its inverse Á tu and its Cholesky decomposition › ‰ Â√ could be easily computed once the number of Gaussian predictive process knots is relatively small.
The inverse matrix ‚ tu can be also effectively calculated due to its special structure.
This has the numerical complexity of ‹ Û~m ax~h' 0 V ' ( , h Ì ' 0 Ì flops, so scaling linearly as the number of sites.

Posterior predictive distribution
Once the draws from the posterior of HMSC model parameters have been acquired, they can be used for making predictions at any location, where the values of covariates are known. We denote the desired prediction location by ˙ * , the set of covariates included to the fixed effects component of the HMSC model at that location by ˚ * , and the vector of predicted outcomes by ¸ * . We denote the set of all model parameters, specified in the corresponding section above by ˝. which enables to obtain conditional samples of the joint ` * via sampling from univariate conditional distributions for different latent factors ℎ. This allows to retain linear asymptotic complexity in predictive distribution with respect to the number of latent factors ' 0 in the HMSC. Formulas for efficient sampling from univariate conditional variance-corrected GPP and NNGP distributions follow the original strategies (Finley et al. 2009, Datta et al. 2016a).

Details on Australian plants case study
The data originate from the Victorian Biodiversity Atlas (VBA) (https://www.environment.vic.gov.au/biodiversity/victorianbiodiversity-atlas), which is a state database that collaborates with the Atlas of Living Australia (http://www.ala.org.au). The subset of the VBA used in this study involves the occurrences of 1237 herbaceous species, at 30,955 sampling locations within the State of Victoria, Australia ( Fig. 2A), for which presence-absence were recorded. The data were collected in years 1984-2014 on sampling plots of 3900 m 2 , The number of unique survey teams involved in the collection of these data is not known accurately, but is in the order of 200-300. The dataset combines survey data undertaken for a range of purposes the predominant being: 1. Ecosystem inventory, circumscription and mapping 2. Characterizing the habitats of species of management interest 3. Documenting and describing land subject to development or land-use change Consequently, the data is biased towards sampling public lands, typically less suitable for agriculture and peri-urban areas.
We selected four environmental covariates that were considered potentially important to vegetation and plant distribution and were not strongly correlated. These measure: 1. Climatic conditions -Mean maximum temperature in January (the hottest and driest month in south eastern Australia), developed using ANUCLIM (Houlder et al 2000). See Appendix S1: Fig. S1A. 2. Hydrology and landscape position -This a summed and normalized set of 'vertical distance above stream' calculations (Conrad et. al. 2015) for seven different channel networks, each of which satisfy seven separate, monotonically increasing, flow accumulation thresholds (based on catchment size weighted by catchment rainfall). See Appendix S1: Fig. S1B. 3. Soil properties -Here we used the radioelement count of thorium as a general proxy for soil type. Radiometric data is related to soil depth, soil texture and nutrition particularly in surficial landscapes. See Read et al. (2018) and Appendix S1: Fig. S1C. 4. Solar radiation and anisotrophic heating. These data have been derived from the transformation of a digital elevation model to indicate the relative level of terrain illumination when the sun is at 270 degrees (North-West) and 40 degrees above the horizon. We also included available information on 9 species traits as binary indicator variables, describing whether the species (1) is annual or perennial, (2) is pollinated by abiotic or biotic means, (3-4) has propagules that are dispersed by wind, invertebrates, or another agent, (5) forms a seed bank that typically persists for two or more years, and is considered vulnerable to or tolerant of (6) fire, (7) prolonged snow cover, (8) protracted waterlogging, or (9) salinity. These traits were selected from a much larger list of expert-provided traits that potentially govern the species distribution in the studied community. The particular choice of those included to the model was governed by a) amount of collinearity between different available traits and b) availability of trait values for all the studied species.

Details on model convergence
As described in the main text, we fitted all models with 10,000 MCMC steps, out of which we discarded the first 2,000 steps as burn in. We thinned the remaining samples by 10, resulting in 800 posterior draws. To examine the convergence of the MCMC chains, we repeated model fitting 40 times, randomly selecting initial parameter values from the prior distribution.
We assessed the quality of mixing by calculating the effective sample sizes (ESS) and potential scale reduction factors Figure S1. Spatial distribution of environmental covariates included to the model.

East East East
Average temperature in January Normalized vertical water proximity Radioelement count of thorium (PSRF) for the model parameters (Gelman and Rubin 1992). However, as the prior of Bhattacharya and Dunson (2011) leads to non-identifiable parameters H and Λ, and the number of unique entries in Ω = Λ û Λ is extremely high for model fitted to all data, we restricted our calculations to the B parameters and a randomly chosen 40 ✕ 40 symmetric submatrix of association matrix ˜= cov2cor~Ω + ‡ f â , as those are fundamental in the ecological applications and generally representative for overall model mixing. Hence, for each variant of fitted model we stacked the 40 chains, each containing 800 draws of B and ˜ parameters, and calculated the ESS and PSRF with effectiveSize() and gelman.diag() functions implemented in coda R package. To reduce still enormously high number of quantities, we calculated a single quantity for each model variant -the 5% quantile of the ESS and the 95% quantile of the PSRF point estimates among the D ;7 or ˜7 7 ≤ parameters in this model variant. The resulted values are summarized in the Appendix S1: Fig. S2, Fig. S3, Fig.  S4, Fig. S5.
As is clearly seen from the visualization of the effective sample sizes and potential scale reduction factors, the algorithm with selected number of steps and thinning demonstrated generally adequate mixing for numbers of training sites up to 1600 and got significantly worse when the number of sites further increased. The drop is especially pronounced for the elements of association matrix ˜7 7 ≤ . On the other hand, the number of species exhibited opposite impact -with higher number of species the effective sample sizes were generally higher than with low number of species. Perhaps somewhat unexpectedly, the mixing in model fits with high number of training sites got insufficient also for non-spatial model. These results suggest that to keep the results of Bayesian analysis properly valid (especially concerning uncertainty quantifications), the number of samples or thinning for models fitted to large data must be increased, which raise the right parts of the expected computation times that are shown in the Fig. 1A-F.
Despite of the fact that the bigger model indicated insufficient mixing, in terms of predictive performance our results were quite stable between different chains and qualitatively repeated the results averaged over the chains. This provides an empirical confirmation that the predictive performance is expected to behave as reported in Fig. 1 and would not considerably depend on initial starting position of MCMC chains. Appendix S1: Fig. S6 and Fig. S7 extend the Fig. 1 GHI and Fig. 1 JKL correspondingly by presenting the predictive measures calculated for each of the independent chains. The visualization of performance distribution for each model variant is constructed using beanplot function with standard smoothing parameters from beanplot R package.
Based on the complications with mixing that we have encountered when fitting models to the larger datasets, we would like to summarize that the block Gibbs sampling algorithm, presented by Ovaskainen et al. (2017) is insufficiently efficient for modelling big datasets, at least when the outcomes are binary. One potential bottleneck is due to known inefficiencies of the data augmentation of Albert and Chib (1993) that is used for dealing with binary data, which leads to slow mixing for unbalanced outcomes, which probability is close to zero or one. The fundamental problem comes from the great mismatch of marginal posterior and the conditional distribution given the augmented data. Some recent work has been conducted aiming to efficiently deal with this issue ), specifically to "widen" the conditional distribution at the cost of introducing a rejection probability. However, its utility has been demonstrated on a significantly simpler models and transition of those results to HMSC is not devised yet. Another opportunity, which currently seems more promising in our opinion, is to investigate how combination of marginal representation of HSMC's latent liabilities as a Gaussian process could be coupled with approximate methods for dealing with non-Gaussian observations (e.g. Laplace approximation, expectation propagation, variational inference). While the resulted GP would be of dimension ' ( ' * , which in case of our biggest training dataset is over 1.6 ⋅ 10 0 and prohibits full GP fitting approach, the special structure of GP's covariance matrix induced by HMSC structure provides opportunities for much more efficient solutions. For the GPP model, such method would have the same flavor as the approximate inference methods for Gaussian process regression/classification (Hensman et al. 2015). For the NNGP model, such approaches seem to be conceptually more challenging as the matrix sparsity there is tricky to utilize in combination with other marginalized model components. Thus, we believe that the first steps in that direction should be in developing an extension of the collapsed NNGP method, proposed by Finley et al. (2019), which extension would additionally marginalizing out the fixed effects and be applicable to non-Gaussian residuals.
To sum up, we would like to mark this research question as a potential area of interest for statisticians and machine learner researchers, specializing in developing methods for multivariate Bayesian data analysis. Figure S2. Visualization of effective sample sizes for covariate responses based on multiple fitted independent chains for each of model variants. The numbers depict the 5% quantiles among the ! "# parameters. Figure S3. Visualization of effective sample sizes for association matrices based on multiple fitted independent chains for each of model variants. The numbers depict the 5% quantiles among the off-diagonal $ ## % parameters.

Full
14 Figure S4. Visualization of PSRF for covariate responses based on multiple fitted independent chains for each of model variants. The numbers depict the 95% quantiles among the ! "# parameters. Figure S5. Visualization of PSRF for association matrices based on multiple fitted independent chains for each of model variants. The numbers depict the 95% quantiles among the off-diagonal $ ## % parameters. Figure S6. Extension of Fig. 1EFG