Statistically reinforced machine learning for nonlinear patterns and variable interactions

. Most statistical models assume linearity and few variable interactions, even though real-world ecological patterns often result from nonlinear and highly interactive processes. We here introduce a set of novel empirical modeling techniques which can address this mismatch: statistically reinforced machine learning. We demonstrate the behaviors of three techniques (conditional inference tree, model-based tree, and permutation-based random forest) by analyzing an arti ﬁ cially generated example dataset that contains patterns based on nonlinearity and variable interactions. The results show the potential of statistically reinforced machine learning algorithms to detect nonlinear relationships and higher-order interactions. Estimation reliability for any technique, however, depended on sample size. The applications of statistically reinforced machine learning approaches would be particularly bene ﬁ cial for investigating (1) novel patterns for which shapes cannot be assumed a priori, (2) higher-order interactions which are often overlooked in parametric statistics, (3) context dependency where patterns change depending on other conditions, (4) signi ﬁ cance and effect sizes of variables while taking nonlinearity and variable interactions into account, and (5) a hypothesis using parametric statistics after identifying patterns using statistically reinforced machine learning techniques.


INTRODUCTION
Ecological patterns structured by nonlinear and interactive processes are ubiquitously found at any ecological scales from individual to ecosystem in terrestrial, freshwater, and marine systems (e.g., Dodds et al. 2010, van Nes et al. 2016, Soranno et al. 2014, see also examples in Table 1). Nonlinearity and interactions among factors driving an ecological system often cause unexpected changes in the system's state (Peters et al. 2007, Scheffer 2009, Soranno et al. 2014, for example, as seen in alternative stable states (van Nes et al. 2016). Understanding and modeling such nonlinear interaction dynamics inevitably will make ecological systems more predictable (Urban et al. 2016, Mayfield andStouffer 2017).
Yet, analyzing complex patterns with statistical modeling is not so easy. Statistical models per se are capable of analyzing nonlinear relationships and variable interactions. However, the design of a statistical model is dependent heavily on how the user views the system. Model designs directly affect system interpretation (e.g., Gilbert andBennett 2010, Cumming 2016). For practical convenience, the vast majority of statistical models rely on linearity, few variable interactions, and data transformation for normality in ecology. The subjectivity in statistical model design is perhaps one of the major obstacles to advance our understanding of nonlinear interaction dynamics in ecological systems.
Machine learning, in contrast to statistical modeling, explores the structure of the target system without pre-assumption on data (Breiman 2001a, Recknagel 2001. Machine learning algorithms aim to build an empirical model that maximizes predictability. Most of them automatically and thoroughly examine possible nonlinear relationships and higher-order interactions (greater than twoway). They generally explain ecological patterns more accurately than statistical models (Olden et al. 2008, Crisci et al. 2012, Thessen 2016. Applications of machine learning models in ecology have been emerging since a turning point around 2006 (Fig. 1). The two most cited articles applying machine learning in ecology are in the field of biogeography, papers on species

Multidimensionality
The concept that a combinatory effect of multiple forces determines its consequence.
Theoretical ecology Pickett and Cadenasso (2002) Notes: For these examples, partial dependence plots with permutation-based random forest models are suitable for detecting significant nonlinear patterns. Conditional inference and model-based decision tree models are suited for inferring variable interactions. distribution modeling by Elith et al. (2006) and Phillips et al. (2006). Machine learning algorithms have demonstrated high accuracy in predicting ecological patterns, such as for data on species diversity (e.g., Olden et al. 2008) and distributions (e.g., Elith and Leathwick 2009). Machine learning is expected to discover novel patterns which provide an opportunity to speculate about underlying mechanisms in fields where prior knowledge is still minimal and hypotheses have not been clearly developed (Hochachka et al. 2007): Examples include aspects of biodiversity science (Kelling et al. 2009, Rillig et al. 2015b), macrosystem ecology (Heffernan et al. 2014, Levy et al. 2014, and microbial ecology (Baldi and Brunak 2001). Previous studies and reviews on ecological applications of machine learning have been instrumental in spreading an appreciation of machine learning approaches (e.g., Olden et al. 2008, Crisci et al. 2012, Thessen 2016. A new movement that develops algorithms merging the two relevant approaches-statistical modeling and machine learning-has been occurring since the mid-2000s. As there is no widely accepted term to lump such algorithms together, we here call them "statistically reinforced machine learning (SML)." Note that an established similar term, the so-called statistical learning exists, but this refers in a much broader sense to any statistical techniques for understanding data (Hastie et al. 2009, James et al. 2013. The fundamental aims of statistical modeling and machine learning essentially differ from each other as being theorydriven (hypothesis-testing) and data-driven (information-searching), respectively. Statistically reinforced machine learning merges both techniques and assumptions. They, therefore, are highly attractive because of a high potential to assist researchers to test hypotheses with the help of artificial intelligence.
Here, we introduce the concept of SML and some of the well-established algorithms that fit to the concept: conditional inference tree, modelbased tree, and permutation-based random forest. They share many features because the latter two algorithms were developed based on the first one. We show behaviors of these algorithms together with a linear regression by analyzing an artificially generated dataset. We also test whether the performances of the algorithms are sensitive to sample size. It is important to bear in mind that we do not intend to rank the performances of these three algorithms because they were developed for different aims with different benefits. In particular, random forest algorithms are an advanced form of single tree algorithms to better solve model over-fitting and therefore should perform more accurately than the others.

Statistically reinforced machine learning
We define "statistically reinforced machine learning (SML)" as a set of machine learning algorithms which take nonlinear associations and higher-order interactions into account automatically, while testing statistical significance and thereby conducting variable selection. Technically, SML can include semi-parametric modeling approaches that require users to specify both a finite-dimensional vector of parameters and an infinite-dimensional vector function. Practical comparisons of representative features of statistical modeling, SML, and machine learning are summarized in Table 2.
The need for reinforcement of machine learning algorithms by statistics has long been recognized (e.g., Mingers 1987, Olden andJackson 2002). This realization has changed the fundamental premise of machine learning from the idea that all information can be valuable into the idea that variable Fig. 1. Increasing applications of machine learning in ecology during the last two decades. Data were obtained by searching for "machine learning" as topic in "ecology" category in the Web of Science (Clarivate Analytics) and for "machine learning" and "ecology" in Google Scholar in October 2016. selection based on statistical significance can improve prediction accuracy (Hothorn et al. 2006b, Hapfelmeier andUlm 2013). Conditional inference tree, developed by Hothorn et al. (2006b), is seen as the cornerstone work for SML.
Conditional inference tree. -Conditional inference tree is a decision tree model (also known as classification and regression trees; Breiman et al. 1984) that is among the most frequently used machine learning algorithms. Decision tree models explain the variance in a response variable by recursively splitting the data into more homogeneous groups using a combination of predictors. At each split, the data are divided into two groups according to a threshold value of one of the predictors so that the variance after the split decreases from the variance before the split. The splitting procedure is repeatedly applied for each of the split groups until it achieves a criterion: One of the most used criteria is the moment when the model maximizes predictive performance while conducting the fewest splits (Breiman et al. 1984).
Conditional inference trees (ctree; Hothorn et al. 2006a, b) have incorporated a statistical inference framework to split the data based on statistical tests instead of maximizing prediction power. This improvement solved two fundamental problems of the traditional tree models (Hothorn et al. 2006b). First, this model decides whether the data should be divided at each split or not based on statistical significance to avoid model over-fitting (Table 3), while traditional decision tree models can often select predictors for splitting even though their effects are not statistically significant and thus can over-fit. Statistical significance of all predictors is tested independently using different statistical tests according to the combination of types of predictor and response variables (Table 3). Second, ctree model selects a predictor at each split without selection bias, while traditional decision tree models tend to preferentially select predictors in the order binary < categorical < numeric as splitting opportunities simply increase in this order (Hothorn et al. 2006b). Traditional decision tree models do not possess these important features, and these caveats have not been acknowledged in ecological studies since decision tree models were first introduced by De'ath and Fabricius (2000). The detailed procedure of ctree modeling is available in Appendix S1. These improvements, to avoid model over-fitting and variable selection bias, are also kept in the following algorithms: model-based tree and permutation-based random forest.
Model-based tree.-Another derived form of decision tree models, the model-based trees (mobtree; Note: Note that exceptional cases exist because the features depend also on data and model structure. Table 3. Statistical tests and test statistics used in conditional inference tree models (Hothorn et al. 2006a).

Response variable Numerical Categorical
Predictor Numerical Pearson's correlation (t) Notes: Pearson correlation tests whether the correlation coefficient is equal to 0 based on a t-distribution, Kruskal-Wallis test is to test whether samples that belong to different categories originate from the same distribution based on a chi-square distribution, and Cochran-Mantel-Haenszel test is to test whether the relative proportions of one variable are independent of the other variables based on a chi-square distribution. Zeileis et al. 2008), couples the features of parametric statistical models such as generalized linear models and decision tree models. Mobtree first requires explicitly specifying how predictorresponse relationships should be modeled parametrically (e.g., linear model) using a few predictors. It then automatically searches for other important predictors, which can significantly influence the parameter values of the relationships (more details in Appendix S1) based on the M-fluctuation test (Zeileis and Hornik 2007). This method is particularly useful to test a hypothesis that patterns in a relationship between a response variable and some of the predictors are altered by other predictors. For instance, Campetella et al. (2011) found out that a plant trait-environmental relationship (more specifically, the parameters of a linear regression between specific leaf area and inclination of field site) varies according to the age of forest succession at a field site.
Permutation-based random forest.-Random forest models (Breiman 2001b) construct a predictive model and estimate the relative importance of predictors; they are acknowledged as one of the most accurate machine learning algorithms to date (Douglas et al. 2011, Crisci et al. 2012). This algorithm first generates a large number of decision tree models that use diverse combinations of predictors and thresholds to explain datasets which are generated for the individual trees by sampling from the original data with replacement. Then, it takes an overall average of these tree models' outputs as a prediction (so-called ensemble/consensus modeling). The relative importance of predictors is usually measured by evaluating how much each predictor contributes to increasing model accuracy.
Recently proposed SML random forest algorithms can evaluate statistical significance of the predictors based on a permutation approach. The permutation approach generates a large number of random forest models to obtain the probability distributions of the relative importance measures of the predictors. Then, it quantifies how rarely the original relative importance measure of each predictor is obtained by chance (more details in Appendix S1). This permutation technique demands high computational performance unachievable using a typical laptop about a decade ago. Although we do not show this here, building a random forest model using only statistically significant predictors chosen with the permutation-based random forest approach generally improves accuracy (Hapfelmeier and Ulm 2013). This is another advantage of this method. An earlier work to employ a permutation approach to another machine learning algorithm within the context of SML is seen in Olden and Jackson (2002).

Generating an artificial dataset
As one of our primary aims is to assess reliability of the SML algorithms, we analyze an artificially generated dataset for which we know which and how predictors are correlated to the response variable, instead of studying an actual ecological dataset. We herein assume an ecological pattern that is structured by multiple influences, in which factors nonlinearly affect another factor and where there are factor interactions (in relevance to Table 2). Mimicking a field monitoring record, we also add considerable random noise to blur the associations. Samples are independent and identically distributed without spatial and temporal autocorrelation. The pattern can be placed in any ecological context including species distribution and abundance, species richness, community composition, and ecosystem function after de-trending autocorrelation.
Let us imagine this scenario where the goal is to understand primary production: Primary production initially increases after a disturbance and then gradually decreases during the successional period (i.e., unimodal pattern; e.g., Campbell et al. 2004), increases with an increased carbon dioxide concentration only when it is the limiting factor (i.e., positive linearity with a threshold), and such patterns may be context-dependent given a combination of climate and soil classifications (e.g., Cleveland et al. 2011). Note that we are not going to discuss results from this viewpoint. Rather, this is just an example to help readers imagine how the data structure can be considered in an ecological context.
Relevant to the above example of primary production, Y was generated in association with the predictors of x7, x8, and x1 Ã x4 Ã x9 ( Ã interaction; Fig. 2). The predictor x7 structured a unimodal relationship with Y using a cosine curve function having a peak of 5. The predictor x8 affected Y positively linearly with a slope of 0.6 until exceeding a threshold (x8 = 10). The predictor x9 affected Y negatively linearly with a slope of À0.5 when x1 = 1 and x4 6 ¼ "low" (x4 e {low, moderate, high}}) as a three-way interaction.
Finally, random noise following the Gaussian distribution (mean = 0 and SD = 3) was added to Y, which contributed to approximately 50% of the total variance of Y. The remaining 11 predictors were not associated with Y.

Model performance assessment
We tested whether the above-described SML algorithms changed the detectability of statistical significance of the associated predictors and the interactions (x1 Ã x4 Ã x9, x7, and x8) as a function of different sample sizes (n = 100, 200, and 300).
For mobtree, we focused on whether the interactive effect of x1 and x4 on the Y-x9 relationship is detected when an association of x9 and Y is presumed. For permutation-based random forest, we qualitatively assessed whether the designed Fig. 2. Artificially generated data structure (n = 300 and 16 predictors, x1-x16). Only predictors that were designed to associate with the response Y are shown. The modeled relationships between predictors (x7-x9) and the response Y are depicted with the curves and denoted in red color. patterns of x7, x8, and x1 Ã x4 Ã x9 are found, using partial dependence plots. Partial dependence plots are a technique used to visualize specified predictor-response relationships that are empirically modeled (not limited to SML). Partial dependence plots visualize the dependence of response variable on predictors while taking the effects of the other predictors into account (Hastie et al. 2009). This technique is especially useful for visually assessing the shapes of some interesting relationships but not for quantifying the effect sizes (values in y-axis do not indicate the effect size).
We also applied a linear regression model to compare with the SML models. A model includes the unimodal curve as a second-degree polynomial term, the three-way interactions (also the nested two-way interactions), and the other predictors: Y~f (x7 + x7 2 + x1 Ã x4 Ã x9 + x2 + x3 + x5 + ÁÁÁ + x16) + e. A stepwise best-model selection was conducted for both models based on Akaike's information criterion. Akaike's information criterion aims to select the best model in terms of maximizing predictability, which is equivalent to the objective of machine learning.
The R script for the entire process including data generation and analysis is available at github (see https://github.com/masahiroryo/R_ Statistically-reinforced-machine-learning). We run ctree and mobtree models with R package party (Strobl et al. 2007(Strobl et al. , 2008 and permutation-based random forest models with the R script modified from Hapfelmeier et al. (2014) in R Statistical Computing (R Development Core Team 2016). For partial dependence plots, we used the R package mlr (Bischl et al. 2016). Note that we used Bonferroni correction for all the SML models, following the convention of these methods. For simplicity, no data transformation was performed. We structured the random forest algorithm with 300 tree models after confirming this number is sufficient to stabilize the results and with 2000 permutations for estimating statistical significance of predictors.

RESULTS
The number of branches in ctree models increased along with enhancing detectability of the designed three-way interactions (x1 Ã x4 Ã x9 with negative linear x9-Y pattern) by increasing sample sizes of the dataset from 100 to 200 and 300 (Fig. 3). x1 appeared at the first nodes in the models regardless of sample size. x4 appeared under the path of x1 = 0, and x9 appeared under the path of x4 being either high or moderate when n = 200 and 300. The model structures were identical between n = 200 and 300, differing marginally in the threshold values for x9. Significance of the designed associations of unimodal and threshold-linear patterns (x7 and x8, respectively) was not inferred in any sample sizes.
The mobtree models inferred that negative x9-Y relationships were significant only in specific conditions of x1, x4 (as designed interactions), and x7 (designed unimodal pattern) when n = 200 and 300 (Fig. 4). This means that the threeway interactions of x1 Ã x4 Ã x9 were incorporated as a part of the model structure. Similar to ctree models, the model structures were identical between n = 200 and 300, differing marginally in the threshold values for x7. In the case of n = 300, the model inferred the significant negative x9-Y relationship conditional only to x7, which was not designed (i.e., false detection; node 9 in Fig. 4c). When n = 100, the model did not indicate any predictors that influence the parameters of the negative x9-Y relationship (Fig. 4a).
The permutation-based random forest models enhanced detectability of predictors associated with the response variable by increasing sample size (Fig. 5). As in the ctree models, a part of the three-way interaction, x1, was consistently selected regardless of sample size. The significances of another interaction factor (x4) and the unimodal pattern (x7) were detected when n = 200 and 300. The last part of the three-way interaction (x9) was detected only when n = 300. By increasing sample size, these signals became stronger: The values of the relative importance of these predictors increased and became more distinct from the values of the other predictors which stayed near 0.
Partial dependence plots depicted associations of predictors (x7 as unimodal, x8 as linear-threshold, x9 as negative linear conditional to x1 and x4) with the response variable modeled with the permutation-based random forest models (Fig. 6). The unimodal curve of the x7-Y relationship with the peak at x7 = 5 (Fig. 2) was adequately modeled when n = 200 and 300 (Fig. 6b, c). The conditional positive linear x8-Y relationship was  most closely captured when n = 100, although the models did not support the significance for any sample size (Figs. 5, 6). The negative linear x9-Y relationship was most closely resembled by the gradual slope when n = 300 (Fig. 6c), while a sudden shift around x9 = 10 was estimated when n = 100 and 200 (Fig. 6a, b). The three-way interactions of x1 Ã x4 Ã x9 were seemingly suggested when n = 200 and 300 (Fig. 6e, f), while two-way interactions of x1 and x9 were inferred when n = 100 (Fig. 6d).
The SML models selected only a subset of predictors which were designed to associate with the response variable (x1, x4, x7, x8, and x9), while the linear regression model selected predictors that were not designed to influence the dependent variable when n = 100 and 200 (i.e., Type I error). The linear regression models did not infer the threeway interactions (x1 Ã x4 Ã x9) for any sample size, although two-way interactions within parts of the three-way interactions (i.e., x1 Ã x4 and x4 Ã x9) were found only when n = 200. The second-degree polynomial curve of x7 was consistently significant. X8 was significant when n = 300.

DISCUSSION
We here introduced the concept of and a practical guide to SML approaches in ecological studies (Table 2) and demonstrated their usefulness for analyzing nonlinearity and higher-order interactions (Table 1) by analyzing an artificially generated dataset. Statistically reinforced machine learning approaches can be used for classification problems, while we demonstrated a regression problem. Moreover, this article introduced only a small fraction of machine learning algorithms, while many more algorithms have been progressively developed (Domingos 2012). This means that other algorithms can be more appropriate for specific ecological studies than the examples we introduced here.
Although machine learning techniques themselves are attractive as a possible alternative to statistical modeling in terms of high accuracy and the abilities of pattern-mining and variable selection, they cannot infer statistical significance. This may discourage ecologists from using them in hypothesis-testing studies. Cutler et al. (2007), who first introduced the original random forest (Breiman 2001b) to ecology, also argued this point as follows: "Random forest is not a tool for traditional statistical inference. It is not suitable for ANOVA or hypothesis testing. It does not compute P values. . . (page 2792)." This is, Fig. 5. Relative importance of predictors estimated with permutation-based random forest models. By increasing the sample size, the models detected more predictors that were designed to associate with the response. however, not true for the permutation-based random forest (Hapfelmeier and Ulm 2013).
We found that the SML approach requires an adequate sample size to guarantee both statistical significance and pattern detection. The performances of the linear regression and SML approaches depend on sample size, but they showed different dependencies on sample size. Statistically reinforced machine learning approaches detected more patterns and obtained more statistical significance with increasing sample size, while the linear regression models did not follow this trend but reduced Type I error (Table 4).
The partial dependence plots (Fig. 6) demonstrated that the unimodal curve with the peak value and the conditional negative linear effect of x9 can be adequately captured when the sample size is satisfactory (n = 300 in this study). The trend of x8, at least its positive linear effect, was somewhat captured, but this was not significant. This is probably because of the large random noise compared to the x8 influence. Fig. 6. Partial dependence plots of Y in association with x7, x8, and x9 estimated with permutation-based random forests (a-c). For x9-Y relationships, the three-way interactions of x9 with x1 and x4 are further investigated, and the negative x9-Y relationship conditional to x1 and x4 became apparent (d-f).
As the result suggested, SML algorithms may perform better than linear regressions to find higher-order interactions. The linear regression models, even though they included the higherorder interactions in their model structure, did not capture them (Table 4). By contrast, ctrees and mobtrees visually suggested these interactions without explicitly designing any variable interactions (Figs. 3, 4). Partial dependence plots with random forests can indicate higher-order interactions of nonlinear relationships (Fig. 6df). However, such interactions cannot be seen unless they are specified in partial dependence plots. We may have a risk to just interpret the relationships by only confirming the x9-Y relationship ( Fig. 6a-c). While currently lacking the ability to quantify statistical significance of higher-order interactions using random forests, recently such techniques are emerging (see Basu et al. 2017, a technique that might be able to indicate more than five-way interactions).
Caution is required when interpreting the structure of tree models. Because of the rule of recursive dichotomous separation, ctree and mobtree algorithms are not ideal tools to represent continuous gradual patterns such as curve. For the ctree models, the dichotomous separation of x9 (node 4 in Fig. 3b, c) should not be interpreted as the x9-Y relationship changing discontinuously. The mobtree models separated x7 into three value ranges (nodes 1 and 2 in Fig. 4b, c), although x7 was designed as a unimodal peak. Another caveat for interpretation is that connected nodes do not necessarily represent variable interactions (Winham et al. 2012). For example, x7 does not actually interact with the others, although the x7 node is apparently connected to some others (Fig. 4b,  c). A solution for this is to cope with knowledge about x7: If the independence of the x7 effect from the x1 Ã x4 Ã x9 is known, we can also build a multiple regression that better represents the designed relationships (Fig. 7).
The most important caveat is that SMLs are not the tools that automatically reveal what the user wants, but artificial intelligence algorithms that assist the user to interpret how the data are structured in a nonlinear and interactive manner. Together with SML approaches, ecological knowledge and theory are surely required. Further applications are needed to better clarify in which circumstances SMLs are more appropriate than machine learning and statistics, and we already present an initial comparison among these ( Table 2). These approaches should be used complementarily, depending on study aim and dataset characteristics (Breiman 2001a). Even so, we believe that SML tools offer a unique and attractive empirical modeling opportunity for ecological studies. Aiming to further stimulate applications of SML approaches, we introduce five applications in ecological contexts as follows:  100  200  300  100  200  300  100  200  300  100  200  300 x1 * *** *** *** *** *** *** *** *** x4 *** *** *** *** *** *** *** x7 *** *** * *** ** x8 ** x9 ** * (presumed) *** ** ** (x7) 2 n n n n y /n y /n n y y *** *** *** x1 Ã x4 n y y n y y n y /n y /n ** x1 Ã x9 n y y n y y n y /n y /n x4 Ã x9 n y y n y y n y /n y /n ** x1 Ã x4 Ã x9 n y y n y y n y /n y /n False 0 0 0 0 1 0 1 1 0 4 1 0 Notes: AIC, Akaike's information criterion. Qualitative assessment was also done for some cases based on Figs. 3-6, whether a model indicates the unimodal curve (x7) 2 and the variable interactions among x1, x4, and x9 (y as inferable; n as not inferable; y/n as inferable if the association is presumed or interaction detection algorithm is used). The number of predictors falsely selected with statistical significance (P < 0.05) is counted ("False" as false positive).

Novel patterns
Statistical designs and assumptions based on ecological knowledge can bias our ways of interpreting observed patterns. For example, using a Kernel density estimation (i.e., a nonlinear nonparametric approach), Chaudhary et al. (2016) showed that the latitudinal gradient in species richness is "bimodal" with a dip near the equator for marine species, while it has been generally believed to be unimodal. The authors also mentioned that the pattern may further change due to climate change. Another concern is seen in the emergence of novel ecosystems that drive patterns and processes which have never occurred previously within a given biome (Hobbs et al. 2006). In a changing climate and with increasing anthropogenic stressors, we may need to reshape classical knowledge and theory on patterns.
Statistically reinforced machine learning is also a promising approach in fields where prior knowledge is still minimal and hypotheses have not been clearly developed (Hochachka et al. 2007) such as biodiversity science (Kelling et al. 2009, Rillig et al. 2015b, macrosystem ecology (Heffernan et al. 2014, Levy et al. 2014, and microbial ecology (Baldi and Brunak 2001). Pattern-mining approaches such as using permutation-based random forests in combination with partial dependence plots (Fig. 6) provide an opportunity to objectively see patterns in observed data without subjective constraints by statistical assumptions, knowledge, and theories (e.g., Bergmann et al. 2017).

Higher-order variable interactions
The presence of higher-order variable interactions in ecological communities causes unpredictable nonlinear dynamics (Billick and Case 1994). Assuming linearity and additivity in statistical models implicitly means that more complex non-additive higher-order interactions are assumed to be negligible or absent. The validity of this assumption, however, is rarely known. Mayfield and Stouffer (2017) revealed that the consideration of higher-order interactions is inevitable to better predict species' performances in communities. Macroecology recognizes that an interplay among factors at multiple hierarchical scales significantly contributes to structuring complex nonlinear patterns, which are often unexpected and unpredictable (cross-scale interaction; Peters et al. 2007, Soranno et al. 2014).

Context dependency
Some ecological patterns as well as processes emerge only in particular conditions or vary along an environmental gradient (e.g., Tonkin et al. 2016). For example, latitude-richness relationships can be either positive, negative, or there can be no trend, depending on, for example, length of the latitudinal gradient examined and regions (e.g., Heino 2011). Burkepile and Hay (2006) suggested that the effects of human alteration of food webs and nutrient availability on primary producers vary among latitudes and primary producers, and with the inherent productivity of ecosystems. A response of an organism to an external forcing may depend on the surrounding environmental condition (Hunter et al. 2014) and its past experience (priming; Rillig et al. 2015a, Andrade-Linares et al. 2016. Searching variable interactions with the mobtree algorithm can identify such context dependencies (Fig. 4). Furthermore, these approaches can be Fig. 7. Model-based tree that presumed linear relationships of x9-Y and x7-Y (n = 300) indicated the unimodal curve of the x7-Y relationship (see Fig. 4c for comparison). This indicates that adequate a priori variable selection by expert knowledge enhances the ability of machine learning. used to integrate seemingly contradictory patterns (e.g., Heino 2011) in a merged dataset by exploring factors which differentiate patterns.
Prediction power as an alternative to statistical significance and effect size Statistical models can evaluate statistical significance and effect size of predictors based on a pre-defined assumption on data. Therefore, they might miss important associations that apparently do not exist. In addition, assuming linearity on nonlinear relationships often estimates effect size inappropriately (Gilbert and Bennett 2010). Therefore, the structure of a statistical model can be inappropriately assumed when patterns and underlying mechanisms are not well-known, which, in turn, SML can contribute to investigating such fields with non-manipulative data. Permutation-based random forests can directly evaluate the significance and importance of each predictor in terms of prediction power, which is a more honest indicator to assess variable importance than parametric statistical significance and effect size (Hapfelmeier and Ulm 2013;Fig. 5). The algorithm can evaluate importance of predictors, while automatically taking nonlinearity and variable interactions into account.
Statistically reinforced machine learning + statistical modeling. -Statistically reinforced machine learning approaches are used for confirming patterns (e.g., if linear assumption is valid or not and interactions) and variable selection before designing statistical models. Delgado-Baquerizo et al. (2016) used a permutation-based random forest as a priori variable selection and then applied a structural equation model to confirm their hypotheses. We consider that generalized additive models can help evaluate statistical significance of higher-order interactions that are suggested by SML models.

ACKNOWLEDGMENTS
The work was funded by the German Federal Ministry of Education and Research (BMBF) within the Collaborative Project "Bridging in Biodiversity Science (BIBS)" (funding number 01LC1501A). We thank M. Thomas for sharing his knowledge and helpful comments on an earlier version of the manuscript. We also thank two anonymous reviewers for constructive comments.