Model application niche analysis: Assessing the transferability and generalizability of ecological models.

The use of models by ecologists and environmental managers, to inform environmental management and decision-making, has grown exponentially in the past 50 years. Due to logistical, economic, and theoretical benefits, model users frequently transfer preexisting models to new sites where data are scarce. Modelers have made significant progress in understanding how to improve model generalizability during model development. However, models are always imperfect representations of systems and are constrained by the contextual frameworks used during their development. Thus, model users need better ways to evaluate the possibility of unintentional misapplication when transferring models to new sites. We propose a method of describing a model's application niche for use during the model selection process. Using this method, model users synthesize information from databases, past studies, and/or past model transfers to create model performance curves and heat maps. We demonstrated this method using an empirical model developed to predict the ecological condition of plant communities in riverine wetlands of the Appalachian Highland physiographic region, U.S.A. We assessed this model's transferability and generalizability across (1) riverine wetlands in the contiguous U.S.A., (2) wetland types in the Appalachian Highland physiographic region, and (3) wetland types in the contiguous U.S.A. With this methodology and a discussion of its critical steps, we set the stage for further inquiries into the development of consistent and transparent practices for model selection when transferring a model.


INTRODUCTION
The use of quantitative models (e.g., statistical, mechanistic, indices) in ecology has proliferated since the early 1970s, with an increasing number developed to inform environmental management and policy (NRC 2007, Jørgensen 2008, Jørgensen and Nielsen 2012). Due to significant logistical, economic, and theoretical benefits, model users are not only implementing models at the sites for which they were developed (herein, development sites), but are increasingly transferring models to alternative sites (herein, application sites) where data can be scarce. Models are transferred across at least one of five dimensions of context: spatial scale (e.g., plot, watershed, region, continent), temporal scale (e.g., minutes, hours, months, years), geographic location (e.g., watershed to watershed), time (e.g., year to year), and ecological organizational unit (e.g., species, habitat type, ecosystem type; Wu and Li 2006).
Model developers are always looking for ways to improve model generalizability. For instance, some modelers have highlighted the importance of process-based model equations (Goodall 1972, Vanreusel et al. 2007, Amano 2012, Wenger and Olden 2012), or the incorporation of equations that describe physical laws (Baird et al. 2003), functional or trait-based relationships (Pont et al. 2006, Pease et al. 2015), or causal rather than correlative predictors (Bamford et al. 2009, Estes et al. 2011). Others have emphasized the importance of correctly identifying the scale(s) over which the predictor-response relationships operate (De Marco et al. 2008, Heikkinen et al. 2012) and the appropriate level of model complexity (Meynard and Quinn 2007, Fukuda and Hiramatsu 2008, McDonald and Urban 2010, Falk and Mellert 2011). Despite these efforts, model developers cannot anticipate all of the contexts to which model users might transfer their models, and it is incumbent on model users to carefully select models for their intended purpose and system. Ecological model developers and users do not have a consistent methodology for informing model selection based on the past performance of candidate models (Miller et al. 2004, NRC 2007). In a 30-yr review of the legal challenges to predictive models used in regulatory decision-making, McGarity and Wagner (2003) found that transferring models to contexts that differed from the model development contexts was one of the biggest vulnerabilities to their legal defensibility. Those who have argued successful legal challenges (e.g., Ohio vs. USEPA in McGarity and Wagner 2003) have done so by arguing that the models in question were used outside their respective "application niche[s]," that is, ". . .a particular prototype physical system or set of defining conditions for which the model code is scientifically defensible" (USEPA ATFERM 1994: II-2). Establishing rigorous model transfer criteria is important not only for legal proceedings, but also for model users to demonstrate transparency in their decision-making process.
We developed a method for assessing an ecological model's transferability and generalizability, which we call a model application niche analysis. Our methodology is conceptually based on Tobler's first law of spatial autocorrelation. This law states that the closer two points are in space or time, the more similar they are likely to be in terms of their structure and function (Tobler 1970). Economists have adopted this approach for benefit transfer models, expanding the law of spatial autocorrelation to contextual similarity. They suggest that the more contextually similar an application site is to the model development site, the more likely it is that a model transfer will be successful (Rosenberger and Phipps 2007). The importance of contextual similarity has also been recognized in species distribution modeling (also known as ecological niche modeling). These modelers have created tools to identify when application sites (also known as projection or calibration domains) have model predictor values outside the bounds of predictor values found across sites used in model development (also known as reference or transfer domains; Elith et al. 2010, Zurell et al. 2012, Owens et al. 2013, Mesgaran et al. 2014). However, these modelers also warn that the identification of model extrapolation does not circumvent the need for model evaluation (Elith et al. 2010, Mesgaran et al. 2014). Our method evaluates the transferability and generalizability of a model by testing the importance of contextual similarity using model validation results (also known as operational validation results; Rykiel 1996) from model application sites. We acknowledge that the need to gather and evaluate model validation data could be considered a logistical challenge of this methodology. Most concerning, model validations are not performed regularly in ecological modeling. This issue has been highlighted in reviews of oceanographic models (Stow et al. 2009), aquatic biogeochemical models (Arhonditsis and Brett 2004), and individual-based ecological models (Kubicek et al. 2015). Beyond the risks of using model outputs that have not been validated, failing to report successful and unsuccessful model validations carries an additional cost: it hinders one's ability to identify a model's application niche for use during model selection.
We begin this paper with an overview of our methodology for evaluating a model's application niche. We then provide a case study, where we assess a riverine wetland plant condition model's transferability and generalizability across spatial locations (i.e., transfers across the United States), across organizational units (i.e., transfers across wetland types), and across both spatial locations and organizational units. We end with a discussion of some critical considerations that affect the utility of a model application niche analysis and offer ideas on future lines of inquiry to make this approach practical for a wider range of ecological models.

METHODOLOGY
Tobler's first law of spatial autocorrelation is an underpinning of geostatistics, a set of spatial statistics used to describe the spatial and/or temporal patterning of a variable (for use in ecology, see Legendre and Fortin 1989). A common tool in geostatistics is the semivariogram, a graph of the semivariance between measurements (i.e., pairwise comparisons) of a variable as a function of the physical distance (i.e., lag distance) between the measurement locations (Fig. 1a). By fitting a curve to the semivariogram, we can estimate model parameters that describe the variable's spatial structure. Common parameters include the nugget, the sill, the partial sill, and the range. The nugget is the semivariance at a lag distance of zero. It represents unexplained variability due to variance that exists at finer scales than were measured and/or analytical measurement error(s). The sill is the semivariance where the curve reaches an asymptote. The partial sill is the semivariance that remains after subtracting the nugget from the sill, and represents the spatial variance of the variable. The range is the distance at which the asymptote is reached, and defines the distance (e.g., m, km) beyond which two measurements of the variable are considered statistically independent of each other.

Fig. 1. Plot (a) depicts a hypothetical semivariogram model of semivariance between measurements as a function of the physical distances between the measurement locations. Parameter estimates used to describe spatial structure are also displayed, including the nugget, the sill, the range, and the partial sill (delineated using gray shading). This curve shows a typical trend: measurement locations closest together are autocorrelated, with low semivariance; as the distance between measurement locations increases, the semivariance of the variable increases, until a distance at which measurement locations are no longer autocorrelated. Plot (b) depicts a hypothetical model performance plot, describing the relationship between model error and a measure of contextual dissimilarity between the model development site and model application sites. Parameters that describe this plot include the model development nugget (C_D, delineated using light purple shading), the application nugget (C_A), the application sill (C_T), the contextual range (R_C), the contextual maximum (R_max), and the curve's maximum generalizability error (C_Gmax, delineated using gray shading). One might expect model error to increase as contextual dissimilarity increases (e.g., black line). As the slope of the relationship approaches zero, the model becomes increasingly generalizable across the contextual variables used to construct the dissimilarity index. This figure was created using ggplot2 (Wickham 2009) in R (version 3.4.0; R Core Team 2017).
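As a minimal R sketch of these calculations, the geoR package (used later in this paper) can fit a semivariogram model and return the nugget, partial sill, and range; the coordinates and measurement values below are simulated placeholders rather than data from any study.

library(geoR)
set.seed(1)
sites <- data.frame(x = runif(200), y = runif(200), value = rnorm(200))  # hypothetical measurement locations and values
gd  <- as.geodata(sites, coords.col = 1:2, data.col = 3)
vg  <- variog(gd, max.dist = 0.7)                        # empirical semivariogram
fit <- variofit(vg, cov.model = "exponential",           # weighted least squares fit
                ini.cov.pars = c(0.5, 0.2), nugget = 0.3)
fit$nugget          # nugget
fit$cov.pars[1]     # partial sill (sill = nugget + partial sill)
fit$cov.pars[2]     # range parameter phi (practical range is about 3 * phi for the exponential model)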
When model performance is the variable of interest, a semivariogram can be used to interpolate a model's performance across the domain of the application sites (i.e., measurement sites). This model performance map can be used to identify the model's application niche and connect model performance to underlying biogeophysical variables (for similar concepts, see Gardner and Urban 2003: 188-190, Figure 10.1, and Leathwick 2009: 681-682, Figure 1). However, the data requirements to perform a semivariogram analysis and interpolation can be prohibitive: accurate interpolation requires that the application sites efficiently sample the variance of the selected context (e.g., spatial, temporal, habitat type, or ecosystem type; see examples of spatial sampling methods in Burrows et al. 2002).
We propose an approach to alleviate some of these data requirements, where the semivariogram can be modified to evaluate model performance as a function of contextual dissimilarity (i.e., Fig. 1b, model performance plot). In a model performance plot, model error replaces semivariance (y-axis), and a measure of contextual dissimilarity between the model development site and each of the model application sites replaces lag distance (x-axis). Similar to a typical semivariogram, a model can be fit to the data to estimate parameters that explain a model's transferability and generalizability. In a model performance plot, there are two nuggets, one associated with the model development error (C_D) and one associated with the model application error (C_A). C_D is known as primary site error or measurement error in economic benefit transfer modeling (Rosenberger and Stanley 2006). It includes the unresolved variability in the process being modeled, and the error related to the methods and model assumptions. C_A includes C_D and the contextual interpolation model error. Fig. 1b shows a scenario where C_A equals C_D. In this case, no model error is added when transferring the model to application sites that are contextually the same as sites used in model development.
The sill (C_T) includes C_A and the contextual extrapolation model error. Subtracting C_D from C_T results in the maximum model generalizability error, C_Gmax. The model generalizability errors (C_G) make up the function (i.e., curve shape, slope) that describes the relationship between model error and contextual dissimilarity. This is the model error that arises when transferring the model to conditions that differ from those under which the model was developed (Rosenberger and Stanley 2006, Liu et al. 2011), and it includes uncertainties associated with methodological errors or poor model calibration across the application sites. The model's fit (e.g., adjusted R², root mean squared error) describes how much of the variability in C_G can be explained by the contextual dissimilarity metric being evaluated.
Finally, the contextual range parameter (R_C) defines the level of contextual dissimilarity at which C_G becomes greater than zero, or at which C_G crosses a predetermined (i.e., user-defined) level of acceptable model error. The larger R_C is relative to the contextual gradient evaluated (R_max), the more generalizable the model is expected to be. A fully generalizable model would be expressed as a nugget model (i.e., a slope of zero) that intersects the y-axis at the model error associated with C_D (i.e., where C_T ≈ C_A ≈ C_D and R_C ≈ R_max). Applying the law of spatial autocorrelation to contextual dissimilarity, Fig. 1b depicts an expected function (i.e., exponential), where model error increases as the model is transferred to application sites that are increasingly dissimilar from the context in which the model was developed.
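To make these parameter definitions concrete, the following R sketch fits an exponential model performance curve (as in Fig. 1b) to simulated dissimilarity and error data and derives C_T, C_Gmax, and R_C. The data, starting values, development nugget, and acceptable-error threshold are illustrative assumptions, and nls is used unweighted here for simplicity, whereas the analyses in this paper use weighted least squares.

set.seed(2)
perf <- data.frame(md = runif(150, 0, 4))                        # contextual dissimilarity (hypothetical)
perf$err <- 0.6 * exp(0.3 * perf$md) + abs(rnorm(150, 0, 0.2))   # model error (hypothetical)
c_d <- 0.5                                                       # development nugget C_D (hypothetical, from validation data)
fit <- nls(err ~ cA * exp(b * md), data = perf, start = list(cA = 0.7, b = 0.1))
cA <- coef(fit)["cA"]; b <- coef(fit)["b"]                       # cA = application nugget C_A
r_max  <- max(perf$md)                                           # contextual maximum R_max
c_t    <- cA * exp(b * r_max)                                    # application sill C_T at the contextual maximum
c_gmax <- c_t - c_d                                              # maximum generalizability error C_Gmax
thresh <- 1.0                                                    # user-defined acceptable model error
r_c <- if (cA >= thresh) 0 else min(r_max, log(thresh / cA) / b) # contextual range R_C (assumes b > 0)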

Wetland ecological condition model
To demonstrate this methodology, we developed an empirical model that predicts a wetland's mean coefficient of conservatism (mean CoC) value (unitless, ranging from 0 to 10). This mean CoC value is a plant community index that has been used as a measure of wetland ecological condition (Lopez and Fennessy 2002, Bried et al. 2013), including for wetland restoration and creation efforts (Gutrich et al. 2009, Matthews et al. 2009). To calculate a wetland's mean CoC value, each plant species is assigned a value between 0 and 10 (determined by best professional judgment, e.g., Chamberlain and Ingram 2012) based on the probability of the species' presettlement occurrence in a region and its relative tolerance to anthropogenic disturbance (Swink and Wilhelm 1994, Nichols 1999). Higher values are given to plants that are sensitive to human disturbances, while lower values are given to widespread, generalist species (Wilhelm and Ladd 1988). The mean CoC value is the average of the CoCs of all plant species found within a wetland.
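As a simple illustration of this site score, the mean CoC is just the average of the species-level coefficients; the species and CoC values below are hypothetical placeholders.

coc <- c(Carex_stricta = 7, Typha_latifolia = 1, Onoclea_sensibilis = 4)  # hypothetical species CoC values (0-10)
mean_coc <- mean(coc)   # wetland mean CoC = 4 for this hypothetical plant list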
Species CoC values vary across regions, reflecting geographic differences in plant species tolerances, and are not available for some regions of the United States (see section 5.9 in USEPA 2016). As such, a wetland mean CoC model can be useful for transfers when there is a need to prioritize the selection of wetlands for conservation and restoration efforts. We developed this model using data collected from riverine wetlands across Pennsylvania, USA, a part of the Appalachian Highland physiographic province (Bishel-Machung et al. 1996, Cleveland et al. 2011, Chamberlain and Ingram 2012, Drohan and Brooks 2013, Chamberlain and Brooks 2016). Within this dataset, we identified 107 riverine wetlands for use in model construction (Fig. 2). Twenty-two percent (n = 24) of these wetlands were excluded at random for model validation. We obtained mean CoC values for each wetland using site plant lists and the Penn State Riparian Floristic Quality Assessment Calculator (i.e., identified as the Total Mean C value in the output; http://apps.cei.psu.edu/fqacalc/).
The wetland ecological condition model predicts a wetland's mean CoC value using on-site soil surface pH and the landscape development index (LDI) for a 0.126-km² circular area (200 m radius) around the wetland's center (Table 1). The LDI describes the level of anthropogenic activity in a given area and is measured as a weighted average of land cover types in that area (Brown and Vivas 2005). The coefficients for each land cover type are calculated as the normalized natural log of the emergy use per unit area per time (sej ha⁻¹ yr⁻¹) and are defined as the amounts of nonrenewable energy needed to sustain a given land use activity (Brown and Vivas 2005). To calculate an LDI value for each wetland, we used a crosswalk developed by Lane et al. (2012) between the original LDI coefficients (Brown and Vivas 2005) and the land cover categories in the National Land Cover Database (NLCD). We extracted the 2011 NLCD (Homer et al. 2015) data using ArcGIS (version 10.5; Environmental Systems Research Institute, Redlands, California, USA). The data used in the construction and validation of this model can be found in Data S1 (also see Data S1 Column Explanations and Metadata S1).
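As an illustration of the LDI calculation described above, the following R sketch computes an area-weighted average of land cover coefficients within a wetland's buffer; the coefficients and percent covers shown are hypothetical placeholders rather than the Brown and Vivas (2005) or Lane et al. (2012) values.

ldi_coef  <- c(natural = 1.0, pasture = 3.4, row_crop = 4.5, residential = 7.5)  # hypothetical LDI coefficients
pct_cover <- c(natural = 60, pasture = 10, row_crop = 20, residential = 10)      # % of the buffer in each cover type
ldi <- sum(ldi_coef * pct_cover) / sum(pct_cover)   # weighted-average LDI for the site (2.59 in this example)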
We assessed our model's transferability and generalizability to (1) riverine wetlands across the contiguous United States (spatial transfer), (2) wetland types within the Appalachian Highland physiographic region (organizational unit transfer), and (3) wetland types across the contiguous United States (spatial-organizational transfer). Currently, mean CoC values (or the vegetation community composition lists on which mean CoC values are based) and soil surface pH levels are available from only a limited number of published studies, predominantly across the eastern United States (e.g., Lopez and Fennessy 2002, Reiss 2006, Boughton et al. 2010, Dee and Ahn 2012). However, the 2011 U.S. EPA National Wetland Condition Assessment (NWCA) database (USEPA 2016) reports soil surface pH values (identified as Layer 1 for PH_H2O in the NWCA database) for 825 wetlands across the contiguous United States (Fig. 2), including 194 riverine sites and 112 Appalachian Highland sites. Mean CoC values were provided for these sites upon request, and we calculated LDI values for each wetland based on the wetland assessment area center (identified as AA_CENTER_LAT and AA_CENTER_LON in the NWCA database). Data for the application sites can be found in Data S2 (also see Data S2 Column Explanations and Metadata S2).

Analyses
Semivariogram of model performance.-For comparison to the model application niche analysis, we performed a semivariogram analysis on model error across the full extent of wetlands in the NWCA dataset (n = 825). Model error was measured as the difference between our application sites' measured mean CoC values and their average modeled mean CoC values. We used weighted least squares to fit a powered exponential model to the semivariance of model errors across distance bins of 200 km. Distance bins ran from 0 to 4000 km, with a minimum of 30 point-pairs per bin. We subsequently used parameter estimates from the semivariogram model in ordinary kriging, with a constant mean and a global neighborhood, to interpolate model error and model error uncertainty (i.e., standard deviation of model error) across the contiguous United States. We performed our analyses in the geoR package (Ribeiro and Diggle 2001) of R (version 3.4.0; R Core Team 2017).
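A minimal R sketch of this interpolation step, reusing a fitted semivariogram object such as 'fit' and the geodata object 'gd' from the earlier sketch, is shown below; the prediction grid is a hypothetical stand-in for the contiguous United States extent.

grid <- expand.grid(x = seq(0, 1, by = 0.05), y = seq(0, 1, by = 0.05))  # hypothetical prediction grid
kc <- krige.control(type.krige = "OK", obj.model = fit)                  # ordinary kriging with a constant mean
kr <- krige.conv(gd, locations = grid, krige = kc)                       # global neighborhood by default
interp_error       <- kr$predict          # interpolated model error at the grid nodes
interp_uncertainty <- sqrt(kr$krige.var)  # standard deviation of the interpolated model error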
Model application niche analysis.-To create model performance plots for the wetland ecological condition model across model application sites, we constructed contextual dissimilarity indices and calculated a measure of model error. We created one contextual dissimilarity index based on variables intrinsic to the model (i.e., model predictors) and two contextual dissimilarity indices based on variables extrinsic to the model (i.e., variables not included explicitly in the model) that were expected to affect the model's spatial or organizational transfer. We used the absolute difference between measured mean CoC values and average modeled mean CoC values as our measure of model error. Both linear and non-linear model performance curves were evaluated using weighted least squares to extract model performance parameters across the contextual dissimilarity gradients. When a significant model fit could not be obtained, a nugget model (i.e., zero-slope model) was evaluated to determine whether there was a significant contextual interpolation error (i.e., H₀: C_A = C_D). To create the nugget model, we used the mean model error across all application sites within the dataset. We conducted model evaluations in the base package of R (version 3.4.0; R Core Team 2017) and in JMP® Pro (version 12.2.0; SAS Institute 2015).
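As a sketch of the nugget-model check (H₀: C_A = C_D), the mean application error can be compared against the error at the held-out validation sites with a Welch two-sample t-test, the test reported in our Results; the error vectors below are hypothetical placeholders.

app_err <- abs(rnorm(190, 0.9, 0.6))   # |measured - modeled| mean CoC at application sites (hypothetical)
dev_err <- abs(rnorm(24, 0.6, 0.4))    # |measured - modeled| mean CoC at model validation sites (hypothetical)
c_a <- mean(app_err)                   # nugget model: mean model error across all application sites
t.test(app_err, dev_err)               # Welch two-sample t-test by default (var.equal = FALSE)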
1. Intrinsic contextual dissimilarity index.-We created an intrinsic contextual dissimilarity index by combining measures of univariate dissimilarity for the model predictors. These included soil pH and the percent land cover categories used to construct the LDI. The LDI assigns the same emergy value (i.e., their weights in the LDI equation are equal) to the percent cover of forest, shrub, wetland, and water cover types (Table 1). As such, the percent cover of these land cover classes was combined into one variable before creating the index. To create the intrinsic dissimilarity index, we used equations developed by Mesgaran et al. (2014) with minimal adaptation (i.e., we used the absolute value to obtain a positive gradient of dissimilarity). For each application site (i) and contextual variable (j) combination, we calculated a univariate distance (UD_ij) index:

UD_ij = |A_ij − min(D_j)| / (max(D_j) − min(D_j)) if A_ij < min(D_j); UD_ij = 0 if min(D_j) ≤ A_ij ≤ max(D_j); and UD_ij = |A_ij − max(D_j)| / (max(D_j) − min(D_j)) if A_ij > max(D_j),

where A_ij is the value of the contextual variable, j (e.g., soil pH, grassland, and barren lands), at the model application site, i, and min(D_j) and max(D_j) are the minimum and maximum values of contextual variable, j, found across the model development sites. When UD_ij = 0, the value of j at i is within the range of values of j at sites used in model development (i.e., min(D_j) ≤ A_ij ≤ max(D_j)). When UD_ij > 0, the value of j at i is outside the range of values of j at sites used in model development. One approach would be to create model performance plots for each univariate dissimilarity index separately. However, in our case study few application sites fell outside any one predictor's range. As such, we opted to calculate an additive multivariate dissimilarity index (Intrinsic MD) for each application site, i, by combining the UD values across all contextual variables (j):

Intrinsic MD_i = Σ_{j=1}^{p} UD_ij,

where p (p = 10) is the total number of contextual variables included in the index.
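A small R sketch of this index follows, with only two of the p = 10 contextual variables shown and with hypothetical development ('dev') and application ('app') data frames standing in for the case-study data.

ud <- function(a, d_min, d_max) {                     # univariate distance UD_ij
  ifelse(a < d_min, abs(a - d_min) / (d_max - d_min),
         ifelse(a > d_max, abs(a - d_max) / (d_max - d_min), 0))
}
dev <- data.frame(ph = runif(80, 4.1, 7.5),  grass = runif(80, 0, 2))    # development-site predictors (hypothetical)
app <- data.frame(ph = runif(190, 3.5, 8.5), grass = runif(190, 0, 40))  # application-site predictors (hypothetical)
ud_mat <- sapply(names(app), function(j) ud(app[[j]], min(dev[[j]]), max(dev[[j]])))
intrinsic_md <- rowSums(ud_mat)                       # Intrinsic MD_i: UD_ij summed across contextual variables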
2. Spatial contextual dissimilarity index.-We hypothesized that the model would be more transferable to riverine wetlands in regions of the United States that have naturally derived acidic soil conditions, and less transferable to riverine wetlands in regions of the United States that have naturally derived neutral-to-basic soil conditions. Soil pH varies widely within the Appalachian Highland physiographic region and is known to affect vegetative community composition (Simon et al. 2005). Acidic soils naturally occur in Pennsylvania, due in part to the abundance of rainfall and subsequent leaching of alkaline ions (Blumberg and Cunningham 1982, Pabian and Brittingham 2007). This is particularly true in higher elevation forested wetlands, which have acidic shale-derived soils. Wetlands in these acidic soils have plant communities with higher mean CoC values, reflecting a higher proportion of specialist species. In contrast, lower mean CoC values are generally found in wetlands located in Pennsylvania's valleys. Valleys have neutral-to-basic soils, due in part to geology (i.e., Alfisols formed over carbonate and limestone; Miller 1995) and in part to land use activities (e.g., liming). We expected the model to underestimate the mean CoC values of application sites with naturally derived neutral-to-basic soil, particularly in regions with low human impact. To test this hypothesis, we gathered data for contextual variables associated with soil formation, including mineralogical and geochemical data for the A horizon and C horizon. We created a weighted multivariate spatial dissimilarity index using directional univariate distance indices. We chose a weighted index because we did not expect all univariate distance indices to have an equal effect on model error. We also chose to use directional univariate distance indices because we did not want to assume that the effects of context on model error would be the same above and below the model development site's contextual range. For these metrics, only values above or below the model development site's range were considered dissimilar. We used stepwise linear regression (backward and forward) to narrow down the number of directional univariate distance indices that explained most of the variance in model error. We subsequently evaluated combinations of these univariate indices to identify the fewest number that explained the greatest amount of variance in model error (using Akaike's information criterion values, analyses of variance, and adjusted R²). Slope coefficients from the selected linear model were used as weights for each contextual variable in the multivariate dissimilarity index.
The spatial multivariate dissimilarity index (Spatial MD) for application site, i, was composed of directional elevation (DE) and temperature (DT) gradients, each weighted by its slope coefficient from the selected linear model. In other words, riverine application sites in higher elevation regions and in higher mean annual temperature regions had larger model errors, compared to riverine application sites at elevations and mean annual temperatures found within or below the model's intended range. We performed model evaluations using the MASS package (Venables and Ripley 2002) and the base package of R (version 3.4.0; R Core Team 2017).
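The following R sketch illustrates how such a weighted index can be built: directional univariate distances (here only values above the development range are scored), stepwise selection with MASS::stepAIC, and the selected slope coefficients reused as weights. The simulated data, variable names, and development ranges other than the 638 m and 11.6°C maxima quoted in our Results are illustrative assumptions.

library(MASS)
set.seed(3)
dir_ud <- function(a, d_min, d_max) pmax(a - d_max, 0) / (d_max - d_min)   # distance above the development range only
app <- data.frame(elev_above   = dir_ud(runif(190, 0, 2000), 100, 638),    # 638 m = development-site maximum elevation
                  temp_above   = dir_ud(runif(190, 2, 20),   4, 11.6),     # 11.6 deg C = development-site maximum temperature
                  precip_above = dir_ud(runif(190, 300, 2000), 800, 1200)) # hypothetical extra candidate variable
app$err <- 0.6 + 0.8 * app$elev_above + 0.5 * app$temp_above + abs(rnorm(190, 0, 0.3))  # hypothetical model error
sel <- stepAIC(lm(err ~ ., data = app), direction = "both", trace = FALSE) # backward and forward selection
w   <- coef(sel)[-1]                                                       # slope coefficients become index weights
spatial_md <- as.matrix(app[, names(w), drop = FALSE]) %*% w               # weighted Spatial MD index per site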
3. Organizational contextual dissimilarity index.-We evaluated the model's transferability and generalizability across wetland types within the Appalachian Highland physiographic region by developing a multivariate dissimilarity index based on the commonly used hydrogeomorphic (HGM) classification system (Brinson 1993). This qualitative scheme identifies differences in wetlands based on their water source(s), hydrodynamics, and geomorphic setting. Peterson-Smith et al. (2009) found that HGM classes (i.e., headwater floodplains, riparian depressions, and slope wetlands) were related to soil properties and plant community composition across this region (e.g., New York, Pennsylvania, Virginia, and West Virginia). Further, Bried et al. (2013) found that the relationship between wetland condition and mean CoC values varied between hydrogeomorphic settings (i.e., hydrologically connected and isolated wetlands) of non-forested sites in upstate New York.
First, we created a dissimilarity index across HGM classifications by identifying the probability (i.e., 0 = not likely, 0.5 = possible, 1.0 = likely) or frequency (i.e., daily = 0, seasonally = 0.5, annually = 1) of occurrence for six attributes related to water sources and hydrodynamics (Table 2). We assigned values to each HGM class based on best professional judgment. For each HGM class, q, the HGM dissimilarity index (HGM_q) was calculated as:

HGM_q = Σ_{a=1}^{r} |P[R_a] − P[HGM_aq]|,

where P[R_a] is the probability (or frequency) of occurrence in a riverine wetland for attribute, a (a = 1 . . . r, where r is the number of attributes), and P[HGM_aq] is the probability (or frequency) of occurrence of attribute, a, for HGM class, q. The absolute differences were summed across all attributes for each HGM class. Hydrogeomorphic classification rankings fell in the order: riverine wetlands, depressions, fringing wetlands, and flats, with slopes and tidal wetlands being equally different from riverine sites. This broke application sites into quantitative groups based on their HGM_q. However, we also considered a few site-level differences. We added a water salinity index, WS_it, where t describes the water salinity class (i.e., fresh, brackish, or saline) for application site, i. These salinity classes were assigned in the field for the application sites (identified as WATER_SALINITY in the NWCA database), and classes were given ordinal rankings (i.e., fresh = 0, brackish = 0.5, and saline = 1.0). Similar to the calculations for HGM_q, we calculated this index using the absolute difference between the water salinity class found at sites used in model development (i.e., fresh) and the salinity class at each of the application sites. We also considered using a plant community index (i.e., based on aquatic bed, herbaceous, scrub/shrub, and forested classes). However, the model development dataset included the full range of vegetation categories; as such, this index would have had no effect on the relative differences between application sites along the dissimilarity index. It should also be noted that Bried et al. (2013) did not find a significant effect of vegetative wetland type (i.e., pine barren vernal pond, wet sedge meadow, shrub swamp) on the relationship between wetland condition and mean CoC values. We calculated the wetland type dissimilarity index (Organizational MD) for each application site, i, as:

Organizational MD_i = HGM_iq + WS_it,

where HGM_iq is the HGM dissimilarity index for application site, i, given its HGM class, q, and WS_it is the water salinity index for application site, i, given its salinity class, t.
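A brief R sketch of these calculations follows; the attribute scores are illustrative placeholders rather than the Table 2 values, and the final additive combination of HGM_iq and WS_it follows the index definition above.

scores <- data.frame(                  # rows = six water source / hydrodynamic attributes (hypothetical scores)
  riverine   = c(1.0, 0.5, 0.0, 1.0, 0.5, 0.0),
  depression = c(0.5, 1.0, 0.0, 0.5, 0.5, 0.0),
  slope      = c(0.0, 0.5, 1.0, 0.0, 1.0, 0.5))
hgm_q <- colSums(abs(scores - scores$riverine))      # HGM_q: summed absolute differences from the riverine scores
ws    <- c(fresh = 0, brackish = 0.5, saline = 1.0)  # water salinity ordinal ranks
organizational_md <- hgm_q["slope"] + ws["brackish"] # example: Organizational MD for a hypothetical brackish slope wetland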

Results
Model performance summary.-While the wetland ecological condition model was suitable for most application sites, the 95% prediction intervals failed to capture the measured mean CoC values at 30% of riverine application sites (Fig. 3a) and 17% of Appalachian Highland application sites (Fig. 3b), compared to 4% of the model validation sites. This was primarily due to under-prediction of the mean CoC values (Table 3). Under-prediction could be problematic if the model is to be used in wetland conservation and restoration decisions; it would suggest these wetlands have a lower level of ecological condition than they actually do.
Semivariogram of model performance.-The semivariogram analysis indicated significant spatial structure in model error across the NWCA dataset. Approximately 25% of the variance in model error could be explained by spatial autocorrelation (i.e., sill = 0.62, nugget = 1.88; Fig. 4a). Our interpolated map of model error revealed that the model underestimated the mean CoC values at wetlands along the coastlines of Florida and the Gulf of Mexico (Fig. 4b). This map also revealed lower model performance in the Rocky Mountains (underestimation) and in parts of the Great Lakes region (underestimation), California (underestimation), and the Great Plains (overestimation). However, the semivariogram range was relatively low, estimated at 1151 km. Since most wetlands were farther than 1151 km apart, uncertainty in model error predictions (expressed as the standard deviation of model error) was relatively consistent (i.e., ranging between 1.4 and 1.6) across the contiguous United States (Fig. 4c).
Model application niche analysis: spatial transfer.-The Intrinsic MD index could not be used to explain variability in model error across riverine application sites (Fig. 5a). One issue was that only 44% of riverine application sites were dissimilar from wetlands used in model development (i.e., Intrinsic MD_i index > 0). When riverine application sites were dissimilar, it was primarily attributable to the presence of grasslands or to soil pH levels. Forty-two percent of riverine application sites with an Intrinsic MD_i index > 0 had higher percentages of grasslands compared to wetlands used in model development (i.e., model development grassland cover: 0-2%). Twenty-nine percent of riverine application sites with an Intrinsic MD_i index > 0 were outside the soil pH range found in wetlands used in model development (i.e., model development soil pH 4.09-7.47). The majority of these riverine application sites had more basic soil conditions. An additional 12% of riverine application sites with an Intrinsic MD_i index > 0 differed from wetlands used in model development in both their percentages of grasslands and their levels of soil pH.

We subsequently tested a nugget model to see whether there was a significant contextual interpolation error (i.e., H₀: C_A = C_D). The application nugget (C_A = 0.96 ± 0.05) was significantly higher (i.e., Welch two-sample t-test: t = 3.32, df = 38.75, P = 0.001) than the model development nugget (i.e., C_D = 0.59 ± 0.10). Given the lack of model fit across the Intrinsic MD index and the significant interpolation error, we concluded that the riverine application sites' generalizability error could not be satisfactorily explained by extrapolation of the model to values outside the model's predictor ranges. Further, the contextual interpolation error could be attributed to a lack of model calibration at riverine application sites, or issues of model structure due to extrinsic contextual variables not considered during model development.
We were able to explain a significant portion of model error (i.e., Model Error_i = 0.71 × exp[0.82 × Spatial MD_i], residual squared error [RSE] = 0.66, df = 192) using the Spatial MD index (Fig. 5b). Seventy-four percent of riverine application sites had a Spatial MD index >0. The application nugget (C_A = 0.71 ± 0.05) was no longer significantly different (i.e., Welch two-sample t-test: t = 1.07, df = 38.16, P = 0.291) from the model development nugget. This lack of contextual interpolation error suggested that the model was generalizable to riverine application sites within or below the elevation (i.e., <638 m above sea level) and mean annual temperature (i.e., <11.6°C) ranges found at sites used in model development. All of the variability in model error that was explained by the Spatial MD index was attributed to the contextual extrapolation error (C_T = 2.66, C_Gmax = 2.07). We used the upper standard error (SE_U; Fig. 5b, upper dark purple dotted line) of the model validation data as a threshold to determine R_C (i.e., C_G > SE_U). Using this criterion, R_C was approximately zero. Based on this criterion, the model was not generalizable to riverine application sites above the maximum elevation and mean annual temperature values found at sites used in the model's development.
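For reference, the contextual range implied by a fitted exponential curve can be recovered by inverting it at the chosen error threshold; in the R sketch below, the coefficients 0.71 and 0.82 are those reported above, while the threshold value is a hypothetical stand-in for SE_U.

se_u <- 0.75                              # hypothetical acceptable-error threshold (SE_U)
r_c  <- max(0, log(se_u / 0.71) / 0.82)   # solve 0.71 * exp(0.82 * R_C) = SE_U for R_C (near zero here)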
Model application niche analysis: organizational transfer.-The Intrinsic MD index also could not explain variability in model error across Appalachian Highland application sites (Fig. 5c). Thirty-five percent of Appalachian Highland application sites were dissimilar from wetlands used in model development (i.e., Intrinsic MD_i index >0). Similar to riverine application sites, 33% of the Appalachian Highland application sites with an Intrinsic MD_i index >0 had higher percent cover of grasslands than sites used in model development. Fifty-four percent of Appalachian Highland application sites with an Intrinsic MD_i index >0 had soil pH levels outside the range found at sites used in model development. Unlike the more basic soils at riverine application sites, the Appalachian Highland wetlands were more acidic than wetlands used in model development. Again, the application nugget (C_A = 0.88 ± 0.10) was significantly higher (i.e., Welch two-sample t-test: t = 2.11, df = 73.51, P = 0.038) than the model development nugget (i.e., C_D = 0.59 ± 0.10), representing a significant contextual interpolation error.
We were able to explain a significant portion of model error (i.e., Model Error_i = 0.56 + 0.004 × exp[1.61 × Organizational MD_i], RSE = 0.71, df = 109) using the Organizational MD index in the Appalachian Highland region (Fig. 5d). Seventy-one percent of Appalachian Highland application sites had an Organizational MD index >0. The application nugget (C_A = 0.58 ± 0.05) was not significantly different (i.e., Welch two-sample t-test: t = −0.18, df = 63.94, P = 0.859) from the model development nugget. Not surprisingly, this suggested that the model was generalizable to freshwater riverine wetlands within the Appalachian Highland region. All of the variability in model error that was explained by the Organizational MD index was attributed to contextual extrapolation across wetland types (C_T = 3.08, C_Gmax = 2.49). Using the contextual range (R_C = 2.14), we concluded that the model was generalizable across 54% of the contextual range assessed (R_max = 4.00). Taken together, we concluded that the model was more generalizable across wetland types within the Appalachian Highland region than across riverine sites within the United States.
Model application niche analysis: spatial-organizational transfer.-We then used a model performance heat map to simultaneously assess the spatial and organizational transferability of our model (Fig. 6). As indicated by our model performance plots, the wetland ecological condition model was more transferable to model application sites within or below the elevation (i.e., <638 m above sea level) and mean annual temperature (i.e., <11.6°C) ranges found at sites used in model development (Fig. 6a). This held across all but the most dissimilar wetland types (i.e., Organizational MD index >3), where mean model error was relatively high across the full extent of the Spatial MD index. The most dissimilar wetlands included tidal wetlands, slopes, and brackish-to-saline flats. Further, we assessed the generalizability of our model by setting an acceptable mean model error (user-defined) and subsequently counting the number of cells that fell within this criterion. We excluded contextual dissimilarity spaces with fewer than two application sites (i.e., where standard error could not be calculated, shown as gray cells in Fig. 6b) and used the criterion that the model was generalizable in cells with a mean model error ≤1. We concluded that this model was generalizable across 35% of the spatial-organizational contextual dissimilarity space assessed.
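As a sketch of how such a heat map can be assembled, the R code below bins hypothetical application sites by their Spatial MD and Organizational MD values and summarizes the mean model error and its standard error per cell using base R and ggplot2 (the plotting package used for our figures); all data values and bin widths are illustrative assumptions.

library(ggplot2)
set.seed(4)
d <- data.frame(spatial_md = runif(478, 0, 3), org_md = runif(478, 0, 4),
                err = abs(rnorm(478, 0.9, 0.6)))                          # hypothetical site-level model errors
d$sp_bin  <- cut(d$spatial_md, breaks = seq(0, 3, by = 0.5), include.lowest = TRUE)
d$org_bin <- cut(d$org_md,     breaks = seq(0, 4, by = 0.5), include.lowest = TRUE)
cells <- aggregate(err ~ sp_bin + org_bin, data = d,
                   FUN = function(x) c(mean = mean(x), se = sd(x) / sqrt(length(x))))
cells <- do.call(data.frame, cells)                                       # flatten to err.mean and err.se columns
ggplot(cells, aes(sp_bin, org_bin, fill = err.mean)) + geom_tile()        # mean model error per dissimilarity cell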

DISCUSSION
Fig. 6. Model performance heat maps overlaying the Spatial multivariate dissimilarity (MD) index and the Organizational (i.e., wetland type) MD index for application sites in the National Wetland Condition Assessment database (i.e., data were available for n = 478). Each cell in the heat map bins model application sites by their unique combinations of the indices. Plot (a) shows the mean model error for application sites in each contextual dissimilarity space (i.e., cell), expressed as the absolute difference between measured and modeled mean coefficient of conservatism values (unitless). Light gray cells in this heat map highlight the contextual dissimilarity space where no application sites fell. Plot (b) depicts the standard error (SE) of the model error for each contextual dissimilarity space. In this heat map, light gray cells identify the contextual dissimilarity space where fewer than two application sites fell. This figure was created using ggplot2 (Wickham 2009).

A model's utility can vary between development and application sites, due to differences in the strength of predictor-response relationships (Whittingham et al. 2007, McAlpine et al. 2008), the direction of the relationships (Murray et al. 2011), and the predictors to be included in the model (Leftwich et al. 2012). With potential limits on resources for data collection and the high stakes associated with transferring models to new sites, there is value in developing tools that can better inform our model selection decisions. Both economic benefit transfer modelers (Rosenberger and Stanley 2006, Rosenberger and Phipps 2007) and species distribution modelers (Elith et al. 2010, Zurell et al. 2012, Owens et al. 2013, Mesgaran et al. 2014) have advanced our thinking about the importance of contextual dissimilarity when transferring models. We have expanded on these ideas by constructing a straightforward quantitative methodology to evaluate a model's application niche across contextual dissimilarity gradients. The power of our methodology lies in its flexibility and its ability to leverage preexisting datasets (and/or model results) to inform the model selection process. This methodology can be used for both empirical (e.g., statistical, indices) and process-based models, where the model user chooses what contextual variables to assess and what dissimilarity and validation metrics to use. We discuss some considerations of these choices in the subsections that follow.

Contextual variables selection
To perform a model application niche analysis, one must identify the biological, chemical, and/or physical contextual variables to evaluate. As illustrated in our case study, it is useful to organize contextual variables into two major categories, intrinsic and extrinsic. Intrinsic contextual variables define the limits of the model's predictor hyperspace (also known as covariate space in species distribution modeling), that is, the space encompassed by all valid combinations of predictors in the model. When evaluating intrinsic contextual variables, two issues should be addressed: extrapolation beyond the extent of the data used to develop the model, and the model's signal-to-noise ratio (Wiens 1989). While the signal-to-noise ratio can be evaluated directly by examining the original model data and uncertainty, model performance plots can be used to evaluate the model error associated with the model's extrapolation across intrinsic dissimilarity gradients.
Intrinsic dissimilarity gradients have been the focus of tools developed in species distribution modeling to assess the risks of model extrapolation (Elith et al. 2010, Zurell et al. 2012, Owens et al. 2013, Mesgaran et al. 2014). Extrapolation outside the predictor hyperspace is less problematic for simple monotonic relationships, in which predictor-response relationships continue in the same trajectory (Miller et al. 2004). For most monotonic relationships, we expect to see a nugget model that intersects the y-axis at the model error associated with C_D. Extrapolation is more problematic for non-linear relationships that are truncated (Anderson and Raza 2010) due to environmental limits or sampling bias at sites used in model development (Anderson and Gonzalez 2011). In these cases, we expect the model performance plot to have a nonzero-slope function in the extrapolated region of the predictor hyperspace (i.e., C_A ≈ C_D < C_T).
As was the case for the wetland ecological condition model, it is also possible for a model to perform poorly at application sites that exist within the model's predictor hyperspace. This results in a contextual interpolation error (i.e., C_D < C_A). Depending on the circumstances, this can indicate a range of issues, including methodological differences between model development and application sites, the lack of model calibration at the application sites, and/or a model structure that might not hold across extrinsic contextual variables. When care has been taken to calibrate the model and minimize methodological errors, the next step is to assess model performance plots across extrinsic contextual variables.
Gradients across extrinsic variables are less likely to be considered, or are held constant, during model development, potentially limiting model generalizability and transferability. As such, extrinsic variables are commonly cited as the cause of low predictability during model transfers. For example, extrinsic variables (e.g., regional climate, biogeography, and land use) have been cited as causes of low predictability in wetland HGM functional models (Cole et al. 2002), multi-scale models of koala distributions (McAlpine et al. 2008), nitrogen dioxide land use regression models (Poplawski et al. 2009, Allen et al. 2011), and the River Dee-based G-BASH model (Cresser et al. 2006), as well as of poor transfers of model parameters (e.g., Duan et al. 2006, Gan and Burges 2006). Economic benefit transfer modelers have established lists of relevant extrinsic variables for users to consider during model selection (Spash and Vatn 2006). Few analogous lists (e.g., factors of soil formation; Jenny 1941) currently exist for ecological models. However, geospatial datasets are widely available and offer a wide range of contextual variables for analyses (e.g., those used in the creation of the Spatial MD index).

Constructing contextual dissimilarity indices
Once contextual variables are selected, one must decide what types of dissimilarity metrics are most appropriate for model evaluation. Linear and continuous univariate variables (e.g., temperature, salinity, elevation) are particularly well suited for model performance plots and heat maps. They can be arranged along axes based on their natural gradients and viewed as arithmetic or rank differences. It is also possible to assess multiple contextual variables by collapsing them into additive (e.g., Intrinsic MD index) or weighted (e.g., Spatial MD index) multivariate indices. This is a useful approach when multiple contextual variables are expected to impact model performance within or among application sites (e.g., spatial transfer of the ecological condition model).
Numerous multidimensional metrics (hierarchical clustering, ordination, etc.) are available for the development of contextual dissimilarity indices (McCune and Grace 2002). Species distribution modelers have built indices and complementary tools, including the Multivariate Environmental Similarity Surface tool (Elith et al. 2010), the Mobility-Oriented Parity tool (Owens et al. 2013), and the Extrapolation Detection tool (ExDet; Mesgaran et al. 2014). Methods have also been described for the development of multidimensional metrics that combine mixed categorical and numeric datasets (e.g., Lee and Yun 2003). Each metric has its own nuances (for discussions, see Owens et al. 2013, Mesgaran et al. 2014), which should be carefully considered on a case-by-case basis before implementation. Many multidimensional metrics (and validation metrics) are limited by their lack of directionality, which can be important during a model application niche analysis. For example, two application sites that are the same number of units above or below the model's range for a given contextual variable are given the same dissimilarity value in the ExDet tool (Mesgaran et al. 2014). In our case study, we addressed this issue by separating upper and lower ranges into directional univariate distance metrics before the construction of our Spatial MD index. When developing contextual dissimilarity metrics, one should also consider the distribution of the data used in model development; in most cases, dissimilarity indices are based on a variable's mean, median, or range. This can be problematic, as ecological data are often strongly skewed and the model error can vary across the predictor hyperspace.
Qualitative variables (many of which are also multivariate) are also commonly used in ecology to explain context, but are challenging to convert into contextual dissimilarity metrics (for an example approach, see Lee and Yun 2003). When the qualitative variable has some hierarchical structure (e.g., the Cowardin wetland classification system, Cowardin et al. 1979; biogeographic regimes, for map, see USEPA 2011), tree distance metrics might be used. However, many of these variables are more complex, including ecological structural or functional classification schemes (e.g., land cover, trophic guild, soil type, phylogenic, and eco-region). For these variables, best professional judgment practices can be used, including the deconstruction of the classification scheme into component context variables, and subsequently assigning similarity rankings (e.g., Organizational MD index).

Collection of application site validation data
The third critical step in this methodology is to collect and synthesize quality validation data from model application sites across the contextual dimension(s) of interest. Validation data can be taken directly from the literature, when a model has already been transferred to application sites, or be calculated by synthesizing information from databases (e.g., as illustrated in our case study) and/or previous studies that contain the predictor and response variables. Good candidate models for a model application niche analysis are those that have been applied broadly. Some models that fit this description include CENTURY (Parton et al. 1987; see reference list at https://www.nrel.colostate.edu/projects/century/), DNDC (Li et al. 1992a, b; see reference list at http://www.globaldndc.net/information/publications-i-3.html), PnET (Aber and Federer 1992; see reference list at http://www.pnet.sr.unh.edu/publications.html), SWAT (Engel et al. 1993, Srinivasan and Arnold 1994, Arnold et al. 1998; see model review in Arnold et al. 2012), SLAMM (Park et al. 1986; see reference list at http://warrenpinnacle.com/prof/SLAMM/SLAMM_Bibliography.html), and LANDIS-II (Scheller and Mladenoff 2004; see reference list at http://www.landis-ii.org/publications). Unfortunately, many of these models are transferred without any formal exploratory niche analysis and without complete validation at the application sites. However, with proper documentation on model calibration and validation, such widely used models are well suited to model application niche analyses, with potential long-term benefits to future model users.
It should also be noted that the utility of this methodology can be affected by the quantity and quality of the independent data sources used in the model calibrations and validations (e.g., Rastetter 1996). When methodological errors are large, or model calibration is not performed at application sites, it can hinder the identification of model structural issues during the extrinsic dissimilarity analysis. For example, we did not calibrate the wetland ecological condition model for application sites. While there were clear reasons to doubt the model's structure for saline and tidal wetland application sites, where other variables (e.g., salinity, tidal hydrology, and storm surges) have already been identified as influential to mean CoC values (Cretini et al. 2012), we could not distinguish calibration issues from model structural issues when evaluating the spatial transfer to riverine sites. Upon further inspection, we found that mean CoC values in the riverine application dataset were significantly related to both soil pH and LDI (i.e., mean CoC = −0.31 × pH − 0.40 × LDI + 6.58506, adj. R² = 0.33, P < 0.001), suggesting that calibration might improve the model's transferability to riverine sites across the United States. Our objective was to demonstrate the applicability of the method we developed and how it can be used to diagnose confounding factors influencing model performance. If a user is interested in investigating the transferability of our wetland ecological condition model and understanding the ecological processes generating these differences in model performance, we strongly recommend calibrating the model for the chosen context.
When validation results are taken directly from the literature, it can be challenging to synthesize them. With few exceptions (e.g., Landis and Koch 1977, Araújo et al. 2005), there are no generally agreed-upon validation criteria in ecological modeling (Rykiel 1996, Gardner and Urban 2003, Stow et al. 2009). Although progress has been made on the classification of validation methods (e.g., Bennett et al. 2013), there is a growing need for consistency in validation metrics, and for transparent approaches to establishing criteria for what constitutes acceptable results across the range of tests performed. For example, the popularity of visual validation methods employing pattern recognition (Stow et al. 2009, Bennett et al. 2013, Kubicek et al. 2015) complicates the goal of standardization. Publication bias can further reduce the completeness of a model application niche analysis by emphasizing only those contextual hyperspaces where the model has been transferred successfully.

Future work
Improving the transparency and consistency of ecological model transfers is critical for building confidence in the management decisions that are based on those transfers. Currently, our methodology is most practical (1) for simple models where predictor and response datasets can be gathered from databases and the literature, and (2) when models have already been validated at application sites across a contextual gradient and model users have maintained good documentation on model calibration and validation for both successful and unsuccessful transfers. We do not recommend using this method for models where generalizability is expected to be weak at the outset. This is frequently the case when modeling distributions of common or widespread species (e.g., butterflies, studied by Kharouba et al. 2009). It is difficult to distinguish between suitable and unsuitable habitats for these species due to their phenotypic plasticity, or ability to adapt (Randin et al. 2006, Zanini et al. 2009).
Despite the inherent challenges for some ecological study systems (e.g., widespread species, high plasticity), we see possibilities for broadening the usefulness of the model application niche analysis methodology by (1) creating databases of geospatial datasets for contextual variables that are known to affect the transferability of ecological models, (2) developing contextual dissimilarity metrics and tools that can account for complexities such as qualitative variables and data distributions, (3) increasing the number of validations performed when models are transferred, (4) developing generally agreed-upon validation criteria and/or standardized methods, and (5) building databases of model application validation results (e.g., the EcoService Models Library described in Bruins et al. 2016) that include both successful and unsuccessful transfers. This methodology sets the stage for further inquiries into the development of consistent and transparent practices for use during model selection.