The importance of simulation assumptions when evaluating detectability in population models

Population monitoring is important for investigating a variety of ecological questions, and N-mixture models are increasingly used to model population size (N) and trends (λ) while estimating detectability (p) from repeated counts within primary periods (when populations are closed to changes). Extending these models to dynamic processes with serial dependence across primary periods may relax the closure assumption, but simulations used to evaluate models and inform effort (e.g., the number of repeated counts) typically assume p is constant or varies randomly across sites and years. Thus, it is unknown how these models perform under scenarios where trends in p confound inferences on N and λ, and conclusions regarding effort may be overoptimistic. Here, we used global positioning system data from greater sage-grouse (Centrocercus urophasianus) to inform simulations of the detection process for lek counts of this species, and we created scenarios with and without linear annual trends in p. We then compared estimates of N and λ from hierarchical population models fit either with single maximum counts or with detectability estimated from repeated counts (dynamic N-mixture models). We also explored using auxiliary data to correct counts for variation in detectability. Uncorrected count models consistently underestimated N by >50%, whereas N-mixture models without auxiliary data underestimated N to a lesser degree because of unmodeled heterogeneity in p, such as that due to age. Nevertheless, estimates of λ from both types of models were unbiased and similar for scenarios without trends in p. When p declined systematically across years, uncorrected count models underestimated λ, whereas N-mixture models estimated λ with little bias when all sites were counted repeatedly. Auxiliary data also reduced bias in parameter estimates.
Evaluating population models using scenarios with systematic variation in p may better reveal potential biases and inform effort than simulations that assume p is constant or random. Dynamic N-mixture models can distinguish between trends in p and N, but they also require repeated counts within primary periods for accurate estimates. Auxiliary data may be useful when researchers lack repeated counts, wish to monitor more sites less intensively, or require unbiased estimates of N.


INTRODUCTION
Broad-scale, long-term monitoring is invaluable for basic and applied research, including characterizing population trends and evaluating ecological theories (Nichols 1991, Pollock et al. 2002, Lindenmayer et al. 2012). Many monitoring programs conduct counts of individuals, where the expected count E(c) is the product of detectability (p) and the population size of interest (N): E(c) = pN. As such, counts may yield useful indices of trends in N if detectability is constant (or varies independently of N; Johnson 2008, Nichols et al. 2009). However, p can vary over space and time due to factors such as observer differences and vegetation structure, and this potential for error has long invited criticisms of indices (Nichols 1992, Anderson 2001, Ellingson and Lukacs 2003, White 2005). Specifically, trends in indices may be misleading when trends in p also occur (Link and Sauer 1998, Kéry 2008, Yamaura 2013, Veech et al. 2016), which could confound inferences, frustrate conservation planning, and waste limited resources. Because many monitoring programs involve counts, N-mixture models (Royle 2004) offer the potential to analyze these data while accounting for variation in detectability (Kéry et al. 2005, 2009, McCaffery et al. 2016). Using a robust design where repeated counts are collected from a population that is closed within years or seasons, the observed count (c_ijt) at site i on survey j in year t can be modeled as the outcome of a binomial distribution, given population size N_it and detection probability p_ijt: c_ijt | N_it ~ binomial(N_it, p_ijt). Importantly for modeling population trends, this approach uses information from repeated counts to distinguish annual changes in N from systematic changes in p, whereas these processes are confounded in analyses of indices (Kéry 2008, Yamaura 2013).
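The count model above can be sketched in a few lines. The abundance N, detectability p, and number of draws here are purely illustrative values, not estimates from any dataset:

```python
import random

def binomial_draw(n, p, rng):
    """Draw one binomial(n, p) outcome as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))

# Counts at one hypothetical site: true abundance N with detectability p.
N, p = 50, 0.4
rng = random.Random(42)
counts = [binomial_draw(N, p, rng) for _ in range(20000)]
mean_count = sum(counts) / len(counts)
# Because E(c) = p * N, the average count settles near p * N (here 20),
# not near the true abundance of 50 -- the bias an index inherits
# whenever p < 1.
```

This is why an uncorrected count is only a valid trend index when p stays constant: any systematic change in p shifts E(c) = pN even when N does not change.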
Simulation studies are useful for evaluating models such as N-mixture models, including the level of effort (e.g., repeated counts) needed to distinguish trends in N from trends in p (Kéry and Royle 2016). Such studies are necessary because monitoring data may lack repeated counts from sites for historical reasons, or because limited resources for monitoring force a trade-off between temporal replication and spatial representation. If monitoring data occur across a series of years, one could analyze single counts collected each primary period using dynamic N-mixture models (Dail and Madsen 2011), assuming abundance in a given year is a function of population size and change from the previous year. This serial dependence across years may alleviate the need for repeated within-season counts during closed secondary periods to distinguish between trends in N and p. However, simulations that evaluate dynamic N-mixture models typically assume that p is constant or varies randomly and independently of N across sites and years (Dail and Madsen 2011, Zipkin et al. 2014, Hostetler and Chandler 2015, Bellier et al. 2016). Under such conditions, detectability is unlikely to confound inferences regarding the latent population trend (Johnson 2008, Nichols et al. 2009), and therefore, it is unknown whether dynamic N-mixture models can distinguish trends in N from trends in p, particularly with few or no repeated counts. Furthermore, without considering systematic changes in detectability, inferences may not differ between N-mixture models and models of indices, and researchers could conclude the approaches are equivalent (e.g., Barker et al. 2018).
To illustrate these implications, we use the greater sage-grouse (Centrocercus urophasianus; hereafter, sage-grouse), a species of conservation concern whose populations are monitored across western North America. Each spring, male sage-grouse gather at breeding display sites (leks) to court females. As with many bird species, detectability of sage-grouse during lek counts can be thought of as the product of four distinct components (Table 1; Nichols et al. 2009). Two components relate to the distribution and movement of individuals: the probability that a bird's home range overlaps the sampling unit (p_s; in the case of sage-grouse, attending a lek at least once in a season) and the probability of presence at the sampling unit during the count (p_p). The other two components characterize the detection process for birds given their presence at the sampling unit: the probability of being available for detection during the count (p_a; e.g., not concealed by vegetation or topography) and the probability of detection conditional on availability (p_d; the observer correctly perceives and counts an individual). A major source of variation in detectability for sage-grouse occurs because lek attendance (p_p) varies with date and time (Beck and Braun 1980, Emmons and Braun 1984, Walsh et al. 2004, Fremgen et al. 2019, Wann et al. 2019). As a result, trends in sage-grouse populations are typically estimated using the maximum observed number of males from repeated counts each year at each lek (hereafter, peak counts; Connelly et al. 2003). Additionally, detectability of grouse may be influenced by visual obstruction from vegetation, snow, and topography (Fremgen et al. 2016), as well as visibility changes due to weather and time of day (Baumgardt et al. 2017). Nevertheless, factors that influence p or N independently are estimable even with indices (Yamaura 2013, Banks-Leite et al. 2014).
More concerning in our view are factors affecting detectability components that also vary with population size (temporally and spatially). For example, trends in vegetation cover following disturbance such as wildfire could affect both detectability (Blomberg et al. 2013, Fremgen et al. 2016) and population size (Blomberg et al. 2012). Changes in observer skill also may coincide with population trends (Link and Sauer 1998). Previously, simulations indicated that N-mixture models recovered population trends despite reduced effort or missing lek counts (McCaffery et al. 2016), but it is unknown how the model performs under scenarios with systematic changes in p.
Using sage-grouse lek counts as a model study system also offers opportunities to identify model limitations and potential solutions. For example, aggregating individual encounter histories into counts results in information loss because p is assumed constant among individuals (Kéry and Royle 2016, Barker et al. 2018). Variation in detectability among individuals is well recognized (Otis et al. 1978) and may lead to underestimates of population size with N-mixture models (Kéry and Royle 2016, Veech et al. 2016). Individual heterogeneity in lek attendance (p_s and p_p) could occur for sage-grouse because yearling males attend leks at different rates than adults (>1 yr old; Jenni and Hartzler 1978, Walsh et al. 2004, Wann et al. 2019), yet age is difficult to ascertain during counts. Yearlings also may remain on the periphery of leks where they are less visible (Jenni and Hartzler 1978). Additionally, some males in a given year skip attending leks altogether (Blomberg et al. 2013, Gibson et al. 2014), and therefore, they may belong to the population but are not sampled during lek counts with probability 1 − p_s. In either case, N-mixture models cannot account for these sources of detection bias, yet the consequences for abundance and trend estimates are unclear. One potential solution to unmodeled heterogeneity in detectability is incorporating auxiliary data in analyses (Martin et al. 2015, Ketz et al. 2018), which can be useful particularly for datasets that lack repeated counts. Here, we evaluate performance of population assessment models under scenarios with confounded trends in abundance and detectability, using lek counts of sage-grouse as a model study system. We estimate population trends from scenarios with and without declining trends in p across years, and we compare estimates of population size and trends from dynamic N-mixture models and models where the response is the maximum observed number of males (uncorrected peak counts).
For realism, we use empirical estimates of sage-grouse lek attendance and sightability to inform simulations of detectability in count data. Additionally, we evaluate effects of age and sampling probability (p s ) on estimates, and the performance of estimators when auxiliary data are used to correct counts for changes in detectability.

METHODS

Study area and lek attendance data
To characterize temporal variation in lek attendance of males during spring, we monitored daily movements of sage-grouse at eight sites distributed throughout northern Nevada from 2012 to 2017 (Wann et al. 2019). These sites occurred in sagebrush-steppe and desert ecosystems with average site elevation ranging between 1732 and 2181 m. Each site contained 2-11 known active leks (i.e., with two or more male sage-grouse displaying at a lek during the study period). Additional details on local climate and floristic composition of sites are reported in Wann et al. (2019).
We collected global positioning system (GPS) locations from 79 individually marked male sage-grouse to determine when they were present on leks. All sage-grouse were captured in accordance with the U.S. Geological Survey Western Ecological Research Center Animal Care and Use Protocol (WERC-2015-02), and we describe sage-grouse capture and aging methods in Appendix S1. We fit captured males with a 30g GPS platform transmitter terminal (North Star Science and Technology, King George, Virginia, USA), and we used a solar array and rechargeable batteries to power the GPS units that had an operational life between two and three years. We programmed the GPS units to collect one to three locations within three hours of sunrise (when birds are typically on leks), and we used the PTT Tracker decoding software to download locations (version 1.0.0.3.7; GeoTrak, Inc. 2011).
To determine whether grouse locations occurred at leks, we mapped lek boundaries in the spring and summer of 2016 and 2017 (Appendix S1). Error in GPS locations was generally <26 m but could be as high as 100 m. To reduce influence of this uncertainty, we restricted analyses to locations with error <26 m and we buffered lek polygons by 25 m. We also restricted analyses to data collected 1 January to 30 June and between 60 min before sunrise through 180 min after sunrise, and we considered males present on a lek if a given location occurred inside a buffered lek polygon during this time window. Several individuals were monitored over multiple years (16 over two years, six over three years, and one over four years), which yielded 110 individual-year attendance records including 85 adults and 25 yearlings.
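The screening rules above amount to a simple record filter. The field names below (error_m, min_from_sunrise, month) are hypothetical stand-ins for the actual GPS records, used only to illustrate the rules:

```python
# Sketch of the location-screening rules: positional error <26 m, fix taken
# between 60 min before and 180 min after sunrise, and date within
# 1 January-30 June (months 1-6).
def keep_fix(error_m, min_from_sunrise, month):
    return error_m < 26 and -60 <= min_from_sunrise <= 180 and 1 <= month <= 6

fixes = [
    {"error_m": 12, "min_from_sunrise": 30, "month": 4},   # kept
    {"error_m": 80, "min_from_sunrise": 30, "month": 4},   # error too large
    {"error_m": 12, "min_from_sunrise": 200, "month": 4},  # too long after sunrise
    {"error_m": 12, "min_from_sunrise": 30, "month": 8},   # outside 1 Jan-30 Jun
]
kept = [f for f in fixes if keep_fix(**f)]
```

A retained fix falling inside a 25-m-buffered lek polygon would then be scored as lek presence; the spatial overlay itself is omitted here.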

Lek attendance model
We jointly modeled annual lek attendance probability (attending a lek at least once in a season; p_s) and within-season attendance probability (p_p) using a zero-inflated Bernoulli model with complete interactions for age (yearling or adult). We began by modeling annual lek attendance of individual-year k (X_k) as the outcome of a Bernoulli model with age-specific probability p_s,age(k), where age(k) denotes a subscript for age (1 = adult, 2 = yearling) depending on the individual-year: X_k ~ Bernoulli(p_s,age(k)). We then modeled whether each individual was present at a lek during observation j (y_jk = 1) with another Bernoulli distribution conditional on p_p and the latent state of an individual attending a lek at all within a year (X_k = 1): y_jk | X_k ~ Bernoulli(p_p,jk X_k).
We fit age-specific covariates for p_p on the logit scale (b_age(k)) including ordinal date, minutes after sunrise (time), and random terms for site (ω_site(k)) and year (ε_year(k); see Study area and lek attendance data): logit(p_p,jk) = b_0,age(k) + b_1,age(k) date_jk + b_2,age(k) date_jk^2 + b_3,age(k) date_jk^3 + b_4,age(k) time_jk + b_5,age(k) time_jk^2 + ω_site(k) + ε_year(k). Random terms were normally distributed with mean 0 and standard deviation σ. For comparison and to inform our count corrections, we also analyzed location data without accounting for differences in age (no-age model) because age is typically not recorded during counts. Before analyses, we standardized continuous covariates by subtracting the mean and dividing by the standard deviation. We analyzed both models using a Bayesian approach with JAGS (version 4.2.0; Plummer 2003) and the rjags package (version 4-6; Plummer et al. 2016) in R (version 3.3.2; R Development Core Team 2016). After discarding 15,000 posterior samples during the adaptive phase, we saved 100,000 samples from each of three parallel chains for the age model, and 85,000 samples for the no-age model. We provide details on priors and assessment of convergence in Appendix S2.
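The zero-inflated Bernoulli structure above can be sketched as a small simulator (the random site and year terms are omitted for brevity, and the logit-scale coefficients are illustrative, not posterior estimates):

```python
import math
import random

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_attendance(p_s, b, dates, times, rng):
    """Zero-inflated Bernoulli: X ~ Bernoulli(p_s) flags whether a male
    attends any lek this season; given X = 1, presence on observation j is
    Bernoulli(p_p_j), with cubic date and quadratic time effects on the
    logit scale (covariates assumed already standardized)."""
    X = 1 if rng.random() < p_s else 0
    y = []
    for d, t in zip(dates, times):
        p_p = inv_logit(b[0] + b[1]*d + b[2]*d**2 + b[3]*d**3
                        + b[4]*t + b[5]*t**2)
        y.append(1 if (X == 1 and rng.random() < p_p) else 0)
    return X, y

b = [1.0, 0.2, -0.4, -0.1, -0.3, -0.1]  # hypothetical coefficients
X, y = simulate_attendance(0.89, b, dates=[-1, 0, 1], times=[-1, 0, 1],
                           rng=random.Random(1))
```

The zero-inflation is what separates p_s from p_p: a male with X = 0 produces zeros on every observation regardless of date or time.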

Population trend and lek count simulations
We simulated scenarios with λ = 0.9, 1.0, and 1.1 for declining, stable, and increasing population trends, respectively. For each population trend, we evaluated effects of trends in detectability by creating three sets of simulations: (1) where p varied randomly (without trends across years); (2) with annual trends in p_p (probability of presence); and (3) with annual trends in p_a p_d (availability and detection, see below). For brevity, we did not consider scenarios with annual trends in both p_p and p_a p_d, although these may occur in practice. To evaluate the influence of repeated counts on parameter estimates, we created scenarios with the same counts and population sizes but where we randomly assigned three repeated counts to a proportion of leks (0.3, 0.5, 0.75, and 1), resulting in 36 scenarios (three population trends × three detectability trends × four repeated count proportions). Leks without repeated counts were counted once each year. Finally, we assessed effects of unmodeled heterogeneity in detectability due to mixed ages of sage-grouse in observed samples by using the same scenarios except that all individuals were adults. Thus, we simulated 72 unique scenarios, and we generated 250 random datasets for each scenario.
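The full simulation design above can be enumerated directly; the string labels below are our own shorthand for the factor levels:

```python
from itertools import product

# 3 population trends x 3 detectability-trend settings x 4 repeated-count
# proportions = 36 scenarios, doubled by the adults-only variant, with
# 250 simulated datasets per scenario.
lams = [0.9, 1.0, 1.1]
p_trends = ["none", "trend_in_p_p", "trend_in_p_a_p_d"]
props = [0.3, 0.5, 0.75, 1.0]
ages = ["mixed_ages", "adults_only"]

scenarios = list(product(lams, p_trends, props, ages))
n_datasets = 250 * len(scenarios)  # 72 scenarios -> 18,000 datasets
```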
We begin by distinguishing between the population size of individuals associated with a lek (N) and the superpopulation size of individuals that attend a lek at least once in a year (E(N*) = N p_s). This distinction is important because the latter population size is typically the parameter of inference for N-mixture models (i.e., Nichols et al. 2009), whereas the former includes individuals in the population that are not sampled with counts. We also note that although the number of individuals present can vary among lek counts (secondary periods) because p_p < 1, population sizes (N and N*) are constant within a lekking season and vary across years. Each simulated dataset consisted of 100 leks monitored over 10 yr, with initial population size for each lek i in year t = 1 (N_i1) drawn from a negative binomial distribution with expected abundance Λ = 30 and overdispersion parameter α = 5, where σ^2 = Λ + Λ^2/α. We simulated changes in the latent population size across years with an exponential model where N_it was a function of the previous year's population size and the rate of population change (λ; Hostetler and Chandler 2015): N_it | N_i,t−1 ~ Poisson(λ N_i,t−1). To account for differences in annual and within-season attendance probabilities between yearlings and adults, we assumed that 70% of individuals at each lek and year were adults (N^adult_it) and the rest were yearlings (N^yearling_it; Walsh et al. 2004). For the detection process, the number of adults in the superpopulation (N*^adult_it) was drawn from a binomial distribution given latent population size N^adult_it and age-specific sampling probability (annual lek attendance; p_s,adult): N*^adult_it | N^adult_it ~ binomial(N^adult_it, p_s,adult). The observed count of adults on occasion j (c^adult_ijt) was then a binomial outcome given p^adult_ijt and N*^adult_it: c^adult_ijt | N*^adult_it ~ binomial(N*^adult_it, p^adult_ijt), where p^adult_ijt is the product of the probabilities of presence, availability, and detection (p_p,adult,ijt × p_a p_d,t).
We specified age-specific covariates for p_p (i.e., b_0,adult is the intercept for adults) on the logit scale including date and time: logit(p_p,adult,ijt) = b_0,adult + b_1,adult date_ijt + b_2,adult date_ijt^2 + b_3,adult date_ijt^3 + b_4,adult time_ijt + b_5,adult time_ijt^2. We used mean posterior estimates from the age model (see Lek attendance model) as coefficients for p_s, date, and time parameters. For availability and detection probabilities, we assumed their product was the mean estimate previously reported from sightability models (p_a p_d,t = 0.87; Fremgen et al. 2016). We similarly simulated counts of yearlings (c^yearling_ijt), and the observed count used in analyses was c_ijt = c^adult_ijt + c^yearling_ijt. For scenarios with linear annual trends in detectability, we added a negative effect of year (b_6 = −0.1 on the logit scale, with year standardized) to either p_p or p_a p_d.
For each simulated dataset, we randomly assigned three dates (March 15-May 15) and three times (30 min before to 90 min after sunrise) to each lek each year, corresponding with count protocols for when lek attendance is highest (Connelly et al. 2003). We then recorded date_ijt, time_ijt, and c_ijt for analysis, and N_it (the true, simulated population size) and N*_it (the superpopulation size) for comparisons of estimator performance. We also determined the maximum count for each lek and year (c_it) for analyses with peak count models. We provide R code for simulating datasets via ScienceBase (https://doi.org/10.5066/p91l28pg).
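The data-generating hierarchy described above (negative binomial initial abundance, Poisson growth at rate λ, binomial thinning to the superpopulation, binomial counts) can be sketched as follows. For brevity this uses a single age class and a constant presence probability, whereas the actual simulations used age-specific, date- and time-dependent attendance (the ScienceBase code is authoritative):

```python
import math
import random

rng = random.Random(7)

def rpois(lam):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    L, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= L:
            return k
        k += 1

def rbinom(n, p):
    return sum(rng.random() < p for _ in range(n))

def rnegbin(mean, alpha):
    """Negative binomial via a gamma-Poisson mixture, so that
    variance = mean + mean**2 / alpha."""
    return rpois(rng.gammavariate(alpha, mean / alpha))

# Scenario values from the text; p_p here is a fixed illustrative value.
lam, p_s, p_p, p_a_p_d = 1.0, 0.87, 0.7, 0.87
n_leks, n_years, n_visits = 100, 10, 3

N = [[0] * n_years for _ in range(n_leks)]
counts = [[[0] * n_visits for _ in range(n_years)] for _ in range(n_leks)]
for i in range(n_leks):
    N[i][0] = rnegbin(30, 5)                  # initial abundance, mean 30
    for t in range(1, n_years):
        N[i][t] = rpois(lam * N[i][t - 1])    # exponential population growth
    for t in range(n_years):
        N_star = rbinom(N[i][t], p_s)         # superpopulation attending leks
        for j in range(n_visits):
            counts[i][t][j] = rbinom(N_star, p_p * p_a_p_d)  # observed count
```

Note that each count is bounded by N* rather than N, which is why p_s is not estimable from the counts alone and must come from auxiliary data.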

Model comparison
We analyzed each dataset with six hierarchical population models, where each model had the same state process but a different observation process depending on whether the response was peak counts or repeated counts. We used an exponential model for the state process, and initial population sizes were estimated from a negative binomial distribution: N_i1 ~ NB(Λ, α). We characterized the observation process with either a Poisson (P) or negative binomial (NB) distribution for peak counts (e.g., c_it ~ Poisson(N_it)), or a binomial observation process for repeated counts (Nmix). We fit N-mixture models with either a linear annual trend on p (b_6) for scenarios with annual trends or a random term varying by year (ω_t) for scenarios without annual trends. To study the use of auxiliary data for peak count models, we used parameter estimates from the no-age lek attendance model (see Lek attendance model) to correct peak counts for annual and within-season lek attendance rates (P+ and NB+). We used mean posterior estimates of annual attendance (p_s), coefficients for within-season attendance (p_p), and simulated values for availability and detection (p_a p_d), and applied these corrections to peak counts (c_it) given the date and time associated with each count. For scenarios with linear trends in p_p or p_a p_d, we assumed that auxiliary data contained information on such trends, and these were incorporated in count corrections. For N-mixture models, auxiliary data could help account for variation that is not estimable, such as p_s. Here (Nmix+), we used estimates of p_s from the no-age attendance model and moment matching (Hobbs and Hooten 2015) to create an informative beta prior for p_s, where μ = a/(a + b), σ^2 = ab/((a + b)^2 (a + b + 1)), and p_s ~ beta(a, b). We then used this prior to estimate N, where N*_it | N_it ~ binomial(N_it, p_s) and c_ijt | N*_it ~ binomial(N*_it, p_ijt).
Using informative priors based on other analyses (rather than jointly estimating multiple models) is useful for reducing computation time of analyses (Martin et al. 2015).
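The moment-matching step can be written as a small helper that inverts the beta mean and variance formulas above. The mean of 0.87 comes from the no-age attendance model; the standard deviation of ~0.033 is our assumption, chosen to roughly reproduce the beta(88.2, 13.2) prior reported in the Results:

```python
def beta_from_moments(mu, sd):
    """Moment matching (Hobbs and Hooten 2015): find beta(a, b) with the
    given mean and standard deviation, using mu = a / (a + b) and
    var = a * b / ((a + b)**2 * (a + b + 1))."""
    var = sd ** 2
    nu = mu * (1 - mu) / var - 1   # nu = a + b, the prior "sample size"
    return mu * nu, (1 - mu) * nu

a, b = beta_from_moments(0.87, 0.0333)  # roughly beta(88, 13)
```

Larger assumed sd shrinks a + b, making the prior on p_s less informative; the choice of sd therefore controls how strongly the auxiliary attendance data constrain the abundance estimate.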
For each model, we monitored N̂ (estimated mean total population size: (1/T) Σ_t Σ_i N̂_it for n = 100 leks and over T = 10 yr) and λ̂ (mean rate of population change) for comparison with the simulated, true values of N, N*, and λ, and we saved 40,000 posterior samples from each of three chains after discarding 10,000 iterations from the adaptive phase. We summarized simulations with % bias ([estimate − truth]/truth × 100%), root mean square error (RMSE), proportion of 95% credible intervals (CrI) containing the simulated value (coverage), and credible interval length. We report priors in model code available in ScienceBase. Finally, for each model, we evaluated fit using posterior predictive checks (Bayesian P values) based on chi-square discrepancy test statistics (Kéry and Royle 2016, Conn et al. 2018), and we considered P values <0.1 and >0.9 to indicate lack of fit.
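The simulation summaries above reduce to a few lines; the three "replicates" below are toy inputs for illustration only:

```python
def summarize(estimates, intervals, truth):
    """Simulation summaries: % bias of posterior means, RMSE, coverage of
    95% credible intervals, and mean credible interval length."""
    n = len(estimates)
    pct_bias = sum((e - truth) / truth for e in estimates) / n * 100
    rmse = (sum((e - truth) ** 2 for e in estimates) / n) ** 0.5
    coverage = sum(lo <= truth <= hi for lo, hi in intervals) / n
    ci_length = sum(hi - lo for lo, hi in intervals) / n
    return pct_bias, rmse, coverage, ci_length

# Toy check: three replicate estimates of a parameter whose true value is 1.0
est = [0.9, 1.0, 1.1]
cis = [(0.8, 1.0), (0.9, 1.1), (0.7, 0.95)]  # third interval misses truth
pct_bias, rmse, coverage, ci_length = summarize(est, cis, truth=1.0)
```

The toy inputs illustrate a pattern seen in the Results: estimates can be unbiased on average (% bias near zero) while intervals still fail to cover the truth.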

RESULTS

Lek attendance
Probability of attending a lek at least once in a season (sampling probability p_s; Table 1) was higher for adults (p_s,adult = 0.89, 95% credible interval [CrI] = 0.82-0.95) than yearlings (p_s,yearling = 0.75, 95% CrI = 0.57-0.90). Overall annual lek attendance from the no-age model (p_s) was 0.87 (95% CrI = 0.79-0.93), which is equivalent to the beta(88.2, 13.2) prior that we supplied to Nmix+ models. We estimated substantial variation in within-season attendance probability (p_p), with presence peaking in early April and declining after sunrise (Fig. 1). We also estimated differences in attendance by age class, with attendance lower, and somewhat later in the season and morning, for yearlings than adults. Without accounting for age differences, within-season attendance probabilities were slightly lower than those of adults (Fig. 1). For simulation scenarios with a linear annual trend (b_6 = −0.1), the median probability of presence (p_p) declined from 0.43 at t = 1 (range = 0.25-0.50) to 0.24 at t = 10 (range = 0.12-0.50) for yearlings, and from 0.72 at t = 1 (range = 0.45-0.81) to 0.52 at t = 10 (range = 0.25-0.81) for adults. For availability and detection (p_a p_d), this same trend resulted in a decline from 0.91 to 0.81 over 10 yr.

Population size (N and N*)
Overall, bias in population size estimates varied little among scenarios with different annual trends in N or p (Fig. 2; Appendix S3). Estimates of population size were most biased from peak count models without corrections for p (P and NB), and these models underestimated N* by >40% and N by >50%. Absolute bias and RMSE were lowest, and coverage was highest, for N̂ from models with corrections for detectability (P+, NB+, and Nmix+). All peak count models increasingly underestimated N and N* as the proportion of sites with repeated counts decreased, resulting in overestimates of N from P+ and NB+ models when all sites were counted repeatedly and underestimates when 30% of sites were counted repeatedly. Bayesian P values for negative binomial models suggested adequate fit (0.2 < P value < 0.8), whereas Poisson models generally indicated lack of fit (P value >0.9; Appendix S3: Tables S1, S3 and S5). N-mixture models without an informative prior for p_s (annual lek attendance; Nmix) performed relatively well in estimating the number of individuals attending leks at least once (N*) when all sites were counted repeatedly, and yielded less bias than peak count models without corrections, but still underestimated N by ~18% (Fig. 2). Informative priors for p_s reduced absolute bias in N̂ from Nmix+ when all leks were counted repeatedly, but these models still underestimated N by ~6%. N-mixture models did not suggest lack of fit when most sites were counted repeatedly, but as the proportion of sites with repeated counts decreased to 0.3, absolute bias and RMSE increased, coverage decreased, and Bayesian P values indicated lack of fit (Appendix S3: Tables S2, S4 and S6).

Fig. 1. Within-season lek attendance probability (p_p) of sage-grouse by date (top) and minutes after sunrise (bottom), with differences by age (left) or averaged across age classes (right). Date predictions were based on a mean time of 104 min after sunrise, whereas time predictions were based on a mean date of 18 April. Shaded regions denote 95% credible intervals. Vertical red dashed lines represent the start and end dates (March 15-May 15) or time intervals (30 min before to 90 min after sunrise) of lek counts.

Fig. 2. Violin plots summarizing % bias ([estimate − truth]/truth × 100%) of posterior means of mean total population size estimates (N̂, averaged over 10 yr) relative to the true population size (N) and superpopulation size (where E(N*) = N p_s) for different simulated values of population rate of change (λ) and proportion of sites with repeated counts. Detectability (p) was the product of the probabilities of sampling (p_s), presence (p_p), and availability and detection (p_a p_d), and scenarios were created either without a linear annual trend in p (A), with p_p declining annually (B), or with p_a p_d declining annually (C). Each model used either a Poisson (P) or negative binomial (NB) observation process with peak counts, or a binomial observation process with repeated counts (Nmix). We also corrected peak counts for variation in p (P+ and NB+) or included an informative prior for p_s (Nmix+). Violin plots denote the density of % bias from 250 simulated datasets for each scenario, black horizontal lines indicate the median from each scenario, and dashed lines represent no bias in parameter estimates.

Population trends (λ)
All models generally performed well in estimating λ for scenarios without annual trends in p, and all but one model yielded similarly unbiased estimates of λ (Fig. 3A) and low RMSE (Appendix S3: Tables S7 and S8). The exception was the NB+ models, which tended to underestimate λ slightly. Estimates of λ were more precise (shorter credible interval lengths) from peak count models than from N-mixture models, but these narrower intervals often excluded the true simulated value despite the overall lack of bias in mean trend estimates. Coverage of λ tended to be greater for N-mixture models (0.94-0.99; Appendix S3: Table S8) than for peak count models (0.05-0.64; Appendix S3: Table S7). Coverage also was somewhat lower for P+ and NB+ than for P and NB models.
Most models performed worse under scenarios with annual trends in probability of presence (p_p). Peak count models consistently underestimated λ when p_p declined annually, and absolute bias in λ̂ increased with fewer repeated counts and as λ increased (Fig. 3B). Correcting peak counts for detectability improved % bias, RMSE, and coverage for Poisson and negative binomial models (Appendix S3: Table S9). N-mixture models also produced more biased estimates when annual trends in p_p occurred but performed well when all sites were counted repeatedly each year. As the proportion of leks with repeated counts decreased, N-mixture models increasingly underestimated λ, and specifying an informative prior for p_s had little effect on λ̂ from Nmix+ (Fig. 3B; Appendix S3: Table S10). Similar patterns emerged for scenarios with trends in p_a p_d (Fig. 3C), but estimates tended to be less biased, RMSE smaller, and coverage greater (Appendix S3: Tables S11 and S12) than for scenarios with trends in p_p.

Effects of age on model accuracy
When we removed the effect of age on attendance by simulating all individuals as adults, all models yielded somewhat higher estimates of N and N* (Appendix S4: Fig. S1) than scenarios with age differences (Fig. 2). However, this resulted in overestimates of N and N* when we corrected peak counts for detectability (P+ and NB+; Appendix S4: Fig. S1, Tables S1, S3 and S5). N-mixture models with and without an informative prior for p_s showed greater coverage, reduced RMSE, and increased precision (shorter credible interval length; Appendix S4: Tables S2, S4 and S6) compared to scenarios with age differences (Appendix S3: Tables S2, S4 and S6), but again coverage decreased and absolute bias and RMSE increased with fewer repeated counts (Appendix S4: Fig. S1). Measures of fit followed similar patterns as in scenarios with age differences. Removing age differences had little effect on λ̂ (Appendix S4: Fig. S2, Tables S7-S12).

DISCUSSION
Accounting for variation in detectability when estimating population size and trends is an active field of research, yet researchers seldom evaluate how models respond to systematic changes in detectability that could confound inferences on latent parameters of interest, and instead use simulation scenarios that assume p is constant or random over space and time. Despite substantial variation in lek attendance by date and time, population trend estimates were similarly unbiased from dynamic N-mixture and peak count models when detectability did not vary systematically across years. Recovery of simulated population trends also was not affected by reduced effort (fewer repeated counts). However, we found that these scenarios offered an incomplete assessment of models, and this was apparent when we simulated trends in both population size and probability of presence (p_p) or probability of availability and detection (p_a p_d), and therefore trends in p. In these cases, dynamic N-mixture models estimated k with little bias when all leks were counted repeatedly, peak count models without corrections underestimated population trends, and performance of peak count models improved when auxiliary information was incorporated into analyses. Absolute bias from N-mixture models also increased as the proportion of leks with repeated counts decreased. These results illustrate the importance of fully exploring the response of population models such as dynamic N-mixture models to systematic changes in p when evaluating model performance. Scenarios with trends in p also may better identify situations where incorporating auxiliary data in analyses is useful for inferences.

Fig. 3. Violin plots summarizing % bias ([estimate − truth]/truth × 100%) of posterior means of mean population rate of change (k, averaged over 10 yr) for different simulated values of k and proportion of sites with repeated counts. Detectability (p) was the product of probability of sampling (p_s), probability of presence (p_p), and probability of availability and detection (p_a p_d), and scenarios either lacked a linear annual trend in p (A), p_p declined annually (B), or p_a p_d declined annually (C). See Fig. 2 caption for model name definitions. Violin plots denote the density of % bias from 250 simulated datasets for each scenario, black horizontal lines indicate the median from each scenario, and dashed lines represent no bias in parameter estimates.
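As a minimal illustration of the summary metrics used in Fig. 3 (a sketch only; the function and variable names are ours, not from the study's analysis code):

```python
def pct_bias(estimate, truth):
    """% bias = ([estimate - truth] / truth) * 100%, as in Fig. 3."""
    return (estimate - truth) / truth * 100.0

# Detectability as a product of components (values are hypothetical):
# p = p_s (sampling) * p_p (presence) * p_a*p_d (availability and detection)
p_s, p_p, p_ad = 0.9, 0.5, 0.8
p = p_s * p_p * p_ad  # overall per-count detectability, here 0.36

# e.g., a posterior mean of 45 when the simulated truth is 50:
bias = pct_bias(45.0, 50.0)  # -10.0, i.e., a 10% underestimate
```

A decline in any one component across years produces a trend in overall p, which is the situation the simulation scenarios were designed to probe.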
Our results agree with previous studies indicating that N-mixture models are robust to reduced survey effort when estimating trends in lek populations while p is constant or varies randomly (McCaffery et al. 2016). Similarities between N-mixture and uncorrected count models also were observed for datasets where p was constant or random (Yamaura 2013). Our study suggests that these previous results likely did not indicate a particular attribute of the N-mixture model, but instead reflected the absence of trends in p within these study systems (or the lack of covariates to estimate such trends). When p is constant, or variation in p is random and therefore the expectation (on average) is constant, detectability is less likely to confound inferences on population trends (Monroe et al. 2016). Simulation scenarios with constant (or random) p are still useful for certain objectives, for example, to evaluate sensitivity of models to different levels of detectability (e.g., low vs. high p; Zipkin et al. 2014). However, without also simulating scenarios with trends in p, researchers assume, paradoxically, the very conditions under which indices can yield unbiased estimates of k, thereby obviating the need for methods that account for detectability when estimating population trends. Simulations with constant or random p also may encourage use of models and study designs that cannot distinguish between potential trends in p and N, such as uncorrected peak counts or dynamic N-mixture models with few or no repeated counts.
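The confounding can be sketched numerically (a minimal illustration with hypothetical rates, not the study's simulation): a log-linear trend fit to uncorrected expected counts absorbs a linear decline in p and underestimates k, while recovering k when p is constant.

```python
import numpy as np

def trend_from_counts(counts):
    # slope of a log-linear regression, back-transformed to an annual rate (k)
    t = np.arange(len(counts))
    slope = np.polyfit(t, np.log(counts), 1)[0]
    return np.exp(slope)

years = 10
k_true = 0.98                              # true annual rate of change
N = 100.0 * k_true ** np.arange(years)     # expected abundance per year

p_const = np.full(years, 0.5)              # constant detectability
p_trend = np.linspace(0.5, 0.35, years)    # linear annual decline in p

# uncorrected counts have expectation N_t * p_t
k_hat_const = trend_from_counts(N * p_const)  # recovers k_true
k_hat_trend = trend_from_counts(N * p_trend)  # < k_true: the decline in p
                                              # masquerades as a decline in N
```

With constant p the index trend is unbiased, which is exactly why simulations restricted to that case cannot reveal the failure mode.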
Most models under consideration here produced biased estimates of N. This was expected for uncorrected peak count models, whereas N-mixture models performed well in estimating the number of individuals that attended leks (N*) when all leks were counted repeatedly. As with population trend estimates, survey effort had a strong influence on N̂, and absolute bias from uncorrected peak count and N-mixture models increased as the proportion of sites with repeated counts decreased. Additionally, both annual (p_s) and within-season (p_p) lek attendance of sage-grouse varied with age, and unmodeled age structure contributed to bias in N̂. Previous studies found adult male sage-grouse attended leks earlier and more frequently than yearlings (Emmons and Braun 1984, Walsh et al. 2010, Wann et al. 2019). Yearlings also may attend alternative leks more often than adults (Emmons and Braun 1984, Schroeder and Robb 2003; but see Gibson et al. 2014, Fremgen et al. 2017, Wann et al. 2019), contributing to lower lek attendance rates. In addition, we estimated that some males did not attend a lek at all in a given year, and therefore could not be detected by lek counts. This phenomenon may be attributed to the hourly timing of our GPS recordings, although attendance rates in our study were similar to previous estimates from Nevada (Blomberg et al. 2013). Males not sampled also could be attending alternative leks (especially those unknown to observers), skipping reproduction, or mating off-lek (Blomberg et al. 2013, Gibson et al. 2014). We found dynamic N-mixture models underestimated N by ≥18%, and absolute bias in N̂ decreased when we assumed all individuals were adults because yearlings attended leks less frequently than adults and therefore were less likely to be counted.
Dynamic N-mixture models still underestimated N by ≥9% in scenarios without yearlings due to individuals that did not attend leks, and additional data and models to estimate annual lek attendance (p_s) and yearling abundance could be useful when estimates of absolute population size are needed (Knape et al. 2018). In contrast, unmodeled heterogeneity from age and attendance had little effect on k̂ from N-mixture models, and so benefits from auxiliary data were more limited. Still, we assumed age ratios and annual attendance rates were constant across space and time, and thus our results do not preclude the use of auxiliary data to account for trends in either factor.
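A toy calculation (all numbers hypothetical) illustrates why attendance-based corrections depend on the assumed age structure: a pooled attendance rate estimated from a mix of adults and yearlings overcorrects counts from an adults-only population.

```python
# True composition and per-count attendance rates (hypothetical values)
N_ad, N_yr = 80, 20          # adults, yearlings
p_ad, p_yr = 0.8, 0.5        # adults attend leks more frequently

# Pooled attendance rate implied by auxiliary data on a mixed population
p_pooled = (N_ad * p_ad + N_yr * p_yr) / (N_ad + N_yr)   # 0.74

# Correcting an expected count from the same mixed population recovers N...
N_hat_mixed = (N_ad * p_ad + N_yr * p_yr) / p_pooled     # 100.0

# ...but applied to an adults-only population, it overestimates N
N_hat_adults_only = (N_ad * p_ad) / p_pooled             # ~86.5 vs true 80
```

The same logic explains, in miniature, why unmodeled age structure biased N̂ in our simulations.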
Adjusting peak counts with auxiliary data generally yielded less biased estimates of k and N for scenarios with declining trends in p and often outperformed N-mixture models for scenarios where ≤75% of leks were counted repeatedly. In the Great Basin of western North America, sage-grouse populations are increasingly threatened by wildfire and the spread of exotic grasses, and counts conducted at >1000 leks are used for monitoring and to inform management (Coates et al. 2016).
Given that resources for monitoring are finite, detectability corrections could increase spatial representation by studying more leks less intensively. However, this approach also assumes that auxiliary data are comparable and appropriate to the study system (e.g., reflecting the population's age structure), and we found that corrected count models overestimated N for scenarios with only adults, likely because attendance rates were based on data from adults and yearlings. Considering the extent of sage-grouse distribution and monitoring (Schroeder et al. 2004), additional covariates and replication over space and time may be needed for auxiliary data to be useful. Weighing the costs and benefits of collecting GPS data or conducting lek counts presents opportunities for a formal optimal sampling framework (e.g., Sanderlin et al. 2014, Reich et al. 2018), which could inform monitoring design while considering the resources and effort required for adequate inference.

Goodness of fit based on posterior predictive values indicated N-mixture models fit the data adequately when all leks were counted repeatedly but suggested increasing lack of fit as the proportion of sites with repeated counts declined. This occurred similarly across scenarios for both Nmix and Nmix+ and corresponded with increasing absolute bias in N̂ but not in k̂. Additionally, posterior predictive checks were conservative, often indicating lack of fit (P value >0.9) only when N was underestimated by >40% with Nmix and >20% with Nmix+. Lack of power to detect bias with posterior predictive checks and other goodness-of-fit tests also was noted in recent studies (Conn et al. 2018, Duarte et al. 2018). Posterior predictive checks were even less informative for peak count models and corresponded with neither N̂ nor k̂. These results may reflect the limitations of posterior predictive checks when evaluating bias in estimates of latent variables and processes (Kéry and Royle 2016).
Furthermore, goodness-of-fit tests may perform worse when detectability is low and sample sizes are small (Duarte et al. 2018), although we did not evaluate this directly in our study. Additional simulation studies and more targeted discrepancy measures, graphical methods, and cross-validation could provide further insights into model adequacy (Kéry and Royle 2016, Conn et al. 2018, Knape et al. 2018).
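A generic posterior predictive check of the kind discussed above can be sketched as follows (simulated draws stand in for MCMC output; the chi-square discrepancy and all values are assumptions, not the study's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for MCMC output: posterior draws of expected counts at each site
n_draws, n_sites = 500, 20
lam = rng.gamma(shape=10.0, scale=1.0, size=(n_draws, n_sites))
y_obs = rng.poisson(10.0, size=n_sites)     # observed counts

def chi_sq(y, lam):
    # chi-square style discrepancy between counts and expected values
    return np.sum((y - lam) ** 2 / lam, axis=-1)

t_obs = chi_sq(y_obs, lam)                  # discrepancy of observed data
y_rep = rng.poisson(lam)                    # replicated data, one set per draw
t_rep = chi_sq(y_rep, lam)                  # discrepancy of replicated data

# Bayesian P value: extreme values (near 0 or 1) suggest lack of fit
p_value = np.mean(t_rep > t_obs)
```

Because the discrepancy is computed on the observed counts rather than the latent N, a model can pass such a check while still yielding biased N̂, consistent with the limited power we observed.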
Although we simulated constant rates of annual lek attendance (p_s) across leks and years, attendance may vary systematically and could bias trend estimates from lek counts (Blomberg et al. 2013). For example, density dependence was previously estimated using peak counts (Garton et al. 2011, Monroe et al. 2017, Edmunds et al. 2018), but attendance also may be density-dependent (Blomberg et al. 2013) and could confound estimates of density dependence in k̂. Similarly, cumulative winter precipitation influences soil moisture and productivity of semi-arid ecosystems dominated by sagebrush (Artemisia spp.; Schlaepfer et al. 2012) and was positively correlated with peak attendance rates in Nevada (Wann et al. 2019). We also assumed the population age structure was constant across years, but the proportion of yearlings (with lower attendance rates than adults) will likely vary with reproductive success and recruitment rates. In addition, the relative contribution of individual detectability components to trends in p warrants further study; we found scenarios with negative annual trends in p_p produced more bias in k̂ than scenarios with similar trends in p_a p_d. Probability of presence also was generally lower than p_a p_d, and lower detectability can contribute to greater bias in parameter estimates (Yamaura 2013, Zipkin et al. 2014, Kéry and Royle 2016, Veech et al. 2016, Duarte et al. 2018). Finally, we assumed no within-season mortality, but this too could influence p_s and p_p (Fremgen et al. 2017), and the implications for monitoring and population trend analyses are worthy of additional study.
We conclude that dynamic N-mixture models can distinguish between trends in p and N with a robust design where all or most sites are counted repeatedly each year, whereas these processes are confounded in models of indices and can lead to biased estimates of population trends. Use of auxiliary data also may be helpful for estimating N and k, particularly when repeated counts are not obtainable. Although variation in within-season lek attendance (p_p) was the greatest contributor to imperfect detection in our study system, this should not constrain our inferences to study systems with lekking species. Indeed, temporary emigration can bias estimates of population size among other bird species (Chandler et al. 2011). Our results can be extended to other systems where N-mixture models are applied, including single-season studies with closed populations that vary across space rather than time (Kéry 2008, Yamaura 2013, Veech et al. 2016), and more broadly to other hierarchical models that account for detectability. For example, occupancy models fit with constant p overestimated occupancy trends because detection (which varied annually with abundance) and occupancy were confounded (Latif et al. 2018). The prevalence of systematic changes in p is difficult to generalize because it likely varies by study system and research question. Increasing the spatial and temporal extent of monitoring and randomization during study design may reduce, but not eliminate, the risk of systematic biases in detection. We therefore encourage researchers to investigate the response of models to trends in detectability and suggest caution when using dynamic N-mixture models with few or no repeated counts because these models may be susceptible to the same biases that arise from models of indices.
Similarly, a previous study found that with sufficient spatial replication (>50 leks), only one count per lek each year was needed for adequate inferences on population trends (Fedy and Aldridge 2011), but our results suggest this may require revisiting in the context of trends in detectability. We also recommend additional consideration of potential biases from unmodeled variation in detectability, such as individual heterogeneity and sampling probability (p s ).

DATA AVAILABILITY
R scripts, JAGS code, and data needed to run simulations and fit models are available via ScienceBase (https://doi.org/10.5066/p91l28pg).