Categorization of species as native or nonnative using DNA sequence signatures without a complete reference library

Abstract New genetic diagnostic approaches have greatly aided efforts to document global biodiversity and improve biosecurity. This is especially true for organismal groups in which species diversity has been underestimated historically due to difficulties associated with sampling, the lack of clear morphological characteristics, and/or limited availability of taxonomic expertise. Among these methods, DNA sequence barcoding (also known as “DNA barcoding”) and by extension, meta‐barcoding for biological communities, has emerged as one of the most frequently utilized methods for DNA‐based species identifications. Unfortunately, the use of DNA barcoding is limited by the availability of complete reference libraries (i.e., a collection of DNA sequences from morphologically identified species), and by the fact that the vast majority of species do not have sequences present in reference databases. Such conditions are critical especially in tropical locations that are simultaneously biodiversity rich and suffer from a lack of exploration and DNA characterization by trained taxonomic specialists. To facilitate efforts to document biodiversity in regions lacking complete reference libraries, we developed a novel statistical approach that categorizes unidentified species as being either likely native or likely nonnative based solely on measures of nucleotide diversity. We demonstrate the utility of this approach by categorizing a large sample of specimens of terrestrial insects and spiders (collected as part of the Moorea BioCode project) using a generalized linear mixed model (GLMM). Using a training data set of known endemic (n = 45) and known introduced species (n = 102), we then estimated the likely native/nonnative status for 4,663 specimens representing an estimated 1,288 species (412 identified species), including both those specimens that were either unidentified or whose endemic/introduced status was uncertain. Using this approach, we were able to increase the number of categorized specimens by a factor of 4.4 (from 794 to 3,497), and the number of categorized species by a factor of 4.8 from (147 to 707) at a rate much greater than chance (77.6% accuracy). The study identifies phylogenetic signatures of both native and nonnative species and suggests several practical applications for this approach including monitoring biodiversity and facilitating biosecurity.


INTRODUCTION
The genomics revolution is transforming the studies of conservation biology, ecology, and evolution (Hudson 2008, Allendorf et al. 2010. New affordable sequencing technologies, coupled with decreases in traditional sequencing costs and in the time required to process specimens (see, for example, Pomerantz et al. 2018) have made it possible to sequence DNA from entire communities of individuals (Borisenko et al. 2008, Kress et al. 2009, Cristescu 2014 and allow for the production and publication of genomic information from non-model organisms from across the tree of life (Ekblom and Galindo 2011). One approach for collecting and analyzing sequence data that has become widely utilized is known as DNA sequence barcoding (also known as "DNA barcoding"; Hebert et al. 2003, Savolainen et al. 2005, Ratnasingham and Hebert 2007. This technique has been used in many contexts; including biodiversity inventories , cryptic species discovery (Hebert et al. 2004a), species identification (Hebert et al. 2003, 2004b, Kress et al. 2005, species delimitation (Pons et al. 2006), biomonitoring (Pilgrim et al. 2011), biosecurity (Saunders 2009, Collins et al. 2012, Dejean et al. 2012, Porco et al. 2013, Thomas et al. 2016, for phylogenetic and population genetic studies (Hajibabaei et al. 2007), and for observations of within-species genetic diversity (e.g., Johnson et al. 2002, Havill et al. 2018. While there are numerous well documented limitations to the uses of DNA barcoding, including when the technique is used to reconstruct ancient evolutionary relationships, when non-specific amplification is not accounted for, and when fixed intra-and inter-interspecific thresholds are utilized (Moritz and Cicero 2004, Thalmann et al. 2004, DeSalle et al. 2005, Meyer and Paulay 2005, Rubinoff et al. 2006, Buhay 2009, Moulton et al. 2010, one of the core uses for DNA barcoding is the comparison of query sequences to reference DNA sequences to determine the percentage of sequence similarity. As such, the primary objective for most DNA barcoding projects is to obtain species-level identifications from specimens that for one reason or another lack a morphologically derived species-level identification. In order to do so, however, a researcher first requires access to a well-curated reference library that includes DNA sequences from previously identified species, ideally including all potential species matches. While numerous efforts to produce these reference libraries are currently underway (e.g., Blagoev et al. 2015, Cardoni et al. 2015, Gwiazdowski et al. 2015, Oliveira et al. 2015, Savage et al. 2015, Warne et al. 2015, Xu et al. 2015, Iftikhar et al. 2016, Lobo et al. 2016, Rimet et al. 2016, Yang et al. 2016, Bell et al. 2017, Pennisi 2017, until complete reference libraries exist for all taxonomic groups and for all geographic locations, the comparison of a query sequence to sequences in public reference libraries such as the Barcode of Life Database (BOLD; Ratnasingham and Hebert 2007) or GenBank (Benson et al. 2013), or a researcher's private database, will result in one of the following four outcomes: (1) a high-percentage match to a specimen with a species-level identification, (2) a low-percentage match to a specimen with a species-level identification, (3) a high-percentage match to a sequence from an unidentified specimen (usually noted as some variation of "Undet sp"), or (4) a low-percentage match to a sequence from an unidentified specimen. In the case of the first outcome, the investigator can use the specieslevel identification to inform their research/management decisions, though one needs to be cautious when utilizing these identifications in public reference libraries due to the many known misidentifications (Buhay 2009, Tixier et al. 2011). In the case of the last three outcomes, after the expenditure of time, money, and resources, the specimen in question remains unidentified, and for most research objectives provides little utility, though in some instances higher level categorizations (i.e., genus, family, order) may be obtained and be sufficient. While the collection of DNA barcode sequences from unidentified specimens provides useful genomic data, unfortunately at the same time that DNA barcoding techniques are being used with increasing frequency to guide management decisions (e.g., Park et al. 2015), particularly for the identification of alien invasive species (Saunders 2009, Dejean et al. 2012, Porco et al. 2013, Thomas et al. 2016, the taxonomists needed to provide the foundational species-level identifications to create reference libraries are themselves becoming endangered species (Gotelli 2004, Agnarsson et al. 2007, W€ agele 2011, Wheeler 2014)! The Moorea BioCode project was established, in part, to address the coupled needs for increased sampling of underrepresented taxa in biodiverse tropical regions and hands-on training for taxonomic specialists (Field and Davies 2015). As such it represents the first comprehensive inventory of all non-microbial life in a complex tropical ecosystem (Meyer 2017). The Moorea BioCode project has resulted in the DNA sequencing of both terrestrial and aquatic organisms from a variety of diverse taxonomic groups (Nitta 2008, Nitta et al. 2011, Diaz et al. 2013, Bonnet and Lotufo 2015, Leray et al. 2015, Ramage et al. 2017. However, as for many other community sequencing projects, the effort collected specimens (>20,000), and obtained DNA sequences (~5,000) from our focal taxonomic groups (spiders and insects) at a rate much faster than the specimens could be assigned to morphologically based species-level identifications or even species-level identifications through traditional DNA barcoding approaches.
This challenge in identification prompted the question as to whether one could apply basic concepts from invasion biology to categorize unidentified specimens and/or species whose biogeographic status (i.e., native or nonnative) is unclear. For example, if one assumes that the establishment of most introduced species into novel habitats results in genetic bottlenecks (Dlugosch and Parker 2008), then we might predict that DNA sequences from nonnative species would have lower levels of intraspecific diversity compared to those from native species. Similarly, if nonnative species establish because they can utilize a novel niche (MacDougall et al. 2009), avoid detection from native natural enemies (Keane and Crawley 2002), are uniquely superb dispersers (Gillespie et al. 2012), and/or are more genetically distinct from native members of the community (Strauss et al. 2006), it is likely that the genetic distance to the nearest neighbor species in the collection might be greater for nonnative species then for native species (i.e., phylogenetic distance; see review by Gallien and Carboni 2017). Because the native/nonnative status of a species is often difficult to determine in natural communities, here we propose to use a training data set of known introduced and known endemic species with the assumption that introduced and endemic species might be at opposite ends of the native/nonnative spectrum and that if the above measures of genetic diversity are distinct between native and nonnative species, that they should be even more so between endemic and introduced species. Thus, the objectives of the current study were (1) to determine whether metrics of within species similarities and between species distances differ between known endemic and known introduced species in French Polynesia, (2) to construct a statistical model based on measures of genetic differences between, and genetic variation within known endemic and known introduced species, (3) to use the model to predict the likely native or likely nonnative status of unidentified and uncategorized specimens, and (4) to explore possible ecological applications for the method.

Field work, specimen sampling, and laboratory protocols
Specimens were collected in the field, assigned a preliminary identification to the lowest possible taxonomic level, and entered in the BioCode Field Information Management System (FIMS; Deck et al. 2012). When possible, taxonomic specialists then provided more refined genus-or species-level identifications. Community sampling, DNA extraction, and barcode sequencing methods of specimens for a variety of taxonomic groups as part of the Moorea BioCode project are described elsewhere (Nitta 2008, Nitta et al. 2011, Diaz et al. 2013, Bonnet and Lotufo 2015, including methods specific for insects (Ramage et al. 2017). For each specimen, a DNA barcode sequence (corresponding to the 5 0 region of cytochrome oI) was generated following standard protocols and recorded by the BioCode Laboratory Information Management System (LIMS; Parker et al. 2012), a free plugin developed for the Geneious software system (Kearse et al. 2012; available online). 12 All Bio-Code sequences and their associated metadata are available via The Genomic Observatories Metadatabase (Deck et al. 2017; data available online). 13

Data filtering
As a first filter, sequences of less than 300 base pairs (b.p.) were removed from the data set to ensure that included sequences had sufficient overlap to allow for pairwise distance estimates in downstream analyses. Each sequence was compared to published sequences in the NCBI GenBank database using the blastn search algorithm (Altschul et al. 1990) run locally using the command-line interface and compared to the nt sequence database (downloaded 11 July 2016) with all default summary statistics retained in tab-delimited format for the top-match sequence, except that we also retained the taxids number for the top-math accession. The taxids number was then used to query the NCBI Taxonomy Database to provide higher level taxonomic information using a custom interactive python script blast2taxoninfo.py (H. Krehenwinkel, G. de Kerdrel, J.C. Andersen, and R. Gillespie, unpublished data). The order-level morphological identification of each specimen was then compared to the order-level assignment from the top-match sequence in the NCBI GenBank Database, and any specimens for which there was disagreement between morphological-based order assignments and the order assignment for the top-match sequence were considered as "contaminant" sequences and removed from analyses. DNA sequences, locality collection information, field collection identification, and NCBI top-match information for each retained specimen are provided in Data S1: CollectionInformation.csv.

Species delineation and categorization
For the spiders and for each order of insect, unique alignments were generated using the nucleotide alignment software MUSCLE (Edgar 2004) as implemented through the EMBL-EBI web service . Alignments were subsequently refined in Geneious using the Translational Align sub-option for the Multiple Alignment tool based on visual examinations of the reading frame prior to refinement using the Invertebrate Mitochondrial genetic code. Refined alignments were inspected and edited by hand when necessary, and further filtered to remove any sequences that did not have >100 b.p. of sequence overlap with every other sequence in the alignment. Specimens were then assigned to species-groups using Automatic Barcode Gap Discovery (ABGD; Puillandre et al. 2012) using the default values with all partitions and tree files were output using the -a setting. While these groups are technically operational taxonomic units (OTUs), for simplification, we will refer to these groups as "species," though we acknowledge that "species" may not be appropriate in all instances. To determine the optimal delineation scheme from ABGD, we then examined the slopes for the results of the number of species identified using the Recursive Partitions and the Initial Partitions and the prior intraspecific divergence, and for each order chose the value (P) that represented the intercept of these two slopes.
Using the species-group assignments identified above, we then updated specimen identifications using a two-step approach. First, for any species from which one or more specimens had a collections-based species-level identification, that identification was applied to all other unidentified specimens assigned to the same species. Second, the labels for the remaining species that lacked species-level identifications were updated using the NCBI GenBank best-match subject sequences as follows: for any species for which a specimen had a ≥97% match to a published sequence with a species-level identification, that species was assigned the species-level identification from the published sequence; for any species for which a specimen had a ≥90 but <97% match to a published sequence with a species-or genus-level identification, that species was assigned the genus-level identification from the published sequence; for any species for which a specimen had a ≥80% but <90% match to published sequence with a species-, genus-, or family-level identification, that species was assigned the family-level identification sequence from the published sequence; the taxonomic IDs for all remaining species were then updated to include the order name followed by "sp." (e.g., Diptera sp.). The original specimen identifications, the species-group to which they were assigned by ABGD, the updated specimen identifications, and whether those updates were based on morphological or molecular methods are provided in Data S1: Collec-tionInformation.csv.

Biogeographical status categorization
For each species for which we could obtain a specieslevel identification, we then assigned that species the biogeographical categorizations presented for it by Ramage (2017). The categorizations are provided in Data S1: RamageChecklist.csv.

Model development
For each order, a distance matrix based on that order's nucleotide alignment was generated. Using a custom script (provided in Data S1: CustomRscript.R) written in the statistical language R v 3.1.3 (R Core Team 2014), we then used the distance matrix to estimate the average distance between individuals within a species (i.e., average similarity) and average distance between individuals of nearest-neighbor species (i.e., average distance). The data was then subset to include a training data set that included the above statistics only for species with species-level identifications and a biogeographical categorization of either endemic or introduced in Ramage (2017), and the full data set that included the statistics for all species regardless of whether or not they had a species-level identification or a known status of endemic/introduced. Preliminary analyses based on Welch's two-sample t test using the training data set (Appendix S1: Fig. S1) indicated that there were highly significant differences between endemic and introduced species based on measures of average distance (t = 5.897, df = 114.1, P value < 0.0001), while there were no significant differences based on measures average similarity (t = 0.958, df = 112.5, P = 0.34). We then constructed four generalized linear mixed models (GLMMs), using the R package lme4 (Bates et al. 2015) with species categorization as the response variable coded as binomials (i.e., endemic = 1, and introduced = 0), average similarity and/or average distance as fixed effects, and we included order as a random effect because we were concerned that the model might have better fit for some orders than others. Models were then compared based on AIC c scores using the R package AICcmodavg (Mazerolle 2017) to determine the model with the best fit for the data set. Using the loadings from the best-fit GLMM calculated using the training data set, we then assigned scores from 0 to 1 to each species in the full data set using the predict command in R. Ninety-five percent confidence intervals (CI) from the GLMM assignment scores for endemic and introduced species in the training data set were then calculated (mean AE 1.96 9 SE). Any species in the full data set with a predicted value ≤ the upper 95% CI for introduced species was assigned the label "GLMM-Nonnative," any species in the full data set with a predicted value ≥ the lower 95% CI for endemic species was assigned the label "GLMM-Native," and any species in the full data set with a predicted value that fell between the two CIs was assigned the label "GLMM-Undet." Error rates were then calculated globally and for each order by recording the percentage of species in the training data set that had a known endemic/introduced categorization but were incorrectly assigned the opposite native or nonnative status based on the GLMM.

Specimen sampling and data filtering
After removal of sequences of <300 base pairs, and the removal of sequences identified as "contaminants" based on order-level mismatches between the morphological identifications and the order-level identification for their best-match sequence in GenBank, the final data set included cytochrome oxidase I (COI) sequences from 4,663 specimens. Species-level identifications were given to 2,172 specimens using a combination of morphologically based identifications, identifications published in Ramage et al. (2017), or identifications based on matches of ≥97% similarity to a published sequence in the NCBI GenBank database.
Model development AIC c scores for the four GLMM models based on between species distances are presented in Table 1. The model with the lowest AIC c score was determined to be For introduced species in the training data set, the mean GLMM value was 0.2582 with a 95% confidence interval of 0.2291-0.2873. For endemic species in the training data set, the mean value was 0.3839 with a 95% confidence interval of 0.3444-0.4233. Mean values and 95% confidence intervals for individual orders are presented in Table 2.

GLMM-based biogeographical categorizations and error-rates
Using the GLMM value thresholds described above, specimens in the full data set were assigned labels of either GLMM-Nonnative (n = 2,040), GLMM-Native (n = 1,457), or GLMM-Undet (n = 639). For an additional 509 specimens, we were unable to calculate a GLMM value as they were the only representatives of their respective species. Two-hundred and seventythree species were predicted to be likely native, 393 species were predicted to be likely nonnative, 100 species were unassigned as their GLMM values fell between the 95% confidence intervals for known introduced and endemic species. Among endemic species, 24 species were predicted to be likely native, 10 species were predicted to be likely nonnative, five species fell between the 95% CIs for endemic and introduced species, and six species were assigned "NA," as they were singleton species (total unassigned, 22.2%). Among introduced species, 49 species were predicted to be likely nonnative, 23 species were assigned as likely native, eight species fell between the 95% CIs for endemic and introduced species, and 22 species were assigned "NA" as they were singleton species (total unassigned 22.6%). GLMM scores for all species are presented in Data S1: SummaryGeneticInformation.csv and summarized for each order including error rates in Table 3 and presented graphically in Fig. 1.

Within and between species variation
Based on GLMM classifications, the method provides estimates of a phylogenetic signature of both native and nonnative species (Data S1: SummaryGeneticInformation.csv). The mean values predicted by the GLMM for within species variation (i.e., average similarity) for native species was 0.0076 AE 0.0006 and for nonnative species was 0.0095 AE 0.0010, and for between species variation (i.e., average distance) for native species was 0.0531 AE 0.0015 and for nonnative species was 0.1788 AE 0.0025. GLMM estimates are presented in Data S1: SummaryGeneticInformation.csv.

DISCUSSION
DNA sequences are increasingly being used for the rapid quantification of biodiversity (Smith and Fisher 2009), and for some taxonomic groups may provide a more accurate overview of species diversity than traditional (i.e., morphological) methods (Teasdale et al. Note: Akaike information criterion corrected for sample size (AIC c ) scores for GLMM models including log-likelihood (LL) values, number of parameters (K), AIC c score, change in AIC c scores from one model to the next (ΔAIC c ), and the models weight (w). Note: Values are reported as the mean value for species in each order followed by the lower 99% CI and the upper 99% CI in parentheses.

July 2019
CATEGORIZING UNIDENTIFIED SPECIMENS Article e01914; page 5 2013, Stein et al. 2014). While these approaches can be used to highlight biogeographic regions that are at risk due to anthropogenic disturbances (Bilgin et al. 2016) and have great potential for use by ecologists, evolutionary, and conservation biologists (Valentini et al. 2009, Kress et al. 2015, frequently the utility of these methods are constrained by the presence of incomplete reference libraries (Elias et al. 2007, Gonzalez et al. 2009). Here we have presented a technique that can be utilized by any community DNA sequencing project to rapidly categorize unidentified specimens or specimens from species whose biogeographic status is unknown as representing either likely native or nonnative species. For example, using a traditional reference library approach (i.e., combining morphological identifications with matches to sequences in public databases with species-level identifications), we were able to categorize 794 of the 4,663 sequenced specimens (n = 147 species) as being either endemic or introduced species. In contrast, using our GLMM approach, 3,497 of the 4,663 sequenced specimens (n = 707 species) could be categorized as either likely native or likely nonnative. We discuss several possible applications for this approach below.

Applicability for biosecurity and quarantine monitoring
Biosecurity, particularly in regards to the identification of specimens in quarantine situations, has benefited greatly from the use of DNA sequencing approaches to identify known pests Ball 2005, Ball andArmstrong 2006) and exotic species (Pejovic et al. 2016, Ardura andPlanes 2017), and in some locations national, and even international, programs are being established to integrate DNA sequencing and biosecurity efforts (e.g., Hodgetts et al. 2016). Yet, DNA approaches fail to identify potential quarantine species when previously sequenced individuals of that species are missing from the reference library. While previous research has shown that the utility of DNA sequences to improve biosecurity monitoring can further be increased through the integration of multiple forms of analysis (Collins et al. 2012), other approaches that might improve the identification of these "known-unknown" and "unknown-unknown" specimens that are absent from reference libraries includes the use of multi-species multi-locus coalescent (Dowton et al. 2014, though see Collins and Cruickshank 2014 for a critique) or methods that utilize molecular clocks coupled with population genetics to determine whether a species arrived in a location prior to human colonization (Pisa et al. 2015). While our approach cannot provide species identifications for specimens absent from the reference library, it can be used to prioritize which specimens are examined by trained taxonomists. Under our approach, the time required to provide a morphological identification could be streamlined by reducing the number of specimens and prioritizing those specimens (based on their GLMM scores, for example) that a trained taxonomist examines  Notes: Below each categorization, the numbers of species predicted to be likely endemic, likely introduced, as well as the number of species that could not be categorized by the model either because the GLMM value fell between the 95% Confidence Interval for Introduced and Endemic species (Undet) or because the species was represented by only a single specimen (NA), are shown along with the error rate (i.e., the percentage of species assigned as likely introduced for endemic species, or likely endemic for introduced species). The global error rate for each order is provided in the rightmost column.
in a given day. In addition, by working in concert with trained specialists, when species-level assignments are provided by the taxonomist, the sequence data for the specimen and its associated metadata could then be used to update the reference library and the GLMM model to improve future predictions. However, further validation of the approach and explorations of what are acceptable error rates is required, preferably from multiple localities and from multiple taxonomic groups.

Limitations and future directions
While we are hopeful that our approach can be broadly applied to both biodiversity and biosecurity issues, we need to highlight several limitations that must be addressed before it can be used more widely. The greatest limitation to our approach is the requirement for community-wide sampling, preferably obtained using standardized collection methods from random localities: methods that might miss rare or cryptic species and result in the collection of large numbers of "common" specimens. A second challenge for tropical ecosystems comes from the lack of endemic species in existing reference libraries. Because it is more likely that cosmopolitan species will have DNA sequences in public reference libraries, our model (and others generated for different study regions) might be inherently biased towards the categorization of introduced species. In certain circumstances, however, species-level identifications might not be necessary for endemic lineages where all members of a genus (or higher taxonomic categorization) are endemic to the focal locality, and thus the generic (or higher) assignments could be utilized to assign those specimens as endemic when building the GLMM. A third limitation comes in regards to picking a biologically meaningful cutoff for the GLMM scores to differentiate endemic and introduced species. Here, we adopted a conservative approach, and only assigned species as likely native or likely nonnative based on 95% CIs from known endemic and known introduced species. This resulted in a large number of specimens (n = 639) and species (n = 113) being uncategorized as their GLMM scores fell between the upper 95% CI for introduced species and the lower 95% CI for endemic species. One possible solution for determining biologically meaningful cutoffs is the use of machine learning, an approach we are currently exploring. A fourth potential limitation of our model is in regards to the effects of multiple introductions. Certain taxonomic groups are likely to be introduced repeatedly, and often from multiple locations, thus resulting in additive genetic variation and or presenting short branching  (2017) is before the colon and the biogeographic assignment based on the GLMM predictions is presented after the colon. Categorizations follow Table 3, except that two categories for the GLMM predictions (Undet and NA) have been combined into a sing category here labeled "Not Categorized." patterns similar to endemic species that might influence our predictive model. A fifth potential limitation of our model is that it requires multiple samples from each species. In our sample, 509 specimens represented the only examples of their respective species, and thus we were unable to calculate within species diversity. A sixth potential limitation of our model is that it may work better for certain orders than for others. For example, while unclassified species from most orders were assigned to a mix of both native and nonnative categories, for several orders all species were classified as either native or nonnative. Whether this is the result of uneven sampling, too few species in the training data set, and/or different rates of evolution, is unknown, but we expect that in the future order-specific analyses will likely improve the overall predictive power of the model. Finally, and perhaps the greatest limitation of the approach, it is unclear to which ecosystems our approach can most appropriately be applied. Evolutionary biologists have long focused on island ecosystems due to their isolation, their high rates of endemism, and their suitability for observing the effects of evolution (Darwin 1859, Wallace 1880, Gillespie and Roderick 2014, Cressey 2015. However, whether our approach is transferable to continental systems, where species mixing occurs at higher rates and phylogenetic patterns may be more diffuse, is unclear, though we enthusiastically encourage the testing of our method for categorizing community barcode sequences collected in additional settings.