From DNA sequences to microbial ecology: Wrangling NEON soil microbe data with the neonMicrobe R package
Corresponding Editor: R. Chelsea Nagy.
Abstract
Soil microbial communities play critical roles in various ecosystem processes, but studies at a large spatial and temporal scale have been challenging due to the difficulty in finding the relevant samples in available data sets as well as the lack of standardization in sample collection and processing. The National Ecological Observatory Network (NEON) has been collecting soil microbial community data multiple times per year for 47 terrestrial sites in 20 eco-climatic domains, producing one of the most extensive standardized sampling efforts for soil microbial biodiversity to date. Here, we introduce the neonMicrobe R package—a suite of downloading, preprocessing, data set assembly, and sensitivity analysis tools for NEON’s newly published 16S and ITS amplicon sequencing data products which characterize soil bacterial and fungal communities, respectively. neonMicrobe is designed to make these data more accessible to ecologists without assuming prior experience with bioinformatic pipelines. We describe quality control steps used to remove quality-flagged samples, report on sensitivity analyses used to determine appropriate quality filtering parameters for the DADA2 workflow, and demonstrate the immediate usability of the output data by conducting standard analyses of soil microbial diversity. The sequence abundance tables produced by neonMicrobe can be linked to NEON’s other data products (e.g., soil physical and chemical properties, plant community composition) and soil subsamples archived in the NEON Biorepository. We provide recommendations for incorporating neonMicrobe into reproducible scientific workflows, discuss technical considerations for large-scale amplicon sequence analysis, and outline future directions for NEON-enabled microbial ecology. 
In particular, we believe that NEON marker gene sequence data will allow researchers to answer outstanding questions about the spatial and temporal dynamics of soil microbial communities while explicitly accounting for scale dependence. We expect that the data produced by NEON and the neonMicrobe R package will act as a valuable ecological baseline to inform and contextualize future experimental and modeling endeavors.
Introduction
Microbial life on earth is ubiquitous and essential in critical ecosystem processes (Cavicchioli et al. 2019). Soils are among the most diverse microbial habitats known, and recent surveys at continental (Fierer et al. 2012, Ladau et al. 2013, Talbot et al. 2014, Prober et al. 2015, Thompson et al. 2017, Wang et al. 2018) and global scales (Serna-Chavez et al. 2013, Thompson et al. 2017, Chu et al. 2020) have shed light on the diversity and distribution of soil microbes. Such large-scale studies often identify abiotic environmental factors, such as climate and edaphic characteristics, to be strong predictors of soil microbial community composition. For example, soil fungal richness is strongly determined by climate (Tedersoo et al. 2014, Větrovský et al. 2019, Steidinger et al. 2020), soil protist composition by annual precipitation (Oliverio et al. 2020), and bacterial composition and function by edaphic characteristics such as pH and soil carbon (Lauber et al. 2009, Delgado-Baquerizo et al. 2016). Enough data have accumulated that global meta-analyses of environmental controls of microbial biogeography have recently emerged (Větrovský et al. 2019), and we are just beginning to discern the influence of biotic interactions on microbial diversity and distribution (Bahram et al. 2018, Steidinger et al. 2019).
While molecular microbial surveys have contributed much to our understanding of microbial ecology, they have also highlighted unique problems. One conceptual challenge in microbial ecology is scale. Classical ecological theories and sampling techniques were developed with macroorganisms in mind and may not apply well to microbes (Levin 1992). Thus, drivers of microbial diversity and distribution are highly scale-dependent (Martiny et al. 2011, Peay et al. 2016). Sampling techniques that preserve sample scale dependency are important to broaden our understanding of microbial community ecology. For example, positive and negative interactions in microbial community assembly are predicted to be strongest at distinct spatial scales (Mod et al. 2020). However, spatially explicit tests of assembly rules for microbial groups are lacking (Talbot et al. 2014, Maynard et al. 2017), and cross-study comparisons are stymied by widely varying sample granularity and survey extent (Zinger et al. 2019). In addition to variation in sampling scale, the differences between protocols of sequence-based microbial ecological research can make the interpretation of meta-analysis challenging. For example, differences in sample collection, such as soil core size, storage method, DNA extraction, sequencing, and bioinformatic approaches, can all have oversized effects on our final understanding of microbial abundance, diversity, and distribution (Lindahl et al. 2013, Pauvert et al. 2019). Thus, laboratory and bioinformatic standardization must complement field sampling designs to truly empower ecological inferences.
The National Ecological Observatory Network (NEON) is a multi-scale ecological observation platform spanning the United States for understanding and forecasting the impacts of climate change, land-use change, and invasive species on ecosystems (Schimel and Keller 2015). NEON is designed to enable users, including scientists, educators, policymakers, and the general public, to assess large-scale and long-term ecological changes and address their major drivers. The NEON Terrestrial Observation System monitors environmental drivers and key taxonomic groups at multiple trophic levels in order to quantify the responses of biodiversity and biogeochemical cycles to climate and land-use changes. A component of this data collection program, the NEON Microbial Ecology Sampling Program, measures the diversity and abundances of microbiota and archives raw samples and DNA for public research use. Sampling and analysis for microbes are performed using standardized and freely available methods (Stanish et al. 2018) that help eliminate confounding factors in cross-site or cross-study analyses. The data collected by NEON are processed into documented, calibrated, and quality-controlled data products and are openly available through the NEON Data Portal and API (Application Programming Interface).
Given the high diversity of microbial communities relative to most macroorganisms, NEON microbial data also present unique challenges for the end user. For those new to “big data,” accessing the NEON API may seem unintuitive: Acquiring metadata entails downloading data from multiple NEON products and preprocessing metadata through table joins before analysis. For high-throughput microbial marker gene (amplicon) sequencing data, files are relatively large and cannot be readily visualized, and often advanced bioinformatic tools and computing capabilities are needed to analyze the data. Our goal is to lower the barriers of entry to utilizing NEON microbial data. We provide a data processing pipeline and software package to wrangle NEON soil microbial community data and promote its wider accessibility and use in a standardized manner, thereby maximizing its potential for developing ecological insights.
In this paper, we introduce neonMicrobe, a novel, quality-tested data pipeline that standardizes the processing of NEON soil bacterial and fungal amplicon sequence data into abundance tables for microbial ecology research. While the current scope of this paper and the accompanying neonMicrobe R package is on soil microbes and the soil environment, our package provides the scaffolding for the analysis of surface water and benthic microbe marker gene sequence data. Our pipeline builds on existing validated protocols (Tedersoo et al. 2015b, Callahan et al. 2016, Lunch et al. 2021) to create a reproducible way to download, quality control, and process sequence data into sequence tables all within the R statistical computing environment (R Core Team 2021). Acknowledging the complexities behind selecting appropriate parameters for various stages of the pipeline, we use a sensitivity analysis to demonstrate how changes in read quality filtering parameters may affect downstream ecological inferences. We conclude with lessons learned related to the provisioning of microbial data by NEON, as well as the use of NEON data products by researchers to generate new insights about microbial ecology.
NEON Soil Microbe Marker Gene Sequence Data Products
This paper utilizes soil microbial 16S and ITS sequence data from NEON (NEON DP1.10108.001), which primarily target bacteria and fungi, respectively. A full description of the sampling design and analysis methods can be found in the documentation available through the NEON Data Portal Web site (data.neonscience.org).
Sampling design
The NEON domain encompasses 47 terrestrial field sites across the United States including Puerto Rico, covering 20 eco-climatic domains as defined by NEON. Sites are strategically located in ecosystems across the United States so that site-level measurements can be used to extrapolate across the continent (Barnett et al. 2017). Each terrestrial field site contains 10 plots for soil microbial sampling: four tower plots within the airshed of an instrumentation tower and six distributed plots that are designed to be spatially balanced while reflecting the dominant vegetation type at the site (Stanish et al. 2018). Each plot is divided into four subplots, three of which are randomly chosen for sampling. Random coordinates with a 1-m buffer zone are generated for each subplot. Up to three soil cores may be collected within 0.5 m of each set of coordinates and combined to provide a sufficient sample volume for downstream processing and analyses.
Sampling events are broadly designed to capture periods when microbial activity is expected to be at its highest, or when activity may be rapidly changing, such as the transition from the dry to wet seasons or during spring soil thawing. Sampling does not occur when the ground is frozen or covered in snow due to logistical and safety concerns, which may miss critical periods in snow-covered ecosystems (Schadt et al. 2003). Most sites are sampled for microbial community characterization three times per year, with one event corresponding to peak plant productivity as measured using remote sensing data (Stanish et al. 2018, Stanish and Parker 2019). In sites where the activity is more strongly driven by precipitation than temperature, historical precipitation data are used to determine sampling periods.
Soil cores are collected down to a maximum depth of 30 cm. If an organic horizon is present, it is collected separately from the mineral horizon. The cores are co-located with other critical soil physical and biogeochemical measurements, including litter depth, temperature, moisture, pH, and nutrients. The metadata associated with NEON samples exceed the minimum standards defined by the Genomics Standards Consortium (Yilmaz et al. 2011). Once collected, the cores are separated by horizon, homogenized, and subsampled for microbial and chemical analyses. The microbial samples are frozen in the field and shipped to analytical facilities for DNA sequencing analysis.
Molecular methods
Sample processing and analyses are performed using standardized methodologies to the extent possible to ensure comparability of data over time. However, methodologies and technologies will change and improve, and adapting to changes over time is critical. For full transparency, any changes in laboratory methods are captured in the freely available external laboratory standard operating procedures (SOPs), which are listed in the metadata for every downloaded sequence data set.
The processing methods for generating 16S and ITS sequence data used in this analysis are detailed in the 16S and ITS Sequencing Standard Operating Procedure (Battelle Memorial Institute 2018). Genomic DNA from thawed soil samples is extracted using Qiagen DNeasy Powersoil HTP 96 Kits and quantified with QuantiFluor ONE dsDNA Kits. The marker genes targeted are the V3–V4 regions of the 16S ribosomal RNA (rRNA) gene for bacteria and archaea (primers Pro341F and Pro805R; Takahashi et al. 2014) and the internal transcribed spacer (ITS) region of the rRNA operon for fungal identification (primers ITS1f and ITS2; Walters et al. 2015). Additional details on PCR processing and quality assurance can be found in the associated laboratory SOP (Battelle Memorial Institute 2019) and in the marker gene sequencing data product tables “mmg_soilPcrAmplification_16s” and “mmg_soilPcrAmplification_ITS” (NEON DP1.10108.001). All sequencing runs are performed on an Illumina MiSeq v3 600-cycle cartridge as 300-bp paired-end reads, resulting in one set of forward and reverse sequencing reads for each sample. A sequencing run usually consists of a library of samples pooled from multiple sites and collection dates, as well as DNA extraction and PCR controls (Battelle Memorial Institute 2018). At the time of this study, only samples with a minimum of 3000 reads and post-trimming mean quality score of 20 pass the quality filter for the NEON Data Portal (Battelle Memorial Institute 2018).
The neonMicrobe Marker Gene Sequence Processing Pipeline
The neonMicrobe R package promotes data accessibility by allowing users, especially those lacking extensive bioinformatic experience, to wrangle NEON’s soil microbe marker gene data products with greater ease and reproducibility. The pipeline begins by downloading NEON marker gene sequence data from the NEON Data Portal and produces amplicon sequence variant (ASV) abundance tables linked to associated taxonomic and soil abiotic data in a Phyloseq data structure. The pipeline, which builds on existing validated pipelines (Lindahl et al. 2013), explicitly considers the unique properties of NEON data with the goal of maximizing ecological insight of microbial communities. While there is no consensus on an optimal bioinformatic processing method for microbial amplicon sequence data (Pauvert et al. 2019), bioinformatic choices are consequential to downstream ecological analysis (Tedersoo et al. 2015a, Tedersoo et al. 2015b). We preferred bioinformatic approaches that would allow comparability across disparate data sets and require a relatively low level of programming knowledge for the end user (Callahan et al. 2016, 2017). These criteria align with a major goal of NEON: to use standard methods across data sets from large temporal and spatial scales, enabling many different studies to answer myriad questions.
The neonMicrobe R package includes functions for downloading, renaming, and subsetting NEON sequencing data, as well as custom wrappers for the DADA2 algorithms (Callahan et al. 2016, 2017). Briefly, DADA2 allows analysis of microbial taxa at the resolution of exact ASVs, in contrast to the more traditional use of operational taxonomic units (OTUs) that are based on a user-defined nucleotide sequence similarity (e.g., 97–98.5% pairwise sequence identity). In addition to the biological benefits of finer DNA sequence resolution by employing ASVs, exact sequences are highly advantageous over OTUs because they can be directly compared across data sets, making them critical to NEON’s coordinated network sampling design (Callahan et al. 2017). Another benefit is that the DADA2 pipeline allows sequence processing and data analysis steps to be conducted in the R statistical computing environment (R Core Team 2021), which enhances reproducibility and lowers barriers to entry for those who are less familiar with command-line bioinformatic tools. DADA2 has also demonstrated its compatibility with other methods and platforms in biological interpretation related to the assembly of paired-end reads, the treatment of chimeras, and the final filtering of the ASV tables (Pauvert et al. 2019). Our processing pipeline creates a ready-to-use soil microbial data set of unprecedented spatiotemporal range and taxonomic resolution. In the following subsections, we describe our processing pipeline. Each of the following subsections has a corresponding vignette in the neonMicrobe R package (Fig. 1), which can be accessed at https://github.com/claraqin/neonMicrobe.

Downloading and quality-controlling NEON soil microbe marker gene sequence data
The neonMicrobe data processing pipeline begins by leveraging the NEON Data API via the neonUtilities R package (Lunch et al. 2021) to acquire soil microbe marker gene sequencing data. First, the downloadSequenceMetadata function downloads and joins the tables within NEON data product DP1.10108.001 (Soil microbe marker gene sequences), which includes information about DNA extraction, PCR amplification, marker gene sequencing, and sequence file metadata. Because the output of this function contains information about sample processing but does not include the raw sequence files themselves, we refer to this output as sequence metadata. downloadSequenceMetadata can be parameterized to download metadata for a subset of the sequence data according to a specific date range, site range, sequencing run, or target gene (16S or ITS). Metadata can be further filtered to remove records that include quality flags or fail certain quality tests, as described in greater detail below. These steps take place before the user downloads the raw sequence files, saving processing time and disk space. Next, the downloadRawSequenceData function references the metadata to download the desired raw sequence files. By default, NEON data will be organized into the directory structure illustrated in Fig. 2. These functions are implemented in the vignette “Download NEON Data.”

As with any analysis, ensuring that the downloaded data are high-quality, correctly formatted, and directly comparable is a critical data processing step. Performing quality control steps prior to entering the data analysis workflow can reduce downstream processing errors due to incomplete or improperly formatted data and improves efficiency by conserving CPU time that would otherwise be spent processing low-quality data that may ultimately be discarded. The NEON microbial data products contain data quality flags in which known quality issues are reported. In addition to quality issues, the sequence metadata contain other crucial details, some of which can significantly affect the comparability of sequencing runs, such as specific laboratory protocols, oligonucleotide primer sets, and sequencing platforms. We strongly recommend that users review the sequence metadata and consider whether additional data filtering should be performed based on the research needs and data stringency requirements.
We have implemented a number of basic quality control steps in the function qcMetadata. In this function, users can opt to (1) remove samples that are flagged as having low read quality or being legacy data, (2) check for and remove duplicate samples, and (3) prepare for a paired-reads analysis by removing samples for which only one read orientation is available.
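The logic of these three optional checks can be illustrated with a small, language-agnostic sketch. The Python below is not the qcMetadata implementation: the function name `qc_metadata` and the record fields (`sample`, `direction`, `flag`) are hypothetical stand-ins for columns of the sequence metadata, and keeping the first of several duplicate records is an assumption made only for illustration.

```python
from collections import defaultdict

def qc_metadata(records, drop_flagged=True, drop_duplicates=True, require_pairs=True):
    """Illustrative metadata filter; all field names are hypothetical."""
    kept = list(records)
    if drop_flagged:
        # (1) Drop records carrying a quality flag (e.g., low read quality, legacy data).
        kept = [r for r in kept if not r["flag"]]
    if drop_duplicates:
        # (2) Keep only the first record for each (sample, read direction) pair.
        seen, unique = set(), []
        for r in kept:
            key = (r["sample"], r["direction"])
            if key not in seen:
                seen.add(key)
                unique.append(r)
        kept = unique
    if require_pairs:
        # (3) Keep only samples that have both forward (R1) and reverse (R2) reads.
        directions = defaultdict(set)
        for r in kept:
            directions[r["sample"]].add(r["direction"])
        kept = [r for r in kept if directions[r["sample"]] >= {"R1", "R2"}]
    return kept

records = [
    {"sample": "A", "direction": "R1", "flag": None},
    {"sample": "A", "direction": "R2", "flag": None},
    {"sample": "A", "direction": "R2", "flag": None},      # duplicate record
    {"sample": "B", "direction": "R1", "flag": None},      # no reverse read
    {"sample": "C", "direction": "R1", "flag": "legacy"},  # flagged
    {"sample": "C", "direction": "R2", "flag": "legacy"},  # flagged
]
clean = qc_metadata(records)
```

In this toy input only sample A survives all three checks. Because the checks run on metadata alone, they can be applied before any raw sequence files are downloaded.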
Generating sequence tables and taxonomy tables using DADA2
16S and ITS sequences are processed by different variations of the DADA2 workflow. Processing is done on a sequencing run basis to allow for variable error rates between sequencing runs to optimize amplicon sequence variant (ASV) calling and chimera detection. For each sequencing run, the generalized steps are as follows: (1) filtering samples to remove all reads containing ambiguous (“N”) base calls; (2) removing PCR primers, using Cutadapt (Martin 2011) for ITS sequences but not for 16S sequences; (3) truncating (for 16S reads only) and filtering reads to ensure a minimum quality score; (4) building an error model for each sequencing run to describe the probability that a given read was produced from a given sample sequence; (5) denoising reads into ASVs through the DADA divisive partitioning algorithm based on an underlying nucleotide sequence error rate model; (6) optionally, removing chimeric sequences; and (7) joining sequence tables across all sequencing runs using the DADA2 function mergeSequenceTables, which performs a simple merge, and the DADA2 function collapseNoMismatch, which performs 100% clustering on the ASVs. (Note that Cutadapt is not supported on Windows computers. For Windows users, we recommend running the ITS pipeline on another computer, or in a Docker container, as outlined in the section “Extending Scientific Workflow Reproducibility with Container Technology.”) As an alternative to Step 7 for ITS sequences, it may be prudent to cluster ASVs to a lower, user-specified sequence similarity threshold (e.g., 97–98.5%) using the VSEARCH or DECIPHER programs (Rognes et al. 2016, Wright 2016), because the same ITS ASV may have different length variants across different sequencing runs. This step is not needed for 16S reads, for which 100%-similar ASVs can instead be combined using the collapseNoMismatch command in DADA2. Finally, a taxonomic reference database can be used to assign taxonomy to the ASVs.
Many of these processing steps are wrapped into the novel functions trimPrimers16S, qualityFilter16S, and runDada16S, and their ITS-specific analogues.
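Because ASVs are exact sequences, joining sequence tables from separate sequencing runs (Step 7) amounts to aligning columns on the sequence strings themselves. The sketch below illustrates that idea in plain Python. It is not the DADA2 mergeSequenceTables function: the function name `merge_sequence_tables`, the nested-dictionary table layout, and the rule that a sample may appear in only one run are assumptions made for illustration, and it does not perform the additional 100% clustering done by collapseNoMismatch.

```python
def merge_sequence_tables(*tables):
    """Merge per-run tables of the form {sample: {asv_sequence: count}}.

    Assumes, as a simplification, that each sample appears in only one run.
    """
    merged = {}
    for table in tables:
        for sample, counts in table.items():
            if sample in merged:
                raise ValueError(f"sample {sample!r} appears in more than one run")
            merged[sample] = dict(counts)
    # Every sample gets an entry for every ASV seen in any run (zero if absent).
    all_asvs = sorted({asv for counts in merged.values() for asv in counts})
    return {s: {asv: c.get(asv, 0) for asv in all_asvs} for s, c in merged.items()}

# Toy tables from two sequencing runs; "ACGT" is detected in both.
run1 = {"s1": {"ACGT": 5, "ACGA": 2}}
run2 = {"s2": {"ACGT": 3, "TTGA": 7}}
combined = merge_sequence_tables(run1, run2)
```

The shared sequence "ACGT" ends up in a single column without any re-clustering, which is the property that makes ASVs directly comparable across runs.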
Linking sequence data and soil abiotic data in a Phyloseq object
By taking advantage of different NEON data products, users can draw inferences regarding the relationships between soil microbial community characteristics, soil physical and chemical properties, climate variables, and other spatiotemporal processes. As demonstrated in the “Add Environmental Variables” vignettes, the end product of the neonMicrobe pipeline is a Phyloseq object linking the ASV table, its (optional) taxonomy table, and associated soil abiotic data (NEON DP1.10086.001) downloaded using the downloadSoilData function, creating a data structure that is ready for ecological analysis (McMurdie and Holmes 2013). While an overview of statistical microbial community analysis is beyond the scope of this paper, excellent reviews of the subject (Hugerth and Andersson 2017) and tutorials using Phyloseq are widely available (McMurdie and Holmes 2013) (https://www.bioconductor.org/packages/release/bioc/vignettes/phyloseq/inst/doc/phyloseq-analysis.html).
Example: Analysis of soil bacterial diversity in grasslands
The processed data are immediately usable in analyses to answer ecological questions. To demonstrate this, we present a relatively simple analysis of soil bacterial diversity using NEON 16S sequence data that has been processed and assembled by neonMicrobe (Fig. 3). The code for this analysis is available as Data S2.

In this example analysis, we asked, what controls soil bacterial communities within and across sites in a grassland ecosystem? We included three sites—Central Plains Experimental Range (CPER), Konza Prairie Biological Station (KONZ), and Northern Great Plains Research Laboratory (NOGP)—as these sites share Argiustoll soils, but vary in climate and belong to different NEON eco-climatic domains (Fig. 3a). We used soil samples that were collected at peak plant productivity in 2017 (n = 86 samples), in order to minimize temporal effects. Across these sites, we examined the effects of soil pH and soil moisture on soil bacterial composition, as these were previously found to explain substantial variation in soil bacterial community composition across NEON sites (Docherty et al. 2015). We additionally included mean annual temperature (MAT) and mean annual precipitation (MAP) as climatic covariates. We used the adonis2 function in the vegan R package (Oksanen et al. 2020) to conduct permutational analysis of variance (PERMANOVA). We found that significant drivers of bacterial community composition in these grassland sites included soil pH (PERMANOVA, P < 0.001), MAT (P < 0.001), and MAP (P < 0.001), while the effect of soil moisture was not significant (P = 0.089). Our results suggest that climatic variables drive between-site variation, while soil pH drives within-site variation, in soil bacterial community composition across grasslands (Fig. 3d).
Extending Scientific Workflow Reproducibility with Container Technology
Reproducibility is a major principle of the scientific approach and is critical for successful application of the bioinformatic pipeline. One challenge that hinders reproducibility in computational tools is the staggeringly large number of possible combinations of operating systems, programming languages, and package versions a user may have installed locally, which allows variability to creep into analysis pipelines. Furthermore, there is a significant time and energy investment required to install dependencies, check operating system compatibilities, and implement code at a large scale. To minimize this cognitive load, efforts such as the Open Container Initiative started by Docker (https://opencontainers.org/) enable the deployment of discretized applications to cloud computing infrastructure. This container paradigm extends into bioinformatic tools through the BioContainers initiative (da Veiga Leprevost et al. 2017). The offering of these computational biology tools as containers allows users to move away from user-specific workflow generation on high-performance computing (HPC) systems and into cloud native scientific computing.
Due to the network of data products, supporting R packages, bioinformatic tools, computational resources, and operating system compatibility requirements associated with neonMicrobe, the neonMicrobe R package cannot encapsulate a reproducible scientific workflow on its own (Boettiger 2015). To extend its reproducibility, two Docker container images were created for neonMicrobe. First, an RStudio Server instance was created from the Rocker Group’s RStudio Server tidyverse base image (Nüst et al. 2020). This RStudio image is freely available on the CyVerse Docker Hub (https://hub.docker.com/repository/docker/cyversevice/rstudio-neon-dada2), as well as through the CyVerse Discovery Environment’s (DE) Visual Interactive Computing Environment (VICE) as the “rstudio_neon_microbiome” application. The CyVerse DE allows users to interact with data and Docker containers on VICE without explicitly requiring mastery of Docker in the command line. The second Docker container image of neonMicrobe is strictly command-line based and designed for scaling to larger cloud systems; it is also available on Docker Hub (https://hub.docker.com/r/rbartelme/neonmicrobe). Therefore, users may utilize either of these containers on their local systems, which both increases access to the tools and creates a more easily reproduced environment in which to conduct microbial ecology analyses.
Sensitivity Analysis of Quality Filtering Parameters
The choice of bioinformatic software and processing parameters can have implications for the accuracy of the inferred microbial community (Pauvert et al. 2019, Prodan et al. 2020). While we make some recommendations for processing the NEON marker gene sequences, such as the use of DADA2 over OTU-based processing pipelines, we leave other decisions to the researcher depending on their research needs and computing capacity (Appendix S1: Fig. S1). These decisions include but are not limited to: the removal or retention of reverse reads; the choice of parameters for the quality filter; and the choice of partitioning, alignment, and sequence comparison heuristics for DADA2.
Exploring all combinations of these decision points to arrive at an optimal processing pipeline is beyond the scope of this paper. We expect that the combination of choices that creates the most accurate representation of the NEON soil microbial communities will change depending on the specific set of samples being processed or the metrics of interest to the researcher. However, for any given instance of the NEON marker gene sequence data, it should be possible to evaluate how the choice of processing parameters influences some benchmark metrics related to the pipeline outputs. Here, we provide a framework for conducting a sensitivity analysis on the processing pipeline, using the quality filtering parameters for 16S amplicons as an example.
We investigated how the choice of parameters for the quality filter—which truncates or removes low-quality reads—would influence our downstream ecological inference. This represents one of the first such sensitivity analyses to compare multiple sequencing runs and bioinformatic platforms in an ecologically robust manner. To assess parameter sensitivity, we considered the effects of quality filtering parameters on the following outcomes: (1) number of reads remaining at each step of the pipeline, (2) estimated alpha diversity, and (3) estimated beta-diversity, using a subset of 16S sequences as a test case.
The two quality filtering parameters varied in the sensitivity analysis both apply to reverse reads:
- truncLenR: Reverse reads that do not meet or exceed truncLenR in length will be discarded. Reverse reads that exceed truncLenR will be truncated to truncLenR.
- maxEER: After truncation, reverse reads with higher than maxEER expected errors will be discarded. Expected errors are calculated from the nominal definition of the quality score: $\mathrm{EE} = \sum_{l=1}^{L} 10^{-Q_l/10}$, where $Q_l$ is the quality score at base position $l$, and $l$ is a base position index extending to the length of the sequence, $L$.
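To make the two parameters concrete, the following minimal Python sketch applies them to a single read's quality scores. It is not the DADA2 implementation; the function names are hypothetical, and expected errors are computed by summing the nominal per-base error probability, 10^(-Q/10), over all base positions of the (truncated) read.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities, 10^(-Q/10), for Phred quality scores."""
    return sum(10 ** (-q / 10) for q in quals)

def filter_reverse_read(quals, trunc_len_r, max_ee_r):
    """Return the truncated quality scores, or None if the read is discarded."""
    if len(quals) < trunc_len_r:
        return None                      # shorter than truncLenR: discard
    truncated = quals[:trunc_len_r]      # longer reads are cut to truncLenR
    if expected_errors(truncated) > max_ee_r:
        return None                      # more than maxEER expected errors: discard
    return truncated

# A toy 300-bp read whose quality drops from Q30 to Q10 over its final 50 bases.
read = [30] * 250 + [10] * 50

kept_short = filter_reverse_read(read, trunc_len_r=220, max_ee_r=4)
kept_full = filter_reverse_read(read, trunc_len_r=300, max_ee_r=4)
```

In this toy case the untruncated read accumulates most of its expected errors in the low-quality tail, so it fails a maxEER of 4, whereas truncating at 220 bp removes the tail and the read is retained. This is why the two parameters interact: a shorter truncLenR lowers expected errors but leaves less overlap for merging read pairs.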
The code used to conduct this sensitivity analysis is available in the Supporting Information (Data S3). In summary, the sensitivity analysis evaluates variation in our benchmark metrics with respect to variation in parameter values. It does this by randomly selecting 10 samples from each of the 20 available 16S sequencing runs as of January 2021 and processing these samples on a sequencing run basis through the 16S pipeline under a variety of quality filtering parameter combinations. The truncLenR parameter was assigned values of 170, 220, and 250 base pairs (bp), representing short, medium, and long truncation lengths for the 2 × 300 bp reads produced by Illumina MiSeq. The maxEER parameter was assigned values of 4, 8, and 16 maximum allowable expected errors as invoked in the core DADA2 algorithm; the preferred values may vary substantially between sampling locations, soil types, and laboratory protocols, so these values were intended to cover a wide range of desirable values. Together, these values resulted in nine parameter combinations. The following processing decisions were held constant over all pipeline iterations: Minimum required length of reads after trimming and truncating (minLen) was set to 50 bp; forward reads were processed with truncation length of 240 bp (truncLenF) and a maximum of eight allowable expected errors (maxEEF); all other quality filtering parameters were set to their default values; and all sequence alignment heuristics for DADA2 were set to their default values.
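The design described above is a full factorial grid over two parameters. The short Python sketch below enumerates it using the values given in the text; the variable names and the dictionary layout are illustrative only and are not taken from Data S3.

```python
from itertools import product

trunc_len_r_values = [170, 220, 250]   # bp, from the text
max_ee_r_values = [4, 8, 16]           # maximum expected errors, from the text

# Parameters held constant across all pipeline iterations (from the text).
fixed = {"minLen": 50, "truncLenF": 240, "maxEEF": 8}

grid = [
    {"truncLenR": t, "maxEER": e, **fixed}
    for t, e in product(trunc_len_r_values, max_ee_r_values)
]

n_samples = 10 * 20                    # 10 samples from each of 20 sequencing runs
n_processed = n_samples * len(grid)    # sample-by-parameter-combination versions
```

The grid has nine entries, and processing 200 samples under each yields 1,800 sample versions, which is consistent with the 1,799 total degrees of freedom (1 + 1 + 19 + 19 + 19 + 1740) in the ANOVA tables.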
Alpha diversity was calculated using the Phyloseq function estimate_richness (McMurdie and Holmes 2013) for Shannon diversity and observed richness. Beta diversity was assessed by joining sequencing tables from across all parameter combinations into one combined sequence table without collapsing sequence-length variants and calculating the pairwise Bray-Curtis distance between all versions of all samples. Samples with a sequencing depth below 1000 were removed prior to ordination and permutational analysis of variance (PERMANOVA). PERMANOVA was conducted via the adonis2 function in the vegan R package (Oksanen et al. 2020).
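For readers unfamiliar with these metrics, the sketch below writes out their standard definitions in plain Python. The analysis itself used Phyloseq and vegan, so this code is only an illustration; the function names and the toy count vectors are hypothetical.

```python
import math

def shannon(counts):
    """Shannon diversity H' = -sum(p * ln p) over ASVs with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def observed_richness(counts):
    """Number of ASVs with at least one read."""
    return sum(1 for c in counts if c > 0)

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two count vectors over the same ASVs."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))

# Two versions of one sample (same ASV order) under different filter settings.
version_a = [50, 30, 20, 0]
version_b = [40, 30, 20, 10]

richness_a = observed_richness(version_a)
richness_b = observed_richness(version_b)
distance = bray_curtis(version_a, version_b)
```

Comparing versions of the same sample in this way separates variation caused by parameter choice from variation among samples: if the filtering parameters had no effect, all versions of a sample would have a pairwise Bray-Curtis distance of zero.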
Sensitivity analysis results
The 200 selected samples represented 37 terrestrial NEON sites and were collected between May 2014 and November 2018. Sequence read retention throughout the processing pipeline varied across both parameters, though the degree to which they varied depended on the sequencing run. As expected, higher values of maxEER resulted in greater read retention at the quality filtering step (Fig. 4). Differences in read retention between sequencing runs could be explained by differences in the quality scores of the reads from each sequencing run. For example, sequencing run BDNB6, whose read retention is relatively sensitive to maxEER, accumulates more expected errors across its read length than sequencing run BFDG8, whose read retention is relatively insensitive to maxEER (Appendix S1: Fig. S2). Read retention was relatively insensitive to truncLenR except when reverse reads were truncated to 170 bp; this caused a large drop-off in the pair-merging step, likely due to insufficient overlap between forward and reverse reads (Fig. 4). Overall, we found that a moderate value for truncLenR (220 bp) led to the highest rates of read retention.

The alpha diversity metrics used in the sensitivity analysis were ASV Shannon diversity and observed ASV richness. Shannon diversity (ANOVA, P < 0.001; Table 1) and observed richness (P < 0.001; Table 2) were both sensitive to variation in truncLenR, and the effect of truncLenR varied across sequencing runs (P < 0.001; Table 1; Table 2; diagnostic plots for ANOVA in Appendix S1: Figs. S3, S4). Consistent with our finding that read retention was highest for moderate values of truncLenR, we also found the highest estimates of Shannon diversity and observed richness when truncLenR was 220 bp (Fig. 5). In contrast, maxEER had no significant effects on Shannon diversity (P = 0.242; Fig. 5; Table 1) or observed richness (P = 0.151; Appendix S1: Fig. S5; Table 2). Although maxEER does affect read retention at the quality filtering stage for some sequencing runs, our results suggest that for the NEON 16S sequences in general, differences in maxEER have relatively inconsequential effects on estimates of soil microbial alpha diversity. However, researchers who extend this pipeline to other data sets, such as the NEON ITS sequences, should conduct a similar sensitivity analysis before proceeding to make ecological inferences about the processed data.
Table 1. ANOVA of Shannon diversity with respect to quality filtering parameters and sequencing run.

Covariate | df | SS | Mean Sq | F | P(>F) |
---|---|---|---|---|---|
truncLenR | 1 | 81.616 | 81.616 | 403.053 | <0.001 |
maxEER | 1 | 0.277 | 0.277 | 1.369 | 0.242 |
runID | 19 | 321.151 | 16.903 | 83.473 | <0.001 |
truncLenR × runID | 19 | 36.442 | 1.918 | 9.472 | <0.001 |
maxEER × runID | 19 | 0.132 | 0.007 | 0.034 | 1.000 |
Residuals | 1740 | 352.339 | 0.202 | | |

Table 2. ANOVA of observed richness with respect to quality filtering parameters and sequencing run.

Covariate | df | SS | Mean Sq | F | P(>F) |
---|---|---|---|---|---|
truncLenR | 1 | 14.598 | 14.598 | 66.805 | <0.001 |
maxEER | 1 | 0.452 | 0.452 | 2.069 | 0.151 |
runID | 19 | 485.837 | 25.570 | 117.020 | <0.001 |
truncLenR × runID | 19 | 48.997 | 2.579 | 11.802 | <0.001 |
maxEER × runID | 19 | 0.270 | 0.014 | 0.065 | 1.000 |
Residuals | 1740 | 380.213 | 0.219 | | |
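As a reading aid for the ANOVA tables, the F statistics can be recovered from the degrees of freedom and sums of squares: each mean square is SS/df, and each F is the term's mean square divided by the residual mean square. The Python sketch below recomputes the F column for Shannon diversity from the tabulated values; small discrepancies are expected because the published sums of squares are rounded.

```python
# Terms of the Shannon diversity ANOVA (Table 1): name -> (df, sum of squares).
TABLE_1 = {
    "truncLenR": (1, 81.616),
    "maxEER": (1, 0.277),
    "runID": (19, 321.151),
    "truncLenR x runID": (19, 36.442),
    "maxEER x runID": (19, 0.132),
}
RESIDUAL = (1740, 352.339)  # (df, sum of squares)


def f_statistics(terms, residual):
    """Return F = (SS_term / df_term) / (SS_residual / df_residual) for each term."""
    res_df, res_ss = residual
    res_ms = res_ss / res_df
    return {name: (ss / df) / res_ms for name, (df, ss) in terms.items()}


f_values = f_statistics(TABLE_1, RESIDUAL)
for name, f in f_values.items():
    print(f"{name}: F = {f:.3f}")

# Total degrees of freedom across all terms plus residuals.
total_df = sum(df for df, _ in TABLE_1.values()) + RESIDUAL[0]
print("total df:", total_df)
```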

Variation in the parameters resulted in a small degree of variation in inferred community composition. truncLenR had a significant effect on Bray-Curtis dissimilarity (PERMANOVA with 999 permutations, P < 0.001, R2 = 0.012), while maxEER had no significant effect (P = 1.000, R2 = 3 × 10−5; Table 3). Upon inspection, setting truncLenR = 170 produced communities with significantly less group dispersion (variance) than at higher values of truncLenR (betadisper multivariate test for homogeneity of group dispersions in the vegan R package, P < 0.001; Appendix S1: Fig. S6). To confirm that the significant effect of truncLenR in the PERMANOVA analysis was attributable to differences in group means rather than differences in group dispersions, PERMANOVA was repeated on the data set after removal of communities produced with truncLenR = 170. Within this subset, the sensitivities of Bray-Curtis dissimilarity to the quality filtering parameters remained largely the same: truncLenR had a significant effect (PERMANOVA with 999 permutations, P < 0.001, R2 = 6.8 × 10−4) while maxEER did not (P = 0.697, R2 = 5 × 10−5; Appendix S2: Table S1). Since group dispersion did not vary significantly between the remaining values of truncLenR in the subset (betadisper, P = 0.388), the results of this repeated analysis confirm that community composition of the NEON 16S sequences is sensitive to truncLenR. Nevertheless, the amount of variation explained by truncLenR (R2 = 0.012) is small compared with that explained by soil sample ID (R2 = 0.771; Fig. 6, Appendix S2: Table S2), suggesting that variation in quality filtering parameters is unlikely to obscure real between-sample variation in community composition.
Table 3. PERMANOVA of Bray-Curtis dissimilarity with respect to quality filtering parameters.

Covariate | df | SS | R2 | F | P(>F) |
---|---|---|---|---|---|
truncLenR | 1 | 9.61 | 0.01235 | 21.373 | <0.001 |
maxEER | 1 | 0.02 | 0.00003 | 0.050 | 1.000 |
Residuals | 1709 | 768.75 | 0.98762 | | |
Notes
- Communities were permuted 999 times within sample IDs; that is, each community was compared against other communities produced from the same sample. For a PERMANOVA analysis that includes sample ID as a permuted variable, see Appendix S2: Table S2.
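The R2 and pseudo-F columns of the PERMANOVA table follow directly from the sums of squares: R2 is each term's share of the total sum of squares, and pseudo-F is the term's mean square over the residual mean square. The Python sketch below recomputes both from the tabulated values as a consistency check. It does not reproduce the permutation test itself, and minor differences from the published values reflect rounding of the sums of squares.

```python
# PERMANOVA terms (Table 3): name -> (df, sum of squares).
TABLE_3 = {
    "truncLenR": (1, 9.61),
    "maxEER": (1, 0.02),
    "Residuals": (1709, 768.75),
}


def r_squared(terms):
    """R2 for each term: its sum of squares divided by the total sum of squares."""
    total_ss = sum(ss for _, ss in terms.values())
    return {name: ss / total_ss for name, (_, ss) in terms.items()}


def pseudo_f(terms, residual_name="Residuals"):
    """Pseudo-F for each non-residual term: (SS / df) over the residual mean square."""
    res_df, res_ss = terms[residual_name]
    res_ms = res_ss / res_df
    return {
        name: (ss / df) / res_ms
        for name, (df, ss) in terms.items()
        if name != residual_name
    }


r2 = r_squared(TABLE_3)
f_values = pseudo_f(TABLE_3)
for name in TABLE_3:
    f_text = f"{f_values[name]:.3f}" if name in f_values else "NA"
    print(f"{name}: R2 = {r2[name]:.5f}, pseudo-F = {f_text}")
```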

Based on these results, we advise against varying truncLenR between sequencing runs, as it may lead to inconsistent standards of ecological inference across data sets consisting of samples from multiple runs. However, varying maxEER to suit the overall quality of each sequencing run may be appropriate depending on the metrics of interest. The sensitivity analysis framework above can be generalized to test the robustness of ecological inference to other processing decisions, such as paired-end read merging, DADA2 sequence alignment heuristics, and incorporation of data from different sequencing runs or sequencing platforms.
Lessons Learned and Future Directions
Technical challenges associated with processing large-scale marker gene sequence data sets
There is significant technical variation among NEON sequencing runs that inevitably impacts subsequent bioinformatic processing. While most NEON sequencing runs produced high-quality data, certain runs generated substantially fewer sequences that passed quality filtering (Fig. 4). Accordingly, the quality filtering parameters recommended here necessarily represent a compromise, given the goal of compiling dozens of sequencing runs generated by different sequencing centers over many years. A critical step of the pipeline proposed here requires that the same portion of rRNA is used to denoise ASVs across all sequencing runs; for compatibility and ease of cross-study comparison, we recommend that future studies employ identical primers. Variation across Illumina sequencing runs necessarily generates variation in the behavior of quality filtering parameters employed in DADA2; however, these parameters must be standardized across sequencing runs in order to join ASV tables and to cluster artefactual sequence-length variants. As Illumina sequencing chemistry changes and new platforms emerge, we expect that the filtering steps employed here will need to be updated. Notably, because reverse reads were consistently low quality across ITS sequencing runs, paired-end ITS read processing is not explicitly supported by our pipeline. Low-quality ITS reverse reads are typical of Illumina MiSeq data. While the 250-bp unmerged forward read sequences may potentially bias against certain fungal taxa (Truong et al. 2019), the extent of this bias is likely small (Nguyen et al. 2015, Pauvert et al. 2019).
NEON’s continental-scale sample network captures a remarkably broad phylogenetic range of microbial taxa. Accordingly, analyzing the effect of geographic and ecological distance among samples depends on the taxonomic scale of investigation. Perhaps unsurprisingly, certain samples derived from distinct habitats share no ASVs in common, creating disjunctions in community dissimilarity matrices that can complicate distance-based analyses, such as ordination. Clustering ASVs at the OTU level, however defined (e.g., 97–98.5% sequence similarity), can reduce these statistical disjunctions and allow for more meaningful continental-scale analyses of community dissimilarity. In addition, access to sufficient computing resources represents a challenge inherent to data sets of this size. Although recent R packages such as SpeedySeq (McLaren 2020) can expedite some commands run with the popular Phyloseq package (McMurdie and Holmes 2013), we expect further developments in this area.
Finally, rapid advances in high-throughput sequencing technologies may allow for the generation of long reads that span the entire ITS1, ITS2, and 18S region for fungi, and the entire 16S for bacteria, thereby allowing enhanced resolution of fine-scale taxonomic boundaries and more accurate phylogenetic placement. In order to ensure compatibility with the extant NEON data presented here, large portions of read overlap with the existing ITS1 and 16S V3-V4 regions analyzed here will be necessary for sufficient sequence alignment and ASV and OTU clustering.
Future directions for NEON-enabled microbial ecology
Spatial and temporal dynamics of soil microbial communities
NEON’s extensive soil sampling network fulfills a pressing need for standardized microbial data in advancing research on the spatial and temporal dynamics of soil microbial communities. Soil microbial communities are known to display rapid turnover in space (Franklin and Mills 2003, Nemergut et al. 2013) and time (Ferrenberg et al. 2013, Lauber et al. 2013, Shade et al. 2013). However, there has historically been a trade-off between spatial and temporal sampling intensity, limiting the generalizability of biogeographic patterns to unsampled regions or timespans. Spatially nested sampling designs like that of NEON’s soil data products allow researchers to quantify the spatial scaling of microbial diversity from soil cores to continents and to identify its drivers at each scale (Talbot et al. 2014). Furthermore, because NEON is committed to multiple decades of data collection, its steady accumulation of microbial sequence data will facilitate research on the scaling of microbial diversity over time, from intra-annual to decadal scales. In combination with other NEON data products (such as climate, soil physical and chemical properties, and vegetation cover), the soil microbe data will also help to elucidate the fundamental drivers of temporal scaling (Guo et al. 2019).
NEON soil microbe data may also be informative in comparing the rates of scaling across intersecting gradients of spatial, temporal, and taxonomic scales. For example, a recent study suggests that intra-annual variation in soil fungal communities is comparable to that occurring over hundreds to thousands of kilometers of space (Averill et al. 2019). One of the implications of this rapid spatial and temporal turnover is that sample-pairwise compositional similarity may drop off rapidly, creating a technical challenge for dissimilarity-based analyses when two samples in the data set have no taxa in common. This challenge can be partially addressed by shifting the unit of taxonomic analysis, for example, from ASVs to OTUs, or by using phylogenetic measures of beta-diversity (Lozupone and Knight 2005). Future studies, then, may also explore how turnover in soil microbial communities interacts with taxonomic scale.
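The effect of shifting the taxonomic unit can be seen in a toy example. The Python sketch below assumes that a mapping from ASVs to OTUs has already been obtained from a clustering tool at some similarity threshold; the sample contents and the `asv_to_otu` mapping are hypothetical and chosen only to show how collapsing ASVs into OTUs can turn a saturated dissimilarity into an informative one.

```python
from collections import defaultdict


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples given as {taxon: count} dicts."""
    taxa = set(a) | set(b)
    num = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    den = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return num / den


def collapse_to_otus(sample, asv_to_otu):
    """Sum ASV counts within each OTU, using a precomputed ASV-to-OTU mapping."""
    otu_counts = defaultdict(int)
    for asv, count in sample.items():
        otu_counts[asv_to_otu[asv]] += count
    return dict(otu_counts)


# Hypothetical samples from two distinct habitats that share no ASVs.
site_a = {"ASV1": 60, "ASV2": 40}
site_b = {"ASV3": 50, "ASV4": 50}

# Hypothetical clustering result: ASV1 and ASV3 fall within the same OTU.
asv_to_otu = {"ASV1": "OTU1", "ASV3": "OTU1", "ASV2": "OTU2", "ASV4": "OTU3"}

asv_level = bray_curtis(site_a, site_b)
otu_level = bray_curtis(
    collapse_to_otus(site_a, asv_to_otu),
    collapse_to_otus(site_b, asv_to_otu),
)
print(asv_level, otu_level)
```

At the ASV level the two samples are maximally dissimilar, so their distance carries no information about how different they are relative to other pairs. After collapsing to OTUs the shared lineage contributes to the comparison and the dissimilarity falls below its maximum.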
In addition, the NEON sampling network provides a unique opportunity to observe lags in response time between abiotic variables and changes in soil microbes and to detect the influence of history on community structure. In traditional ecosystem modeling, microbial communities have been assumed to be resilient to disturbance and to return quickly to a state of equilibrium (Allison and Martiny 2008). However, a growing body of evidence suggests that this is not the case; microbes experience legacy effects from historical precipitation regimes (Evans and Wallenstein 2012), plant communities (Elgersma et al. 2011), and wildfires (Qin et al. 2020) that may last several years after the change from prior conditions. In the case of historical contingencies such as priority effects, the equilibrium state may also change (Hawkes and Keitt 2015). By comparing the temporal dynamics of soil microbial communities with other variables recorded in NEON data products, we can ask how long it takes for soil microbes to react to environmental shifts (e.g., in mean precipitation, in mean temperature), how resilient the microbial constituents are to this change, and whether historical events modify the equilibrium states of the microbial community. Furthermore, the NEON sampling network allows researchers to ask questions about synchrony in the spatiotemporal dynamics of microbial communities and to link these dynamics to stability in microbe-mediated ecosystem processes (Hall et al. 2018, Wang et al. 2019).
Finally, the breadth of NEON soil microbe data now allows researchers to compare the biogeographic patterns and processes of soil microbial communities with those of plants and animals, for which abundance data are also being collected at NEON sites. This can be used to test the generalizability of macroecological patterns (Xu et al. 2020, Dickey et al. 2021) or temporal patterns (Shade et al. 2013, Guo et al. 2019) that have traditionally been developed for macroorganisms. It may also be leveraged to understand whether the assembly “rules” that govern the distributions of macroorganisms apply equally well to microbial community assembly across multiple nested spatial scales.
Microbial community composition and ecosystem processes
Box 1. Future questions for NEON-enabled microbial ecology
Spatial and temporal dynamics
- How do soil microbial communities vary across spatial scales (sites, ecoregions, continents) and temporal scales (seasonal, annual, decadal)? What are the important drivers?
- How does microbial diversity scale over space, time, and taxonomic resolution?
- What are the patterns of temporal or spatial autocorrelation in soil microbial communities?
- How do macroecological and biogeographical patterns of soil microbial communities vary across spatial scales? Do they follow the same “rules” as for macroorganisms?
Ecosystem processes
- How can we effectively include microbial communities in ecosystem and earth system models?
- Can we predict ecosystem functions and services (e.g., C flux) from microbial taxonomic or functional composition?
- How can we forecast future changes in these processes?
Going beyond NEON data
- How can we design future studies to take advantage of and complement NEON observatory, biorepository, and assignable asset data?
- What are some best practices for fostering interdisciplinary team science, synthesizing a variety of NEON data products to answer complex ecological problems?
Expanded use of NEON samples and infrastructure
For research needs that are not precisely met by existing NEON data streams, NEON also offers two programs to help researchers leverage NEON sample collections and field infrastructure. The NEON Biorepository Data Portal allows researchers to request access to biological samples, including frozen subsamples of the soil and extracted DNA used to generate the soil microbe marker gene sequence data product, in order to conduct their own laboratory analyses. Researchers interested in using a different sequencing protocol or conducting a functional assay, for example, may take advantage of this program. As another example, a researcher interested in studying food webs may request biorepository samples to identify arthropods in pitfall traps beyond beetles—in addition to NEON soil microbe amplicon, abiotic, and metagenomic data sets. Furthermore, the NEON Assignable Assets Program allows researchers to request the use of specialized NEON data collection infrastructure for their own research, temporarily adding to the sampling design of NEON field sites. To return to our example, the researcher interested in studying food webs may conduct some of their own on-site sampling using similar designs to survey nematodes, via the Assignable Assets Program’s Observational Sampling Infrastructure. The flexibility built into the NEON data stream infrastructure greatly expands the potential to accommodate future research directions that were not part of the original design.
Going beyond NEON data
Using networks to synthesize ecological and environmental research offers promising new avenues for scientists to create holistic representations of natural processes, particularly in fields that account for complex, large-scale phenomena such as biogeography (Schrodt et al. 2019). As a nationwide monitoring network, NEON provides broad coverage for the collection of ecological and environmental data; however, it is limited in its ability to provide sites for field experiments. The US Long-Term Ecological Research network (LTER) provides complementary infrastructure for experimental studies in a variety of ecosystems and may help to elucidate the processes driving patterns observed in NEON data (Jones et al. 2020). At the time of writing, twelve NEON sites are co-located with LTER sites.
Although scientific research aims to explain natural processes, it is also an inherently social process in which tacit, socially transferred knowledge is especially important for the extension of methods to novel or synthetic contexts (Collins 1974). The unprecedented spatial and temporal scales of the soil microbe marker gene sequence data provided by NEON represent such a context to develop best practices for team science. Methodological and epistemological challenges involved in using these data led the authors of this paper to recognize the necessity of having a team of collaborators to validate methods and test results before formally embedding them into a standard algorithmic process. While there is some research on the social and technical factors that allow for effective team science (Rhoten 2003, Oliver et al. 2018), there is room to consider how to best foster collaborations that can synthesize the wide variety of NEON data products to address interdisciplinary problems (e.g., Nagy et al. 2021). Interdisciplinary collaborations have been identified as avenues for fruitful and novel research in ecology and the environment as discussed above, but especially for understanding complex socio-environmental issues (Palmer et al. 2016). They also provide opportunities for graduate students in ecology to realize and develop the unique expertise they bring to the team (Giorgio et al. 2020). Factors that may have contributed to our ability to complete this project include the diversity of expertise across our team members, which included soil microbial ecologists, molecular biologists, and statisticians, as well as a diversity of career stages that allowed graduate student members to receive real-time feedback from an informal community of mentors. 
One of the main challenges to our project was the inability of the authors to hold meetings in person after the Summit, a challenge which was exacerbated by the COVID-19 pandemic, and which previous studies have identified as a potential hindrance to information sharing (Rhoten 2003). Future studies should seek to understand what types of social and technical configurations facilitate or hinder data-intensive, interdisciplinary team science, and how data-sharing centers such as NEON can take advantage of these findings to make their data more accessible and useful across diverse research contexts.
Conclusions
We present neonMicrobe, a processing pipeline for the R statistical computing environment that streamlines access to NEON microbe marker gene sequence data. Our approach adapts state-of-the-art sequence processing pipelines for current NEON marker gene sequencing approaches. We have validated the efficacy of recommended quality filtering parameters in our pipeline. The collaborative effort represented here speaks to the utility of open science, and our publicly available data wrangling tools can be adopted for user-specific applications. We expect this community resource will expedite NEON-enabled science and herald a new era of continental-scale analysis for microbial community dynamics.
Acknowledgments
We acknowledge funding from NSF (DEB 2026815 to KN, DEB 1926335 to KP, DEB 1926438 to KZ). ZW was supported by DEB 1638577. RB was supported by NSF OAC-1940062 and DBI-1743442 and by USDA NIFA AFRI 2020-68013-30934. We would like to thank Stanford University and the Stanford Research Computing Center, as well as the Hummingbird Computational Cluster at the University of California, Santa Cruz, for providing computational resources and support that contributed to these research results. The inspiration for the plotEEProfile function came from a GitHub Issue comment by Rémi Maglione.
Open Research
Data Availability Statement
Code is provided as Data S1–S3 and is archived on Zenodo: https://doi.org/10.5281/zenodo.5553228