Sequencing

The amplicon library is loaded onto a sequencer, which reads the nucleotide sequence of millions of DNA fragments simultaneously. The sequencer produces a FASTQ file per sample (or per run, if not yet demultiplexed).

Primer design

Universal primers bind the conserved regions that flank the variable region being amplified. “Universal” is a relative term: primers vary in how well they match across the full breadth of bacterial and archaeal diversity. Some lineages (e.g., Verrucomicrobia, Tenericutes, Archaea) can be systematically underamplified by common primer sets.
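To make the idea of primer match concrete, the sketch below counts mismatches between a degenerate primer and a candidate binding site using the standard IUPAC nucleotide codes. The primer and site sequences are short hypothetical examples chosen for illustration, not a recommended primer set, and real primer evaluation also weighs mismatch position (3' mismatches matter most), which this sketch ignores.

```python
# IUPAC nucleotide codes: each code maps to the bases it can pair as.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_mismatches(primer, site):
    """Count positions where `site` is not covered by the degenerate `primer`.

    Both sequences are written 5'->3' on the same strand and must have
    equal length; `site` is expected to contain only A, C, G, or T.
    """
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    return sum(base not in IUPAC[code] for code, base in zip(primer, site))

# Hypothetical primer fragment and three hypothetical template sites.
primer = "GTGYCAGCM"
sites = {
    "lineage_1": "GTGCCAGCA",  # Y covers C, M covers A
    "lineage_2": "GTGTCAGCC",  # Y covers T, M covers C
    "lineage_3": "GTGACAGCG",  # A is not covered by Y, G is not covered by M
}
mismatches = {name: count_mismatches(primer, s) for name, s in sites.items()}
print(mismatches)
```

A lineage whose binding site carries mismatches, as in the third hypothetical site, would be expected to amplify less efficiently than the others.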

Primers for Illumina sequencing are ordered with adapter sequences attached so the amplicon can be directly sequenced without additional ligation. Some protocols use a two-step PCR: first to amplify the target, second to add the full sequencing adapters and barcodes.

How samples become reads

Understanding the flow from plate to data file makes many downstream analytical decisions easier to interpret:

1. Extraction and amplification: DNA is extracted from each sample and the 16S region is amplified with barcoded primers. Each sample gets a unique barcode, so its reads can later be assigned to that sample even though all samples are pooled together.
2. Pooling: Amplified libraries from all samples are combined into a single tube and loaded onto the sequencer together. All samples share the same flow cell.
3. Sequencing: The machine randomly samples from the mixed pool. It does not allocate reads equally across samples; it samples stochastically from whatever DNA is present. As a result, sequencing depth often differs substantially across samples.
4. Demultiplexing output: After sequencing, reads are sorted back to individual samples using their barcodes, producing one FASTQ file (or file pair for paired-end sequencing) per sample.
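
The demultiplexing step can be sketched in a few lines of Python. This is a minimal illustration, assuming inline barcodes at the start of each read, exact barcode matching, and a hypothetical barcode-to-sample mapping; production demultiplexers also tolerate mismatches and handle barcodes stored in separate index reads.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample mapping (normally taken from the sample sheet).
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B"}
BARCODE_LEN = 4

def demultiplex(reads, barcodes=BARCODES, barcode_len=BARCODE_LEN):
    """Group (header, sequence, quality) reads by their leading barcode.

    The barcode is trimmed from the sequence and quality string. Reads
    whose barcode is not in the mapping are collected under 'undetermined'.
    """
    by_sample = defaultdict(list)
    for header, seq, qual in reads:
        sample = barcodes.get(seq[:barcode_len], "undetermined")
        by_sample[sample].append((header, seq[barcode_len:], qual[barcode_len:]))
    return dict(by_sample)

# Toy pooled reads; in practice these would be parsed from a FASTQ file.
pooled = [
    ("read1", "ACGTGGCCTTAA", "IIIIIIIIIIII"),
    ("read2", "TGCAGGCCTTAA", "IIIIIIIIIIII"),
    ("read3", "ACGTGGCCTTAG", "IIIIIIIIIIII"),
    ("read4", "NNNNGGCCTTAA", "IIIIIIIIIIII"),
]
result = demultiplex(pooled)
print({sample: len(reads) for sample, reads in result.items()})
```

Note that the per-sample read counts fall out of the pool composition; nothing in the sorting step equalizes them.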

Always remember the pooling!

The stochastic nature of step 3 (sequencing) is why read counts vary across samples even when pooling was careful: differences in pipetting accuracy, PCR efficiency, DNA quality, and random cluster formation all contribute. This technical variation in sequencing depth is the reason normalization or rarefying is required before comparing diversity across samples. Without it, differences in depth can bias downstream diversity estimates and the detection of low-abundance taxa.
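
As an illustration of rarefying, the sketch below subsamples each sample's reads without replacement down to the depth of the shallowest sample. The taxon counts are hypothetical, and the function is a teaching aid rather than a replacement for the rarefaction routines in established analysis packages.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=0):
    """Subsample a {taxon: read_count} table to `depth` reads without replacement."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError(f"depth {depth} exceeds sample total {total}")
    # Expand counts into one label per read, then draw without replacement.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return dict(Counter(rng.sample(pool, depth)))

# Hypothetical samples with the same composition but a 5-fold depth difference.
samples = {
    "sample_A": {"Bacteroides": 6000, "Prevotella": 3900, "Akkermansia": 100},
    "sample_B": {"Bacteroides": 1200, "Prevotella": 780, "Akkermansia": 20},
}
target = min(sum(c.values()) for c in samples.values())
rarefied = {name: rarefy(c, target) for name, c in samples.items()}
for name, c in rarefied.items():
    print(name, sum(c.values()), c)
```

After rarefying, both samples have the same total, so differences in the number of taxa observed are no longer driven by depth alone. The cost is that reads from deeper samples are discarded.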

PCR duplicates and the relative abundance problem

Because PCR is exponential, small differences in amplification efficiency between templates compound with every cycle, so relative abundance in the final library does not equal relative abundance in the original sample. A taxon at 1% true abundance might represent 1% or 3% or 0.3% in your sequence data depending on how efficiently it amplified. This is why 16S results are relative abundances, not absolute counts, and why absolute quantification requires additional approaches (e.g., spike-in standards, qPCR).
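
The compounding effect can be made concrete with a simple model in which each cycle multiplies a template's copy number by (1 + efficiency). The efficiencies, cycle number, and two-taxon community below are illustrative assumptions, and real PCR efficiency is not constant across cycles.

```python
def final_fractions(start_fractions, efficiencies, cycles):
    """Relative abundances after PCR, assuming constant per-cycle efficiency.

    Each cycle multiplies a template's copy number by (1 + efficiency),
    where efficiency ranges from 0 (no amplification) to 1 (perfect doubling).
    """
    amplified = {
        taxon: frac * (1 + efficiencies[taxon]) ** cycles
        for taxon, frac in start_fractions.items()
    }
    total = sum(amplified.values())
    return {taxon: copies / total for taxon, copies in amplified.items()}

# Hypothetical community: one taxon at 1% true abundance, everything else lumped together.
start = {"rare_taxon": 0.01, "background": 0.99}
for rare_eff in (0.85, 0.90, 0.95):
    eff = {"rare_taxon": rare_eff, "background": 0.90}
    observed = final_fractions(start, eff, cycles=25)
    print(f"efficiency {rare_eff:.2f}: observed {observed['rare_taxon']:.2%}")
```

Under these assumptions, a difference of five percentage points in per-cycle efficiency is enough to roughly halve or double the observed abundance of the 1% taxon after 25 cycles.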

Single-end vs. paired-end sequencing

Illumina instruments can read each DNA fragment from one end only (single-end) or from both ends (paired-end), producing a forward read (R1) and a reverse read (R2). For paired-end runs, if the read length is sufficient, R1 and R2 overlap in the middle of the amplicon. This overlapping region is used to error-correct both reads. Disagreements in the overlap are resolved using quality scores, and this substantially improves accuracy.
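
The quality-based resolution of disagreements can be sketched as follows. This is a simplified illustration that assumes the overlap length is already known and simply keeps the higher-quality base at each overlapping position; real read mergers find the overlap by alignment and recompute quality scores for the merged bases. The sequences and quality values are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def merge_pair(r1_seq, r1_qual, r2_seq, r2_qual, overlap):
    """Merge a read pair whose overlap length is already known.

    Qualities are lists of integer Phred scores. R2 is reverse-complemented
    so both reads are on the same strand; where the reads disagree in the
    overlap, the base with the higher quality score is kept.
    """
    if not 0 < overlap <= min(len(r1_seq), len(r2_seq)):
        raise ValueError("overlap must be positive and no longer than either read")
    r2_rc = reverse_complement(r2_seq)
    r2_rc_qual = r2_qual[::-1]
    split = len(r1_seq) - overlap  # index in R1 where the overlap begins
    consensus = []
    for i in range(overlap):
        b1, q1 = r1_seq[split + i], r1_qual[split + i]
        b2, q2 = r2_rc[i], r2_rc_qual[i]
        consensus.append(b1 if q1 >= q2 else b2)
    return r1_seq[:split] + "".join(consensus) + r2_rc[overlap:]

# Hypothetical 16 bp amplicon read with 10 bp reads from each end (4 bp overlap).
amplicon = "ACGTACGGTTCAGCTA"
r1 = amplicon[:9] + "A"                  # last base of R1 is a sequencing error
r1_q = [38] * 9 + [12]                   # and carries a low quality score
r2 = reverse_complement(amplicon[6:])    # R2 is read from the opposite strand
r2_q = [38] * 10
merged = merge_pair(r1, r1_q, r2, r2_q, overlap=4)
print(merged == amplicon)
```

In this toy case the low-quality error at the end of R1 is overruled by the high-quality base from R2, so the merged sequence matches the original amplicon.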

Single-end may be sufficient for short amplicons like V4 (~250 bp) when a single read covers most of the amplicon (e.g., 1×250 bp), though paired-end is still preferred.
Paired-end is required for longer amplicons like V3–V4 (~460 bp), where overlap is essential for merging and error correction.
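
Whether a given read length leaves enough overlap is simple arithmetic: the two reads together cover twice the read length, and whatever exceeds the amplicon length is overlap. The sketch below applies this to the amplicon sizes mentioned above. It assumes untrimmed reads; quality trimming of read tails reduces the usable overlap, and the minimum overlap a merger requires depends on the tool.

```python
def expected_overlap(read_length, amplicon_length):
    """Overlap in bp between R1 and R2 for untrimmed paired-end reads.

    A negative value means the reads leave a gap in the middle of the
    amplicon and cannot be merged. The overlap is capped at the amplicon
    length for reads that span the entire amplicon.
    """
    return min(2 * read_length - amplicon_length, amplicon_length)

# Amplicon lengths are the approximate values quoted in the text.
for region, amplicon_length in (("V4", 250), ("V3-V4", 460)):
    for read_length in (150, 250, 300):
        overlap = expected_overlap(read_length, amplicon_length)
        status = f"{overlap} bp overlap" if overlap > 0 else f"{-overlap} bp gap"
        print(f"{region} (~{amplicon_length} bp), 2x{read_length}: {status}")
```

For example, a ~460 bp V3–V4 amplicon has a nominal 140 bp overlap with 2×300 bp reads but only 40 bp with 2×250 bp reads, which leaves little margin once low-quality tails are trimmed.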

Paired-end sequencing

Because paired-end sequencing is generally the standard, the rest of this workshop focuses on paired-end data. The tools referenced in the coding examples can be readily adapted to single-end data if required.

Sequencing platforms

MiSeq is the most common platform for 16S work. It supports longer reads (2×250 bp or 2×300 bp paired-end) and has a well-established track record for amplicon sequencing. Throughput is lower (~25M reads/run) and cost per read is higher than HiSeq/NovaSeq, but the read quality and length are well-suited to 16S amplicons (~460 bp for V3–V4).

NovaSeq produces vastly more reads but with shorter read lengths (2×150 bp standard). This is sufficient for V4-only studies and can be used when large sample numbers or high depth are needed.

What the sequencing core delivers

You will typically receive:

One FASTQ file per sample per read direction (R1 and R2), or a single multiplexed FASTQ with all samples combined
A sample sheet mapping barcodes to sample IDs
A run report with quality metrics (% clusters passing filter, % bases at or above Q30)

Inspect the run report before beginning analysis. A run with fewer than 75% of clusters passing filter, or fewer than 70% of bases at or above Q30, warrants a conversation with the core facility.
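
If you want to sanity-check the reported Q30 value against the delivered files, the fraction of bases at or above Q30 can be computed directly from FASTQ quality strings. The sketch below assumes the standard Phred+33 encoding used by current Illumina output; the function names and the example quality strings are hypothetical, and the thresholds are the ones quoted above.

```python
def fraction_q30(quality_strings, threshold=30, offset=33):
    """Fraction of bases whose Phred quality is >= threshold (Phred+33 encoding)."""
    total = passing = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0

def needs_follow_up(pct_pf, pct_q30, min_pf=75.0, min_q30=70.0):
    """Flag a run whose metrics fall below the thresholds given in the text."""
    return pct_pf < min_pf or pct_q30 < min_q30

# Toy quality strings: 'I' = Q40, '?' = Q30, '5' = Q20, '#' = Q2.
quals = ["IIII?5", "II##"]
q30 = fraction_q30(quals)
print(f"{q30:.0%} of bases at or above Q30")

# Hypothetical run-report values (percentages).
print(needs_follow_up(pct_pf=82.0, pct_q30=85.0))
print(needs_follow_up(pct_pf=82.0, pct_q30=65.0))
```

In practice the quality strings would be read from every fourth line of each FASTQ record rather than typed in by hand.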