QC Reports
Before any preprocessing begins, you need to understand what your data looks like. Two tools are standard for inspecting the quality of your sequencing output: FastQC and QIIME2’s demux summarize. Both give you a view of per-base quality scores across reads, which is the primary input for choosing truncation parameters.
Checking your results
After demultiplexing (if your data required it), inspect:
- Read counts per sample: are they roughly balanced? An outlier with dramatically fewer reads may have had poor library prep, a failed PCR, or a barcode collision.
- Unassigned reads: what fraction of reads could not be assigned to any sample? Some unassigned reads are normal (reads from PhiX, primer dimers, index-hopping), but if >20% are unassigned, something went wrong.
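Both checks are easy to script once you have a table of per-sample read counts. The sketch below assumes a hypothetical tab-separated file with no header row, a sample ID in column 1, a read count in column 2, and unassigned reads listed in a row labelled `Unassigned`. The function names and the 10%-of-median cutoff for "dramatically fewer reads" are illustrative choices, not a standard.

```shell
# Sketch only: the file layout, the "Unassigned" label, and the default
# cutoff of 0.1 are assumptions made for illustration.

# Print the fraction of all reads that sit in the "Unassigned" row.
unassigned_fraction() {
  awk -F'\t' '
    { total += $2 }
    $1 == "Unassigned" { un += $2 }
    END { if (total > 0) printf "%.3f\n", un / total }
  ' "$1"
}

# Print the IDs of samples whose read count is below cutoff * median.
# Usage: low_count_samples counts.tsv [cutoff]
low_count_samples() {
  awk -F'\t' '$1 != "Unassigned" { print $2 "\t" $1 }' "$1" |
    sort -n |
    awk -F'\t' -v c="${2:-0.1}" '
      { n[NR] = $1; id[NR] = $2 }
      END {
        if (NR == 0) exit
        # Input is sorted by count, so the median is the middle row(s).
        median = (NR % 2) ? n[(NR + 1) / 2] : (n[NR / 2] + n[NR / 2 + 1]) / 2
        for (i = 1; i <= NR; i++) if (n[i] < c * median) print id[i]
      }
    '
}
```

For example, `unassigned_fraction counts.tsv` prints one number to compare against the 20% guideline, and `low_count_samples counts.tsv 0.25` lists samples with fewer than a quarter of the median read count.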
Looking at a quality report
FastQC is one standard tool for a quick sanity check on raw reads. It produces plots of:
- Per-base quality scores across the read length
- Per-sequence quality distribution
- Sequence length distribution
- Adapter content
- Overrepresented sequences (primers should appear here)
QIIME2’s demux plugin produces similar quality plots and is often the first thing you look at after importing your data within the QIIME2 framework.
Tip: Code and tool examples
FastQC
See here for more information: documentation
# Run the analysis on all files in the current folder ending in .fastq.gz
fastqc *.fastq.gz
QIIME2 Demux Summarize
See here for more information: documentation
# Summarize a demultiplexed artifact; the output name demux.qzv is an example
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv
A common diagnostic plot is the per-base quality score distribution, usually drawn as a box plot or line plot showing, at each read position, the spread of quality scores across all reads. In the example below, the shaded band is the interquartile range (the middle 50% of reads), the solid line is the median, and the thin lines are the whiskers. Toggle between R1 and R2 to see how quality degrades along the length of the read.
This plot helps us decide where to truncate our reads, based on the quality scores we saw previously in the FASTQ files.
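To make the plot less of a black box, the median line can be approximated directly from a FASTQ file. The following is a minimal sketch, assuming standard 4-line FASTQ records and Phred+33 quality encoding; the function name `per_position_median_quality` is made up for this example, and FastQC and QIIME2 compute their own summaries (QIIME2, for instance, works from a subsample of reads), so the numbers will not match exactly.

```shell
# Read FASTQ text on stdin and print "position<TAB>median_quality" per position.
# Assumes 4-line records and Phred+33 encoding. For an even number of reads the
# lower of the two middle values is reported.
per_position_median_quality() {
  awk '
    BEGIN {
      # Lookup table from quality character to Phred score (ASCII code - 33).
      for (i = 33; i <= 126; i++) phred[sprintf("%c", i)] = i - 33
    }
    NR % 4 == 0 {
      # Every fourth line is a quality string; tally scores per position.
      len = length($0)
      if (len > maxlen) maxlen = len
      for (p = 1; p <= len; p++) {
        hist[p, phred[substr($0, p, 1)]]++
        depth[p]++
      }
    }
    END {
      for (p = 1; p <= maxlen; p++) {
        seen = 0
        # Walk the histogram upward until half of the reads are covered.
        for (q = 0; q <= 93; q++) {
          seen += hist[p, q]
          if (seen * 2 >= depth[p]) { print p "\t" q; break }
        }
      }
    }
  '
}
```

A typical call would be `gzip -cd sample_R1.fastq.gz | per_position_median_quality`, where `sample_R1.fastq.gz` stands in for one of your own files.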
Note: Reading the plot
The green zone (Q ≥ 30) is good quality. The yellow zone (20 ≤ Q < 30) is acceptable but worth watching. The red zone (Q < 20) should be trimmed before analysis. Notice that R2 dips into the yellow zone well before position 250; this is normal and is why DADA2 truncation lengths for R1 and R2 are set independently.
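One way to turn the colour zones into a first-guess truncation length is to find the last position before the median quality drops below a threshold. The sketch below assumes a hypothetical two-column table on stdin (position, then median quality, tab-separated and sorted by position); `suggest_truncation` and its default threshold of 20 are illustrative. Treat the result as a starting point only: for paired-end data the truncated reads must still overlap enough to merge.

```shell
# Print the last position before the median quality first falls below min_q.
# Usage: suggest_truncation [min_q] < table.tsv   (min_q defaults to 20)
# If the median never falls below min_q, the last position seen is printed.
suggest_truncation() {
  awk -F'\t' -v min_q="${1:-20}" '
    $2 < min_q { print $1 - 1; found = 1; exit }
    { last = $1 }
    END { if (!found) print last }
  '
}
```

Run it once on the R1 table and once on the R2 table, since the two reads usually lose quality at different positions.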