FASTQ Files

After sequencing, you receive your data as FASTQ files. FASTQ is a plain-text format that stores each read’s sequence together with the sequencer’s confidence estimate for every base. Understanding the format helps you interpret quality reports and follow what downstream tools are doing.

The FASTQ format

Each read in a FASTQ file is represented by four lines:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Line	Contains
Line 1	Read identifier, starting with `@` (on Illumina: instrument, run, flow cell, coordinates, barcode)
Line 2	DNA sequence
Line 3	`+` (separator, sometimes repeats the identifier)
Line 4	Quality scores, one character per base
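
The four-line layout above can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser: it assumes every record occupies exactly four lines (no wrapped sequences), and the names `FastqRecord` and `parse_fastq` are made up for this illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, one character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four lines; assumes sequences are not wrapped."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip stray blank lines between records
        try:
            sequence = next(it).rstrip("\r\n")
            separator = next(it).rstrip("\r\n")
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError(f"truncated record: {header!r}") from None
        if not header.startswith("@"):
            raise ValueError(f"identifier line must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator, got: {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up record, for illustration only.
example = ["@read1\n", "GATTACA\n", "+\n", "IIIIHH#\n"]
for record in parse_fastq(example):
    print(record.identifier, record.sequence, record.quality)
```

Reading strictly in groups of four avoids a classic pitfall: a quality line may itself begin with `@`, so looking for `@` at the start of a line is not a reliable way to find record boundaries.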

Code and tool examples

Because each read occupies exactly four lines, you can count reads by dividing the total line count by four.

# uncompressed file
echo $(($(wc -l < input.fastq) / 4))

# gzip-compressed file (gzip -dc is a portable equivalent of zcat)
echo $(($(gzip -dc input.fastq.gz | wc -l) / 4))
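
The same count can be sketched in Python, which also makes it easy to flag a line count that is not a multiple of four (a hint that the file may be truncated or malformed). The helper names `open_fastq` and `count_reads` are hypothetical, and the sketch assumes compressed files can be recognized by a `.gz` suffix.

```python
import gzip


def open_fastq(path):
    """Open a FASTQ file in text mode, decompressing if the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_reads(path):
    """Return the number of reads, assuming four lines per read."""
    with open_fastq(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of four")
    return n_lines // 4


# Usage (file name taken from the shell example above):
# count_reads("input.fastq.gz")
```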

Phred quality scores

The quality score for each base is encoded as a single ASCII character. The underlying value (called a Q score or Phred score) represents the sequencer’s estimate of the probability that the base call is wrong:

Q score	Error probability	Accuracy
Q10	1 in 10	90%
Q20	1 in 100	99%
Q30	1 in 1,000	99.9%
Q40	1 in 10,000	99.99%
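
The table follows from the relationship Q = -10 * log10(P), where P is the error probability. The sketch below decodes quality characters into Q scores and error probabilities. It assumes the Phred+33 encoding (Q = ASCII code minus 33) used by current Illumina output; the text above does not state the offset, and some older files use an offset of 64 instead. The function names are made up for this illustration.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Q scores (assumes Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


# "!", "5", "?" and "I" have ASCII codes 33, 53, 63 and 73.
print(phred_scores("!5?I"))  # [0, 20, 30, 40]
for q in phred_scores("5?I"):
    print(q, error_probability(q))
```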

Q30 is the benchmark most commonly used for a high-quality base call. Illumina run reports give the percentage of bases at or above Q30 as a key quality metric.
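
The fraction of bases at or above Q30 can be computed directly from the quality lines. This is a rough sketch assuming Phred+33 encoding; `fraction_at_or_above` is a hypothetical helper, and the numbers an instrument reports may be calculated differently (for example, before filtering).

```python
def fraction_at_or_above(qualities, threshold=30, offset=33):
    """Fraction of all bases whose Q score is >= threshold (assumes Phred+33)."""
    total = 0
    passing = 0
    for quality in qualities:
        for ch in quality:
            total += 1
            if ord(ch) - offset >= threshold:
                passing += 1
    return passing / total if total else 0.0


# Two made-up quality strings: six bases at Q40, one at Q20, one at Q0.
print(fraction_at_or_above(["IIII", "II5!"]))  # 0.75
```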

Why do quality scores drop at the 3’ end?

Illumina sequencing reads each template one base per cycle, detecting the fluorescent signal from a cluster of identical copies. As cycles progress, the copies within a cluster lose synchrony: some fall a base behind (phasing) or run a base ahead (prephasing), which blurs the signal. The effect accumulates over the run, so quality tends to degrade toward the 3’ end of every read. This is normal and is why trimming low-quality bases from the 3’ ends of reads is standard practice.
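
To make the idea of 3’ trimming concrete, here is a deliberately simple sketch that removes trailing bases below a quality threshold. It is not the algorithm used by any particular trimming tool (those typically use sliding windows or running sums); the function name, the threshold of 20, and the Phred+33 offset are all assumptions for illustration.

```python
def trim_3prime(sequence, quality, threshold=20, offset=33):
    """Drop bases from the 3' end while their Q score is below threshold."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    # Sequence and quality must stay the same length after trimming.
    return sequence[:end], quality[:end]


# Made-up read: seven bases at Q40, then Q20, Q2 and Q0 at the 3' end.
print(trim_3prime("GATTACAGAT", "IIIIIII5#!"))  # ('GATTACAG', 'IIIIIII5')
```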

More information on Phred scores

If you want to learn more about these scores, the following resources are a good starting point:

GATK PHRED score overview

Illumina Quality Scores for Next-Generation Sequencing

What you actually receive from the core

For a paired-end run, you should expect to receive:

sample_R1.fastq.gz - forward reads (read 1, sequenced from one end of each fragment)
sample_R2.fastq.gz - reverse reads (read 2, sequenced from the other end of the same fragment)

Files are usually gzip-compressed (.fastq.gz or .fq.gz), as they can be large.
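
Paired files are normally expected to list mates in the same order, so the n-th record in R1 and the n-th record in R2 come from the same fragment. A quick sanity check is to compare read names record by record. The sketch below works on any iterable of lines (for example, open file handles); `read_name` and `pairs_in_sync` are hypothetical helpers, and the rule for stripping `/1` and `/2` suffixes is an assumption about how the names are written.

```python
from itertools import zip_longest


def read_name(header):
    """First token of the header line, minus '@' and any trailing /1 or /2."""
    name = header.strip().split()[0]
    if name.startswith("@"):
        name = name[1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def pairs_in_sync(r1_lines, r2_lines):
    """True if both inputs hold the same read names in the same order."""
    # Every fourth line, starting with the first, is an identifier line.
    headers1 = (line for i, line in enumerate(r1_lines) if i % 4 == 0)
    headers2 = (line for i, line in enumerate(r2_lines) if i % 4 == 0)
    for h1, h2 in zip_longest(headers1, headers2):
        if h1 is None or h2 is None:
            return False  # one file has more reads than the other
        if read_name(h1) != read_name(h2):
            return False
    return True


# Made-up mates, for illustration only.
r1 = ["@read1/1\n", "ACGT\n", "+\n", "IIII\n"]
r2 = ["@read1/2\n", "TTGC\n", "+\n", "IIII\n"]
print(pairs_in_sync(r1, r2))  # True
```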

Some cores deliver a single multiplexed file with all samples mixed together. These reads carry barcodes, embedded in the sequence or in the read header, that allow the reads to be separated by sample. In that case, demultiplexing is your first computational step.
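
As a rough illustration of header-based demultiplexing, the sketch below groups reads by the barcode found at the end of the identifier line. It assumes an Illumina-style header whose final colon-separated field is the index sequence (for example `... 1:N:0:ATCACG`), exact barcode matches only, and a barcode-to-sample mapping that you supply. The function names and sample names are made up; for real data, dedicated demultiplexing tools handle mismatches, dual indexes, and barcodes embedded in the sequence.

```python
from collections import defaultdict


def barcode_from_header(header):
    """Return the text after the last ':' in the identifier line."""
    return header.rstrip("\r\n").rsplit(":", 1)[-1]


def demultiplex(lines, barcode_to_sample):
    """Group four-line records by sample; unknown barcodes go to 'undetermined'."""
    buckets = defaultdict(list)
    record = []
    for line in lines:
        record.append(line.rstrip("\r\n"))
        if len(record) == 4:
            barcode = barcode_from_header(record[0])
            sample = barcode_to_sample.get(barcode, "undetermined")
            buckets[sample].append(tuple(record))
            record = []
    return dict(buckets)


# Made-up reads and sample names, for illustration only.
reads = [
    "@inst:1:FC:1:1101:100:200 1:N:0:ATCACG\n", "ACGT\n", "+\n", "IIII\n",
    "@inst:1:FC:1:1101:100:201 1:N:0:CGATGT\n", "GGCC\n", "+\n", "IIII\n",
]
by_sample = demultiplex(reads, {"ATCACG": "sampleA", "CGATGT": "sampleB"})
for sample, records in by_sample.items():
    print(sample, len(records))
```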