FASTQ Files
After sequencing, you receive your data as FASTQ files. FASTQ is a plain-text format that stores both the sequence and the sequencer’s confidence estimate for each base. Understanding this format helps you interpret quality reports and make sense of what the downstream tools are doing.
The FASTQ format
Each read in a FASTQ file is represented by four lines:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
| Line | Contains |
|---|---|
| Line 1 | Read identifier (instrument, run, flow cell, coordinates, barcode) |
| Line 2 | DNA sequence |
| Line 3 | + (separator, sometimes repeats the identifier) |
| Line 4 | Quality scores, one character per base |
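The layout in the table can be checked mechanically. The following is a minimal sketch, assuming an uncompressed file in which no sequence is wrapped across several lines; the function name `check_fastq` is hypothetical and not part of any standard tool.

```shell
# check_fastq FILE: hypothetical helper that validates the four-line layout.
# Assumes every record is exactly four lines (no wrapped sequences).
check_fastq() {
  awk '
    # Line 1 of each record must start with "@"
    NR % 4 == 1 && substr($0, 1, 1) != "@" {
      printf "line %d: header does not start with @\n", NR; bad = 1
    }
    # Line 2: remember the sequence length
    NR % 4 == 2 { seq_len = length($0) }
    # Line 3 must start with "+"
    NR % 4 == 3 && substr($0, 1, 1) != "+" {
      printf "line %d: separator does not start with +\n", NR; bad = 1
    }
    # Line 4 must have one quality character per base
    NR % 4 == 0 && length($0) != seq_len {
      printf "line %d: quality length differs from sequence length\n", NR; bad = 1
    }
    END {
      if (NR % 4 != 0) { print "line count is not a multiple of four"; bad = 1 }
      exit bad + 0
    }
  ' "$1"
}
```

Running `check_fastq input.fastq` prints one message per problem it finds and returns a non-zero exit status if there were any.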
Because each read is exactly four lines, counting reads is simple: divide the total line count by four.
# uncompressed file
echo $(($(wc -l < input.fastq) / 4))
# gzip-compressed file (gzip -dc is a portable alternative if zcat rejects .gz files)
echo $(($(zcat input.fastq.gz | wc -l) / 4))
Phred quality scores
The quality score for each base is encoded as a single ASCII character. The underlying value (called a Q score or Phred score) represents the sequencer’s estimate of the probability that the base call is wrong:
| Q score | Error probability | Accuracy |
|---|---|---|
| Q10 | 1 in 10 | 90% |
| Q20 | 1 in 100 | 99% |
| Q30 | 1 in 1000 | 99.9% |
| Q40 | 1 in 10,000 | 99.99% |
Q30 is commonly treated as the threshold for a high-quality base call. Run reports from Illumina sequencers give the percentage of bases at or above Q30 as a key quality metric.
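The relationship behind the table is Q = -10 × log10(P), where P is the error probability. To see how a quality character maps to a Q score, the sketch below assumes the Phred+33 encoding (the character's ASCII code minus 33), which is what current Illumina output uses; older files may use a different offset. The helper names `phred_of` and `error_prob` are hypothetical.

```shell
# phred_of CHAR: hypothetical helper; prints the Q score for one quality character.
# Assumes Phred+33 encoding (ASCII code minus 33).
phred_of() {
  # A leading single quote makes printf emit the character's numeric code.
  code=$(printf '%d' "'$1")
  echo $((code - 33))
}

# error_prob Q: prints P = 10^(-Q/10), the estimated chance the base call is wrong.
error_prob() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_of '!'    # prints 0
phred_of '?'    # prints 30
phred_of 'I'    # prints 40
error_prob 30   # prints 0.001
```

In the example read above, the first character `!` is therefore the lowest possible score, while the run of `C` characters near the end corresponds to Q34.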
Illumina sequencing reads each template one base per cycle, measuring the fluorescence of a cluster of identical copies. As sequencing cycles progress, the copies in a cluster lose synchrony (phasing): some get one base ahead or behind, blurring the signal. The effect accumulates over the run, so quality degrades toward the 3’ end of every read. This is normal and is why trimming the 3’ ends of reads is standard practice.
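One way to see this drop in your own data is to average the quality score at each read position. This is a rough sketch, assuming Phred+33 encoding and an uncompressed file with four-line records; `mean_quality_by_position` is a hypothetical name. Dedicated QC tools report the same kind of per-position summary in far more detail.

```shell
# mean_quality_by_position FILE: hypothetical helper.
# Prints "position<TAB>mean Q" for every position, assuming Phred+33.
mean_quality_by_position() {
  awk '
    # Build a character -> ASCII code lookup for printable characters
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    # Every fourth line is a quality string
    NR % 4 == 0 {
      n = length($0)
      if (n > max_len) max_len = n
      for (i = 1; i <= n; i++) {
        sum[i] += ord[substr($0, i, 1)] - 33
        count[i]++
      }
    }
    END {
      for (i = 1; i <= max_len; i++) printf "%d\t%.1f\n", i, sum[i] / count[i]
    }
  ' "$1"
}
```

Calling `mean_quality_by_position input.fastq` on a typical run should show the mean falling over the last positions of the read.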
What you actually receive from the core
For a paired-end run, you should expect to receive:
- sample_R1.fastq.gz: forward reads (the primer end)
- sample_R2.fastq.gz: reverse reads (sequenced from the other end)
Files are usually gzip-compressed (.fastq.gz or .fq.gz), as they can be large.
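In a paired-end run, the two files should hold the same number of reads, so comparing the counts is a quick sanity check after a transfer. A minimal sketch, assuming gzip-compressed files with four-line records; `count_reads` and `check_pair` are hypothetical helper names.

```shell
# count_reads FILE: prints the number of reads in a gzip-compressed FASTQ file.
count_reads() {
  echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# check_pair R1 R2: hypothetical helper; reports both counts and
# succeeds only if the two files hold the same number of reads.
check_pair() {
  r1=$(count_reads "$1")
  r2=$(count_reads "$2")
  echo "R1: $r1 reads, R2: $r2 reads"
  [ "$r1" -eq "$r2" ]
}
```

For example, `check_pair sample_R1.fastq.gz sample_R2.fastq.gz` returns a non-zero exit status when the counts differ, which often points to a truncated or incomplete file.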
Some cores deliver a single multiplexed file with all samples mixed together. These reads carry barcodes, embedded in the sequence or in the read header, that allow them to be separated by sample. In that case, demultiplexing is your first computational step.
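Before demultiplexing, it can help to see which barcodes are actually present. The sketch below only tallies them and does not split the file; real demultiplexing is normally done with a dedicated tool. It assumes Illumina-style headers in which the barcode is the last colon-separated field of the header line, which may not match what your core delivers; `count_barcodes` is a hypothetical name.

```shell
# count_barcodes FILE: hypothetical helper; prints "count<TAB>barcode",
# most frequent first. Assumes the barcode is the last colon-separated
# field of each header line in a gzip-compressed FASTQ file.
count_barcodes() {
  gzip -dc "$1" |
    awk -F: 'NR % 4 == 1 { n[$NF]++ } END { for (b in n) print n[b] "\t" b }' |
    sort -k1,1nr -k2,2
}
```

Barcodes with very low counts in the output of `count_barcodes multiplexed.fastq.gz` (a placeholder file name) are often sequencing errors in the barcode itself rather than real samples.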