Linux: Processing raw sequencing fastq files

Introduction

We will start analysing the data we generated with Amy Buck to study the interaction between the nematode Heligmosomoides bakeri and MODEK (intestinal epithelium) cells from the mouse Mus musculus. You can read more about this project in the paper Disentangling sRNA-Seq data to study RNA communication between species

Illumina produces the most common sequencing technology being used Today. Certain applications, for instance for de novo sequencing of large genomes, can greatly benefit from “third-generation” sequencing technology from PacBio or Oxford Nanopore, which can produce very long reads. But for sequencing small RNA, Illumina is still the ideal technology. Currently it can be used to produce 50-250nt reads, either paired-end reads or a single read. For sRNA-Seq, the shortest length and a single read is the best option.

As you may imagine, large computers may be required to properly analyse the shear volume of data generated by current sequencers. Nevertheless, there are many alternatives, and one option is always to select a small fraction of the data and analyse this first, on our personal computers. For this exercise, we will first start with a small part of one of the experiments, and later on we will return to the full dataset.

The file containing the reads we will be using can be accessed here: MODEK_100k.fq.gz. Please download the file and save it in your working directory.

In this practical we will be working with the Linux terminal. You should be able to work through this practical on any computer with a terminal like this, including Mac OS X (with the Terminal.app) and Windows 10 (with the Linux subsystem installed). A command line terminal is the ideal environment to work on large raw sequencing files. Also, many of the programs we will need only run on Linux.

The fastq format

Most output from sequencers nowadays can be found in a format called fastq. It is a plain text format, very similar to the fasta format you might already know about. The “q” at the end refers to “quality” and implies that for each sequenced position, there is a value of quality in addition to the nucleotide.

You might have not thought about this, but the process of sequencing is not perfect, and there is uncertainty associated to every base “call”. That is, with the information that the sequencer acquired (photographs with fluorescence signal in this case) a decision needs to be made as to which nucleotide better represents that information. So, a probability is calculated for each nucleotide, and the nucleotide with the best probability is reported. To avoid having to store very small probability values, they are usually reported as Phred quality scores. These scores are related logarithmically to the probabilities, as shown in the following table:

Phred quality	Probability that the base is incorrect	Base call accuracy
10	1 in 10	90 %
20	1 in 100	99 %
30	1 in 1000	99.9 %
40	1 in 10000	99.99 %

Now let’s have a look at our fastq file. You can use the following code in your Terminal to have a look at the first 4 lines of the file:

gzip -dc MODEK_100k.fq.gz | head -n 4

## @D00261:386:CA8VYANXX:5:1312:6310:25644 1:N:0:ATCACG
## GTCTACGGCCATACCACCCTGAACGCGCCCGATCTCGTTGGAATTCTCGG
## +
## CCCCCCGBGGGGGGGGGGGGGGGGGGGGGGG<=<EFGGGGGGGGGGGGGG

The first line contains an identifier (starting with a @), and the next line the sequence. Then there is a 3rd line that starts with a + which can optionally contain the same identifier (or nothing, to save space). The 4th line contains a series of characters that represent the quality: each character corresponds to the quality of one sequenced base.

Since the real quality values can’t fit in one space, they are transformed into an ASCII character (a single “letter”). The first ASCII characters can’t be printed in a file (they include things like a “null” character, the “bell” sound and the “backspace”). So, a value of 33 is first added to the Phred score and then the corresponding ASCII character is selected. You can see what the decimal ASCII table looks like:

       0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
       8 bs     9 ht    10 nl    11 vt    12 np    13 cr    14 so    15 si
      16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
      24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
      32 sp    33  !    34  "    35  #    36  $    37  %    38  &    39  '
      40  (    41  )    42  *    43  +    44  ,    45  -    46  .    47  /
      48  0    49  1    50  2    51  3    52  4    53  5    54  6    55  7
      56  8    57  9    58  :    59  ;    60  <    61  =    62  >    63  ?
      64  @    65  A    66  B    67  C    68  D    69  E    70  F    71  G
      72  H    73  I    74  J    75  K    76  L    77  M    78  N    79  O
      80  P    81  Q    82  R    83  S    84  T    85  U    86  V    87  W
      88  X    89  Y    90  Z    91  [    92  \    93  ]    94  ^    95  _
      96  `    97  a    98  b    99  c   100  d   101  e   102  f   103  g
     104  h   105  i   106  j   107  k   108  l   109  m   110  n   111  o
     112  p   113  q   114  r   115  s   116  t   117  u   118  v   119  w
     120  x   121  y   122  z

What Phred score is associated with the first few bases of the first sequence in our file?

You can find the full ASCII table in your Terminal by typing man ascii

Quality Control of fastq files

The easiest way to check the quality of one or more fastq files is using a program called FastQC. This program does also have a graphical interface but here we will be using it from the terminal.

Many command line programs have a built-in help function, that can usually be accessed like this:

fastqc -h

## 
##             FastQC - A high throughput sequence QC analysis tool
## 
## SYNOPSIS
## 
##  fastqc seqfile1 seqfile2 .. seqfileN
## 
##     fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
##            [-c contaminant file] seqfile1 .. seqfileN
## 
## DESCRIPTION
## 
##     FastQC reads a set of sequence files and produces from each one a quality
##     control report consisting of a number of different modules, each one of 
##     which will help to identify a different potential type of problem in your
##     data.
##     
##     If no files to process are specified on the command line then the program
##     will start as an interactive graphical application.  If files are provided
##     on the command line then the program will run with no user interaction
##     required.  In this mode it is suitable for inclusion into a standardised
##     analysis pipeline.

In many cases, we can simply use the default options (but do consider looking through the help of this and other programs once you are more familiar with what is going on). At a minimum, we specify what file we would like to analyse.

fastqc MODEK_100k.fq.gz

The fastqc program produces an html file that we can open with any web browser. If you were able to execute the previous command, you should be able to open the resulting file, possibly even with another command: open MODEK_100k_fastqc.html. If you are getting stuck here, you can have a look at the result you should be getting in the following link: MODEK_100k_fastqc.html

Have a look at the fastqc results and see if you can answer:

How many sequences were in our file? (can you counts these in the terminal?)
How long are each of our sequences?
Do you think the base quality is good enough?
What do you think about the base frequencies at each position in the sequences?
What do you interpret from the Adapter Content section?

Adapter removal

When we are trying to sequence molecules that are smaller than the sequencing reads that Illumina produces, we will naturally start to sequence part of the 3’ adapter. For example, a typical miRNA will be 22 bases long, and if the sequencing read is 50 bases, the last 28 bases of the read will contain the adapter sequence. If we are not careful with this, we can have a lot of trouble when we try to map or assemble our sequences.

Although there are programs that can try to guess the adapter sequence from a fastq file, it is always better to try to find the actual adapter sequence that was used by the sequencing facility. Throughout the years, different adapter sequences have been used, and in some cases extra barcodes have been added for different purposes. We need to find out what was done during library prep to make sure we can correctly remove any bases that do not correspond to the molecules we are interested in studying.

In the case of our experiment, a very standard protocol was used, with no additional barcode, and the standard Illumina Small RNA adapter:

TGGAATTCTCGGGTGCCAAGG

There are also several tools that can be used to remove adapter sequences, some of these are mentioned at the end of this document. In this pratical we will be using reaper which we developed specifically for dealing with sRNA-Seq data. Once you have it installed, you can have a look at the help page:

reaper -h

## DON'T PANIC
## 
## Required options
## -geom <mode>   mode in {no-bc, 3p-bc, 5p-bc}
## -meta <fname>  file with geometry-dependent format. Required columns:
##    Geometry    Columns:
##       no-bc          3p-ad     -       -      -    tabu
##       3p-bc          3p-ad  barcode  3p-si    -    tabu
##       5p-bc          3p-ad  barcode    -    5p-si  tabu
## bc=barcode, ad=adaptor, si=sequence insert
## Columns 3p-si, 5p-si, 3p-ad and tabu may all be empty
## Alternatively, to express absence, a single hyphen may be used
## 
## Important options
## -i <fname>        input stream (gzipped file allowed) (default STDIN)
## -clean-length <int> minimum allowed clean length (default 0)
## -guard <int>      protect first <int> bases in read from adapter and tabu matching
## -restrict <int>   only use the first <int> bases of adapter or tabu sequence (default 18)
##                   This is to avoid false positive matches
## -tri <threshold>  filter out reads with tri-nt score > threshold
##                   a reasonable <threshold> is 35
## -qqq-check  <val>/<winlen>  cut sequence when median quality drops below <val>
## -qqq-check  <val>/<winlen>/<winofs> as above, cut at <winofs> (default 0)
## -qqq-check  <val>/<winlen>/<winofs>/<readofs> as above, start at <readofs>
## -dust-suffix <threshold> dust theshold for read suffix (default 0, suggested 20)
## -nnn-check <count>/<outof> (default 0/0)
##       disregard read onwards from seeing <count> N's in <outof> bases
## 
## Alignment options
## Options to specify when part of an alignment triggers a match:
## -3p-global  l/e[/g[/o]]  (default 14/2/1/0)
## -3p-prefix  l/e[/g[/o]]  (default 8/2/0/2)
## -3p-barcode l/e[/g[/o]]  (default 0/6/1/0)
## -5p-barcode l/e[/g[/o]]  (default 0/0/0/2)
## -5p-sinsert l/e[/g[/o]]  (default 0/2/1/10)
## -mr-tabu    l/e[/g[/o]]  (default 16/2/1/0)
## -3p-head-to-tail l minimal trailing perfect match length (default 0)
##    syntax used in the above:
##       l  <int> minimum length required to count sub-alignment as match
##       e  <int> maximum allowed edit distance
##       g  <int> [optional, not active when set to 0] maximum allowed number of gaps
##       o  <int> [optional, not active when set to 0] offset:
##             o= 5  requires alignment to start in the first five bases of adaptor
##             o=-5  requires alignment to end in the last five bases of adaptor
## -swp M/S/G match/substitution/gap gain/cost/cost (default 4/1/3)
## 
## Input/output options
## --fasta-in      read FASTA input
## -record-format <format string> (record description, default @%I%A%n%R%n+%#%Q%n)
##    [ -record-format syntax is output when supplying --record-format ]
## -record-format2 <format string> (simple line formats, one field per line):
##       R  read
##       I  read identifier
##       Q  quality scores
##       D  discard field
## 
## -basename <pfx>   pfx.lint.gz, pfx.clean.gz pfx.report etc will be constructed
## -format-clean <format string> (output for clean reads)
## -format-lint <format string> (output for filtered reads)
##    -format-clean/lint specification syntax:
##       %R  read
##       %C  clean read
##       %Z  clean read padded with Ns if necessary
##       %V  reverse complement of clean read
##       %I  read identifier
##       %Q  clean or input read quality (for clean / lint file respectively)
##       %X  read count (only applicable if -record-format is used)
##       %Y  input read quality
##       %q<c>  clean input read quality padded with character <c>
##       %A  annotation field
##       %L  clean read length
##       %M  message describing cause for filtering (lint file)
##       %T  trinucleotide complexity score (clean/lint file)
##       %U  dUst sUffix complexity information
##       %3  best read/3p-adaptor alignment
##       %=  alignment characteristics
##             mt=matchtype
##             sc=suffix-complexity
##             ht=head-tail-match
##             nn=N-match-offset
##             bb=B-match-offset
##             aa=Polya-offset
##             qq=Quality-trim-offset
##       %n  newline
##       %J  record offset, unique for each read. Use to match paired-end reads
##       %f  fastq line number based on standard fastq format
##       %t  tab
##       %%  percent sign
##    Anything else is copied verbatim
## -debug [acl]+     a=alignments c=clean l=lint
## -sample c/l       if debug, sample every c/l clean/lint read
## --nozip           do not output gzipped files
## --noqc            do not output quality report files
## 
## Miscellaneous options
## --bcq-early       perform early 'B' quality filtering (when reading sequences)
## --bcq-late        perform late 'B' quality filtering (before outputting sequences)
## --full-length     only allow reads not shortened in any filter step
## --keep-all        delete rather than discard reads (e.g. tabu match, missing 5p-sinsert)
## -trim-length <int>     cut reads back to length <int>
## -polya <int>      remove trailing A's if length exceeds <int>
## -sc-max <f>       threshold for complexity of suffix following prefix match (default 0.25)
## 
## Options for use when running reaper with -geom no-bc
## -3pa <three prime adaptor>
## -5psi <five prime sequence insert>
## -tabu <tabu sequence>

The minimal parameters that we need to use are the “geometry” of the reads (do they have a barcode, and where) and the sequence of adapter(s).

The following diagram shows how reaper considers the geometry of a sequencing read (called “Message” below). There is more information in reaper’s manual.

         |  < - - - - - - - - - - - - Read geometry - - - - - - - - - - - - - ->
         |
Geometry |  <---------- 5p region --------->      Message  <---- 3p region ---->
---------+----------------------------------------------------------------------
 no-bc   |   < 5p-ad >                              M         [ 3p-ad ]
 3p-bc   |   < 5p-ad >                              M    ( 3p-si barcode 3p-ad )
 5p-bc   |   < 5p-ad > ( barcode ) [( 5p-si )]      M         [ 3p-ad ]

We can also specify the start of the output files that reaper will generate.

reaper -geom no-bc -3pa TGGAATTCTCGGGTGCCAAGG -i MODEK_100k.fq.gz -basename MODEK_100k

## adapter TGGAATTCTCGGGTGCCA
## ---
## mRpm   million reads per minute
## mNpm   million nucleotides per minute
## mCps   million alignment cells per second
## lint   total removed reads (per 10K), sum of columns to the left
## 25K reads per dot, 1M reads per line  seconds  mr mRpm mNpm mCps {error qc  low  len  NNN tabu nobc cflr  cfl lint   OK} per 10K
## ....
## [reaper] check 0 errors, 0 reads truncated, 100000 clean, 0 lint, 100000 total

If everything worked fine, there will be a series of files in your folder, starting with the basename you specified above:

ls MODEK_100k.*

## MODEK_100k.fq.gz
## MODEK_100k.lane.clean.gz
## MODEK_100k.lane.report.clean.len
## MODEK_100k.lane.report.clean.nt
## MODEK_100k.lane.report.input.nt
## MODEK_100k.lane.report.input.q
## MODEK_100k.lint.gz
## MODEK_100k.sumstat

This file in particular has useful information:

cat MODEK_100k.lane.report.clean.len

## length   count
## 0    0
## 1    0
## 2    0
## 3    0
## 4    0
## 5    0
## 6    0
## 7    0
## 8    0
## 9    0
## 10   0
## 11   0
## 12   0
## 13   0
## 14   0
## 15   0
## 16   0
## 17   0
## 18   2647
## 19   2722
## 20   3163
## 21   12221
## 22   59150
## 23   6127
## 24   1837
## 25   1028
## 26   893
## 27   881
## 28   886
## 29   1431
## 30   654
## 31   561
## 32   1421
## 33   1079
## 34   1066
## 35   634
## 36   336
## 37   234
## 38   198
## 39   237
## 40   149
## 41   89
## 42   88
## 43   0
## 44   0
## 45   0
## 46   0
## 47   0
## 48   0
## 49   0
## 50   268

And the final output file, with the trimmed sequences, is:

gzip -dc MODEK_100k.lane.clean.gz | head

## @D00261:386:CA8VYANXX:5:1312:6310:25644
## GTCTACGGCCATACCACCCTGAACGCGCCCGATCTCGT
## +
## CCCCCCGBGGGGGGGGGGGGGGGGGGGGGGG<=<EFGG
## @D00261:386:CA8VYANXX:5:1213:16997:8428
## CTAGACTGAGGCTCCTTGAGGA
## +
## BBBBBGGGGGGGGGGGGGGGGG
## @D00261:386:CA8VYANXX:5:1116:8133:95825
## TGAGGTAGTAGGTTGTATGGTT

By default, the output format is also fastq, although any output format can be obtained by using the -format-clean parameter.

For instace, to obtain a standard fasta format (discarding the quality lines), we can use the following:

reaper -geom no-bc -3pa TGGAATTCTCGGGTGCCAAGG -i MODEK_100k.fq.gz -basename MODEK_100k -format-clean '>%I%n%C%n'

## adapter TGGAATTCTCGGGTGCCA
## ---
## mRpm   million reads per minute
## mNpm   million nucleotides per minute
## mCps   million alignment cells per second
## lint   total removed reads (per 10K), sum of columns to the left
## 25K reads per dot, 1M reads per line  seconds  mr mRpm mNpm mCps {error qc  low  len  NNN tabu nobc cflr  cfl lint   OK} per 10K
## ....
## [reaper] check 0 errors, 0 reads truncated, 100000 clean, 0 lint, 100000 total

gzip -dc MODEK_100k.lane.clean.gz | head

## >D00261:386:CA8VYANXX:5:1312:6310:25644
## GTCTACGGCCATACCACCCTGAACGCGCCCGATCTCGT
## >D00261:386:CA8VYANXX:5:1213:16997:8428
## CTAGACTGAGGCTCCTTGAGGA
## >D00261:386:CA8VYANXX:5:1116:8133:95825
## TGAGGTAGTAGGTTGTATGGTT
## >D00261:386:CA8VYANXX:5:2101:4546:85660
## CGCAATGAAGGTGAAGGGCC
## >D00261:386:CA8VYANXX:5:2205:12879:71398
## GCTCTCTGTCCCGCCTGGTCCTG

Exercises:

Read in to R the output file from reaper: MODEK_100k.lane.report.clean.len. Generate a barplot to visualise the length distribution of the clean sequences.

Use the Biostrings package to import the clean sequences in fasta format MODEK_100k.lane.clean.gz. From these data try to generate a plot that is equivalent to the one you generated above.

Advanced: generate an equivalent plot to the last one, but using stacked bars that show with a different colour the fraction at each length that stats with each of the four nucleotides.

Final exercise

Download and unpack the file with all the sRNA-Seq results for this experiment. To make this exercise a bit faster, I prepared a file that contains only 100k sequences from each sample: sRNA_data_100k.tar.

Remember that to download a file you can also right-click on a link and copy/paste the link address into your command line and use a command like: wget http://full_copied_address/data/sRNA_data_100k.tar
To unpack a tar file you can use: tar -xvf sRNA_data_100k.tar

Run fastqc and reaper on each of the fastq file. If you already know how to write a script you can do so.
Modify your R program to generate the length distributions as one figure for each file.

Final tip: MultiQC

Basically any sequencing project will include multiple samples, and it becomes tedious to have to go through each fastqc output individually. Fortunately someone has already thought of this problem, and designed a program called MultiQC. It is very easy to use, and it generates a single report from multiple fastqc reports. It can also generate combined reports for many other sequencing-related tools.

To run this program, we simply need to specify the folder that contains the individual reports, and a folder for our new output. Something like this:

multiqc folder_with_reports -o folder_for_combined_report

You can have a look at what a MultiQC report should look like for our files, here: multiqc_report_100k.html