Linux: Processing raw sequencing fastq files


Experimental design

We will start analysing the data we generated with Amy Buck to study the interaction between the nematode Heligmosomoides bakeri and MODEK (intestinal epithelium) cells from the mouse Mus musculus. You can read more about this project in the paper Disentangling sRNA-Seq data to study RNA communication between species

We have three types of samples:

  1. MODEK_ctrl (untreated mouse cells)
  2. Pure_EV (purified worm EVs)
  3. MODEK_EV (mouse cells, treated with worm EVs)

We will use the first two types of sample to characterise the unique properties of sRNAs coming from these two different organisms, and the third sample to see if we can actually detect worm sRNAs inside of mouse cells.


Illumina sRNA-Seq data

Illumina produces the most common sequencing technology being used Today. Certain applications, for instance for de novo sequencing of large genomes, can greatly benefit from “third-generation” sequencing technology from PacBio or Oxford Nanopore, which can produce very long reads. But for sequencing small RNA, Illumina is still the ideal technology. Currently it can be used to produce 50-250nt reads, either paired-end reads or a single read. For sRNA-Seq, the shortest length and a single read is the best option.

As you may imagine, large computers may be required to properly analyse the shear volume of data generated by current sequencers. Nevertheless, there are many alternatives, and one option is always to select a small fraction of the data and analyse this first, on our personal computers. For this exercise, we will first start with a small part of one of the experiments, and later on we will return to the full dataset.

I always follow this strategy of generating a smaller dataset for any new project, even when I am working with much larger computers. For this practical I have already prepared the input files we will be using. They contain a random subset of 100,000 sequences from the original raw sequencing files.

The files should already be available to you. But if you are running this practical from another location, you should first download and unpack the file with all the sRNA-Seq results for this experiment: sRNA_data_100k.tar. To download a file you can also right-click on a link and copy/paste the link address into your command line and use a command like: wget http://full_copied_address/data/sRNA_data_100k.tar

As mentioned, you should already have the file, just make sure you can see it:

ls -l sRNA_data_100k.tar
-rw-r--r-- 1 gitpod gitpod 14028288 May 16 15:03 sRNA_data_100k.tar

To unpack the tar file you can use:

tar -xvf sRNA_data_100k.tar
MODEK_ctrl_r1_100k.fq.gz
MODEK_ctrl_r2_100k.fq.gz
MODEK_EV_r1_100k.fq.gz
MODEK_EV_r2_100k.fq.gz
Pure_EV_r1_100k.fq.gz
Pure_EV_r2_100k.fq.gz

In this practical we will be working with the Linux terminal. You should be able to work through this practical on any computer with a terminal like this, including Mac OS X (with the Terminal.app) and Windows 10 (with the Linux subsystem installed). A command line terminal is the ideal environment to work on large raw sequencing files. Also, many of the programs we will need only run on Linux.

Today, we will be working on the Linux environment provided through your Gitpod. In order to verify that you can see the first file we will be using, execute the following:

ls -l MODEK_ctrl_r1_100k.fq.gz
-rw-r--r-- 1 gitpod gitpod 2222164 May 13 14:26 MODEK_ctrl_r1_100k.fq.gz


The fastq format

Most output from sequencers nowadays can be found in a format called fastq. It is a plain text format, very similar to the fasta format you might already know about. The “q” at the end refers to “quality” and implies that for each sequenced position, there is a value of quality in addition to the nucleotide.

You might have not thought about this, but the process of sequencing is not perfect, and there is uncertainty associated to every base “call”. That is, with the information that the sequencer acquired (photographs with fluorescence signal in this case) a decision needs to be made as to which nucleotide better represents that information. So, a probability is calculated for each nucleotide, and the nucleotide with the best probability is reported. To avoid having to store very small probability values, they are usually reported as Phred quality scores. These scores are related logarithmically to the probabilities, as shown in the following table:

Phred quality Probability that the base is incorrect Base call accuracy
10 1 in 10 90 %
20 1 in 100 99 %
30 1 in 1000 99.9 %
40 1 in 10000 99.99 %

Now let’s have a look at our fastq file. You can use the following code in your Terminal to have a look at the first 4 lines of the file:

gzip -dc MODEK_ctrl_r1_100k.fq.gz | head -n 4
## @D00261:386:CA8VYANXX:5:1113:1299:65686 1:N:0:ATCACG
## TGAGGTAGTAGGTTGTATGGTTTGGAATTCTCGGGTGCCAAGGAACTCCA
## +
## CCBCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

The first line contains an identifier (starting with a @), and the next line the sequence. Then there is a 3rd line that starts with a + which can optionally contain the same identifier (or nothing, to save space). The 4th line contains a series of characters that represent the quality: each character corresponds to the quality of one sequenced base.

Since the real quality values can’t fit in one space, they are transformed into an ASCII character (a single “letter”). The first ASCII characters can’t be printed in a file (they include things like a “null” character, the “bell” sound and the “backspace”). So, a value of 33 is first added to the Phred score and then the corresponding ASCII character is selected. You can see what the decimal ASCII table looks like:

       0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
       8 bs     9 ht    10 nl    11 vt    12 np    13 cr    14 so    15 si
      16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
      24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
      32 sp    33  !    34  "    35  #    36  $    37  %    38  &    39  '
      40  (    41  )    42  *    43  +    44  ,    45  -    46  .    47  /
      48  0    49  1    50  2    51  3    52  4    53  5    54  6    55  7
      56  8    57  9    58  :    59  ;    60  <    61  =    62  >    63  ?
      64  @    65  A    66  B    67  C    68  D    69  E    70  F    71  G
      72  H    73  I    74  J    75  K    76  L    77  M    78  N    79  O
      80  P    81  Q    82  R    83  S    84  T    85  U    86  V    87  W
      88  X    89  Y    90  Z    91  [    92  \    93  ]    94  ^    95  _
      96  `    97  a    98  b    99  c   100  d   101  e   102  f   103  g
     104  h   105  i   106  j   107  k   108  l   109  m   110  n   111  o
     112  p   113  q   114  r   115  s   116  t   117  u   118  v   119  w
     120  x   121  y   122  z
  • What Phred score is associated with the first few bases of the first sequence in our file?

You can find the full ASCII table in your Terminal by typing man ascii


Quality Control of fastq files

The easiest way to check the quality of one or more fastq files is using a program called FastQC. This program does also have a graphical interface but here we will be using it from the terminal.

Many command line programs have a built-in help function, that can usually be accessed like this:

fastqc -h
## 
##             FastQC - A high throughput sequence QC analysis tool
## 
## SYNOPSIS
## 
##  fastqc seqfile1 seqfile2 .. seqfileN
## 
##     fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
##            [-c contaminant file] seqfile1 .. seqfileN
## 
## DESCRIPTION
## 
##     FastQC reads a set of sequence files and produces from each one a quality
##     control report consisting of a number of different modules, each one of 
##     which will help to identify a different potential type of problem in your
##     data.
##     
##     If no files to process are specified on the command line then the program
##     will start as an interactive graphical application.  If files are provided
##     on the command line then the program will run with no user interaction
##     required.  In this mode it is suitable for inclusion into a standardised
##     analysis pipeline.

In many cases, we can simply use the default options (but do consider looking through the help of this and other programs once you are more familiar with what is going on). At a minimum, we specify what file we would like to analyse.

fastqc MODEK_ctrl_r1_100k.fq.gz


The fastqc program produces an html file that we can open with any web browser. If you were able to execute the previous command, you should be able to open the resulting file, possibly even with another command: open MODEK_ctrl_r1_100k_fastqc.html. If you are getting stuck here, you can have a look at the result you should be getting in the following link: MODEK_ctrl_r1_100k_fastqc.html


Have a look at the fastqc results and see if you can answer:

  • How many sequences were in our file? (can you counts these in the terminal?)
  • How long are each of our sequences?
  • Do you think the base quality is good enough?
  • What do you think about the base frequencies at each position in the sequences?
  • What do you interpret from the Adapter Content section?


Quality Control of multiple fastq files

One of the nice things about Linux, is that if we can execute a command for a single file, we can also execute the same command for any number of files without much additional work. We have been looking at one fastq file, but we actually have six:

ls -l *.fq.gz
## -rw-r--r--  1 cei  staff  2222164 May 13 15:26 MODEK_ctrl_r1_100k.fq.gz
## -rw-r--r--  1 cei  staff  2200709 May 13 15:26 MODEK_ctrl_r2_100k.fq.gz
## -rw-r--r--  1 cei  staff  2570273 May 13 15:26 MODEK_EV_r1_100k.fq.gz
## -rw-r--r--  1 cei  staff  2413050 May 13 15:26 MODEK_EV_r2_100k.fq.gz
## -rw-r--r--  1 cei  staff  2294963 May 13 15:26 Pure_EV_r1_100k.fq.gz
## -rw-r--r--  1 cei  staff  2320936 May 13 15:26 Pure_EV_r2_100k.fq.gz

Most Linux commands allow execution against multiple files simply by using the wildcard *. So let’s run fastqc on all our files. Since we want to keep our files organised, let’s make a folder for these reports.

mkdir sRNA_data_100k.fastqc

fastqc -o sRNA_data_100k.fastqc *.fq.gz

If that worked correctly, we can check the contents of the output folder:

ls -l sRNA_data_100k.fastqc
## total 20192
## -rw-r--r--  1 cei  staff  930443 May 13 16:32 MODEK_ctrl_r1_100k_fastqc.html
## -rw-r--r--  1 cei  staff  969330 May 13 16:32 MODEK_ctrl_r1_100k_fastqc.zip
## -rw-r--r--  1 cei  staff  929975 May 13 16:32 MODEK_ctrl_r2_100k_fastqc.html
## -rw-r--r--  1 cei  staff  968865 May 13 16:32 MODEK_ctrl_r2_100k_fastqc.zip
## -rw-r--r--  1 cei  staff  797146 May 13 16:32 MODEK_EV_r1_100k_fastqc.html
## -rw-r--r--  1 cei  staff  737496 May 13 16:32 MODEK_EV_r1_100k_fastqc.zip
## -rw-r--r--  1 cei  staff  877006 May 13 16:32 MODEK_EV_r2_100k_fastqc.html
## -rw-r--r--  1 cei  staff  878764 May 13 16:32 MODEK_EV_r2_100k_fastqc.zip
## -rw-r--r--  1 cei  staff  827686 May 13 16:32 Pure_EV_r1_100k_fastqc.html
## -rw-r--r--  1 cei  staff  789642 May 13 16:32 Pure_EV_r1_100k_fastqc.zip
## -rw-r--r--  1 cei  staff  821462 May 13 16:32 Pure_EV_r2_100k_fastqc.html
## -rw-r--r--  1 cei  staff  782195 May 13 16:32 Pure_EV_r2_100k_fastqc.zip


MultiQC: combining multiple FastQC (and other) reports

Basically any sequencing project will include multiple samples, and it becomes tedious to have to go through each fastqc output individually. Fortunately someone has already thought of this problem, and designed a program called MultiQC. It is very easy to use, and it generates a single report from multiple fastqc reports. It can also generate combined reports for many other sequencing-related tools.

To run this program, we simply need to specify the folder that contains the individual reports, and a folder for our new output. Like this:

multiqc sRNA_data_100k.fastqc -o sRNA_data_100k.multiqc

You can have a look at what a MultiQC report will look like for our six fastqc files, here: multiqc_report_100k.html


Adapter removal (using Cutadapt)

When we are trying to sequence molecules that are smaller than the sequencing reads that Illumina produces, we will naturally start to sequence part of the 3’ adapter. For example, a typical miRNA will be 22 bases long, and if the sequencing read is 50 bases, the last 28 bases of the read will contain the adapter sequence. If we are not careful with this, we can have a lot of trouble when we try to map or assemble our sequences.

Although there are programs that can try to guess the adapter sequence from a fastq file, it is always better to try to find the actual adapter sequence that was used by the sequencing facility. Throughout the years, different adapter sequences have been used, and in some cases extra barcodes have been added for different purposes. We need to find out what was done during library prep to make sure we can correctly remove any bases that do not correspond to the molecules we are interested in studying.

In the case of our experiment, a very standard protocol was used, with no additional barcode, and the standard Illumina Small RNA adapter was ligated at the 3’ end of each RNA molecule:

TGGAATTCTCGGGTGCCAAGG

There are also several tools that can be used to remove adapter sequences, some of these are mentioned at the end of this document. In this practical we will be using cutadapt which is one of the most popular and versatile tools out there for this purpose.

The authors of Cutadapt have used the following image to illustrate some of the types of adapter removal that their tool can accomplish:

The Read in white represents our actual RNA molecule. The green rectangles are the kinds of adapter that can be included during library prep. In our case, we only have the 3’ Adapter present.

You can have a look at the help page for cutadapt:

cutadapt -h

The options are quite extensive, and at some point you might want to explore them more carefully (alongside the user guide). For now we’ll focus on a few of the most relevant options for sRNAs.

cutadapt version 5.0
Copyright (C) 2010 Marcel Martin <marcel.martin@scilifelab.se> and contributors
Cutadapt removes adapter sequences from high-throughput sequencing reads.

Usage:
    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
...

Options:
  -h, --help            Show this help message and exit

  -a ADAPTER, --adapter ADAPTER
                        Sequence of an adapter ligated to the 3' end (paired data: of the first read).
                        The adapter and subsequent bases are trimmed. If a '$' character is appended
                        ('anchoring'), the adapter is only found if it is a suffix of the read.

  -u LEN, --cut LEN     Remove LEN bases from each read (or R1 if paired; use -U option for R2). If LEN
                        is positive, remove bases from the beginning. If LEN is negative, remove bases
                        from the end. Can be used twice if LENs have different signs. Applied *before*
                        adapter trimming.

  --poly-a              Trim poly-A tails
  --length LENGTH, -l LENGTH
                        Shorten reads to LENGTH. Positive values remove bases at the end while negative
                        ones remove bases at the beginning. This and the following modifications are
                        applied after adapter trimming.
  --trim-n              Trim N's on ends of reads.
                        
  -m LEN[:LEN2], --minimum-length LEN[:LEN2]
                        Discard reads shorter than LEN. Default: 0
  -M LEN[:LEN2], --maximum-length LEN[:LEN2]
                        Discard reads longer than LEN. Default: no limit
  --max-n COUNT         Discard reads with more than COUNT 'N' bases. If COUNT is a number between 0 and
                        1, it is interpreted as a fraction of the read length.

  -o FILE, --output FILE
                        Write trimmed reads to FILE. FASTQ or FASTA format is chosen depending on input.
                        Summary report is sent to standard output. Use '{name}' for demultiplexing (see
                        docs). Default: write to standard output
  --fasta               Output FASTA to standard output even on FASTQ input.

  --too-short-output FILE
                        Write reads that are too short (according to length specified by -m) to FILE.
                        Default: discard reads
  --too-long-output FILE
                        Write reads that are too long (according to length specified by -M) to FILE.
                        Default: discard reads
  --untrimmed-output FILE
                        Write reads that do not contain any adapter to FILE. Default: output to same
                        file as trimmed reads

At the very least we need to provide the sequence of the adapter, but it is quite frequent that we also want to perform some filtering of the trimmed sequences. If we are interested in small RNAs that can be bound by an Argonaute protein, this probably means a certain range of sequence lengths interests us, such as 20-24. But we might want to include a wider range initially, to be able to explore the quality of our sequences in more detail. In case we do decide to exclude trimmed sequences of certain lengths, we can always save them to separate files and explore later.


A quick check for adapter presence

We can perform a simple check to see if we have the correct adapter sequence. We can manually search for the first 8-12 nucleotides of the adapter using grep. If in doubt, we can try a couple of different adapter sequences.

Try out the following command and see if the result makes sense to you:

gzip -dc MODEK_ctrl_r1_100k.fq.gz | grep TGGAATTCTC

Grep will show only the lines matching the pattern (so only sequences which contain this part of the adapter will be returned). By default the matching part of the text will be coloured in red, and this helps you see if the adapter seems to occur at a certain position of the read. If most of our sequences consist of small RNAs of ~22 nucleotides, we would expect the adapter to start just after that position.


Running cutadapt

The next line of code will run cutadapt on our fastq file, searching for and removing our known adapter sequence. We are also telling cutadapt to discard sequences with any ‘N’ and only keep trimmed sequences between 18 and 40 nt. We prefer the output to be in plain fasta format, since we won’t be using the quality scores any more (we’ve already checked, and they seem fine). Finally, we are specifying the raw fastq input file.

Using the Linux pipe to control what happens to the output is a frequently used trick. For example, we can compress the output, or we might want to first transform the output using another command. (This is also the way to ensure that the --fasta argument works.)

cutadapt -a TGGAATTCTCGGGTGCCAAGG --max-n 0 --minimum-length 18 --maximum-length 40 --fasta MODEK_ctrl_r1_100k.fq.gz | gzip -c > MODEK_ctrl_r1_100k.trimmed.fa.gz
## This is cutadapt 4.0 with Python 3.10.9
## Command line parameters: -a TGGAATTCTCGGGTGCCAAGG --max-n 0 --minimum-length 18 --maximum-length 40 --fasta MODEK_ctrl_r1_100k.fq.gz
## Processing single-end reads on 1 core ...
## Finished in 0.49 s (5 µs/read; 12.27 M reads/minute).
## 
## === Summary ===
## 
## Total reads processed:                 100,000
## Reads with adapters:                    99,784 (99.8%)
## 
## == Read fate breakdown ==
## Reads that were too short:                   0 (0.0%)
## Reads that were too long:                  710 (0.7%)
## Reads with too many N:                       0 (0.0%)
## Reads written (passing filters):        99,290 (99.3%)
## 
## Total basepairs processed:     5,000,000 bp
## Total written (filtered):      2,258,742 bp (45.2%)
## 
## === Adapter 1 ===
## 
## Sequence: TGGAATTCTCGGGTGCCAAGG; Type: regular 3'; Length: 21; Trimmed: 99784 times
## 
## Minimum overlap: 3
## No. of allowed errors:
## 1-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2
## 
## Bases preceding removed adapters:
##   A: 6.6%
##   C: 8.2%
##   G: 11.1%
##   T: 74.1%
##   none/other: 0.0%
## 
## Overview of removed sequences
## length   count   expect  max.err error counts
## 3    30  1562.5  0   30
## 4    35  390.6   0   35
## 5    73  97.7    0   73
## 6    109 24.4    0   109
## 7    83  6.1 0   83
## 8    83  1.5 0   83
## 9    81  0.4 0   81
## 10   190 0.1 1   189 1
## 11   217 0.0 1   214 3
## 12   200 0.0 1   198 2
## 13   256 0.0 1   253 3
## 14   319 0.0 1   309 10
## 15   675 0.0 1   656 19
## 16   1064    0.0 1   1032 32
## 17   1076    0.0 1   1042 34
## 18   1450    0.0 1   1397 53
## 19   597 0.0 1   567 30
## 20   611 0.0 2   579 30 2
## 21   1348    0.0 2   1284 61 3
## 22   889 0.0 2   836 51 2
## 23   823 0.0 2   757 64 2
## 24   968 0.0 2   884 81 3
## 25   1048    0.0 2   989 54 5
## 26   1864    0.0 2   1744 113 7
## 27   6090    0.0 2   5739 329 22
## 28   58664   0.0 2   55614 2866 184
## 29   12175   0.0 2   11425 689 61
## 30   3168    0.0 2   2957 195 16
## 31   2791    0.0 2   2613 159 19
## 32   2807    0.0 2   2649 149 9

If everything worked fine, there will be a new file in your folder, according to the name you specified above:

ls MODEK_ctrl_r1_100k*
## MODEK_ctrl_r1_100k.fq.gz
## MODEK_ctrl_r1_100k.trimmed.fa.gz

The summary output also has some useful information, and you might want to check that it makes sense based on what you expected from your data. It might not always be the case that we find such a high fraction of reads with the adapter:

## Total reads processed:                 100,000
## Reads with adapters:                    99,784 (99.8%)

An extremely frequent base before the removed adapters might suggest your adapter is incomplete? But this could also be due to a bias of the RNA ligase enzyme, or even due to a real over-represented sequence in your sample.

## Bases preceding removed adapters:
##   A: 6.6%
##   C: 8.2%
##   G: 11.1%
##   T: 74.1%
##   none/other: 0.0%

It’s always useful to have a quick check of the final output file with the trimmed sequences:

gzip -dc MODEK_ctrl_r1_100k.trimmed.fa.gz | head
## >D00261:386:CA8VYANXX:5:1113:1299:65686 1:N:0:ATCACG
## TGAGGTAGTAGGTTGTATGGTT
## >D00261:386:CA8VYANXX:5:1309:9208:35751 1:N:0:ATCACG
## TGAGGTAGTAGGTTGTATAGTT
## >D00261:386:CA8VYANXX:5:2104:7527:79110 1:N:0:ATCACG
## TGAGGTAGTAGGTTGTATAGTT
## >D00261:386:CA8VYANXX:5:1202:2605:61268 1:N:0:ATCACG
## AGAAGACGGTCGAACTTGACTATC
## >D00261:386:CA8VYANXX:5:1205:2512:75405 1:N:0:ATCACG
## TGAGGTAGTAGGTTGTGTGGT


Exploring the length of sequenced fragments

If our sample did contain mostly small RNAs, and we have successfully removed the 3’ adapter, the remaining sequences should reflect the length distribution of the small RNAs in our sample. We can explore this using R.

Let’s see how far you can get on the following exercises. Use any type of resource for help: your neighbor, Google or an AI, directly ask any of the trainers. We have also included some hints if you just want to try to work it out on your own.

Exercises:

  • Use the Biostrings package to import the trimmed sequences in fasta format. Start with just one file: MODEK_ctrl_r1_100k.trimmed.fa.gz.
  • Generate a histogram to visualise the length distribution of the trimmed sequences.
  • Hints: functions you might need are readDNAStringSet, width, hist

The result should look something like this:

  • Advanced: generate an equivalent plot to the last one, but using stacked bars that show with a different colour the fraction at each length that starts with each of the four nucleotides.
  • Hints: functions you might need are nucleotideFrequencyAt, split, width, barplot. You will also need to use for loops or apply functions (sapply might be the one to use here).


Important: discuss what we can learn from these plots.


Final exercise

You should already have the six files corresponding to this experiment. If you need to download them again, this is the link: sRNA_data_100k.tar.

  1. Run cutadapt on each of the fastq files. If you already know how to write a bash script you can do so. Cutadapt contains an example of this: many-samples recipe
  2. Modify your R program to generate the length distributions as one figure for each file.

The result should look something like this: