We will start analysing the data we generated with Amy Buck to study the interaction between the nematode Heligmosomoides bakeri and MODEK (intestinal epithelium) cells from the mouse Mus musculus. You can read more about this project in the paper Disentangling sRNA-Seq data to study RNA communication between species
Illumina produces the most common sequencing technology being used Today. Certain applications, for instance for de novo sequencing of large genomes, can greatly benefit from “third-generation” sequencing technology from PacBio or Oxford Nanopore, which can produce very long reads. But for sequencing small RNA, Illumina is still the ideal technology. Currently it can be used to produce 50-250nt reads, either paired-end reads or a single read. For sRNA-Seq, the shortest length and a single read is the best option.
As you may imagine, large computers may be required to properly analyse the shear volume of data generated by current sequencers. Nevertheless, there are many alternatives, and one option is always to select a small fraction of the data and analyse this first, on our personal computers. For this exercise, we will first start with a small part of one of the experiments, and later on we will return to the full dataset.
The file containing the reads we will be using can be accessed here: MODEK_100k.fq.gz. Please download the file and save it in your working directory.
In this practical we will be working with the Linux terminal. You should be able to work through this practical on any computer with a terminal like this, including Mac OS X (with the Terminal.app) and Windows 10 (with the Linux subsystem installed). A command line terminal is the ideal environment to work on large raw sequencing files. Also, many of the programs we will need only run on Linux.
Most output from sequencers nowadays can be found in a format called fastq. It is a plain text format, very similar to the fasta format you might already know about. The “q” at the end refers to “quality” and implies that for each sequenced position, there is a value of quality in addition to the nucleotide.
You might have not thought about this, but the process of sequencing is not perfect, and there is uncertainty associated to every base “call”. That is, with the information that the sequencer acquired (photographs with fluorescence signal in this case) a decision needs to be made as to which nucleotide better represents that information. So, a probability is calculated for each nucleotide, and the nucleotide with the best probability is reported. To avoid having to store very small probability values, they are usually reported as Phred quality scores. These scores are related logarithmically to the probabilities, as shown in the following table:
| Phred quality | Probability that the base is incorrect | Base call accuracy |
|---|---|---|
| 10 | 1 in 10 | 90 % |
| 20 | 1 in 100 | 99 % |
| 30 | 1 in 1000 | 99.9 % |
| 40 | 1 in 10000 | 99.99 % |
Now let’s have a look at our fastq file. You can use the following code in your Terminal to have a look at the first 4 lines of the file:
gzip -dc MODEK_100k.fq.gz | head -n 4## @D00261:386:CA8VYANXX:5:1312:6310:25644 1:N:0:ATCACG
## GTCTACGGCCATACCACCCTGAACGCGCCCGATCTCGTTGGAATTCTCGG
## +
## CCCCCCGBGGGGGGGGGGGGGGGGGGGGGGG<=<EFGGGGGGGGGGGGGG
The first line contains an identifier (starting with a @), and the next line the sequence. Then there is a 3rd line that starts with a + which can optionally contain the same identifier (or nothing, to save space). The 4th line contains a series of characters that represent the quality: each character corresponds to the quality of one sequenced base.
Since the real quality values can’t fit in one space, they are transformed into an ASCII character (a single “letter”). The first ASCII characters can’t be printed in a file (they include things like a “null” character, the “bell” sound and the “backspace”). So, a value of 33 is first added to the Phred score and then the corresponding ASCII character is selected. You can see what the decimal ASCII table looks like:
0 nul 1 soh 2 stx 3 etx 4 eot 5 enq 6 ack 7 bel
8 bs 9 ht 10 nl 11 vt 12 np 13 cr 14 so 15 si
16 dle 17 dc1 18 dc2 19 dc3 20 dc4 21 nak 22 syn 23 etb
24 can 25 em 26 sub 27 esc 28 fs 29 gs 30 rs 31 us
32 sp 33 ! 34 " 35 # 36 $ 37 % 38 & 39 '
40 ( 41 ) 42 * 43 + 44 , 45 - 46 . 47 /
48 0 49 1 50 2 51 3 52 4 53 5 54 6 55 7
56 8 57 9 58 : 59 ; 60 < 61 = 62 > 63 ?
64 @ 65 A 66 B 67 C 68 D 69 E 70 F 71 G
72 H 73 I 74 J 75 K 76 L 77 M 78 N 79 O
80 P 81 Q 82 R 83 S 84 T 85 U 86 V 87 W
88 X 89 Y 90 Z 91 [ 92 \ 93 ] 94 ^ 95 _
96 ` 97 a 98 b 99 c 100 d 101 e 102 f 103 g
104 h 105 i 106 j 107 k 108 l 109 m 110 n 111 o
112 p 113 q 114 r 115 s 116 t 117 u 118 v 119 w
120 x 121 y 122 z
You can find the full ASCII table in your Terminal by typing
man ascii
The easiest way to check the quality of one or more fastq files is using a program called FastQC. This program does also have a graphical interface but here we will be using it from the terminal.
Many command line programs have a built-in help function, that can usually be accessed like this:
fastqc -h##
## FastQC - A high throughput sequence QC analysis tool
##
## SYNOPSIS
##
## fastqc seqfile1 seqfile2 .. seqfileN
##
## fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
## [-c contaminant file] seqfile1 .. seqfileN
##
## DESCRIPTION
##
## FastQC reads a set of sequence files and produces from each one a quality
## control report consisting of a number of different modules, each one of
## which will help to identify a different potential type of problem in your
## data.
##
## If no files to process are specified on the command line then the program
## will start as an interactive graphical application. If files are provided
## on the command line then the program will run with no user interaction
## required. In this mode it is suitable for inclusion into a standardised
## analysis pipeline.
In many cases, we can simply use the default options (but do consider looking through the help of this and other programs once you are more familiar with what is going on). At a minimum, we specify what file we would like to analyse.
fastqc MODEK_100k.fq.gzThe fastqc program produces an html file that we can open with any web browser. If you were able to execute the previous command, you should be able to open the resulting file, possibly even with another command: open MODEK_100k_fastqc.html. If you are getting stuck here, you can have a look at the result you should be getting in the following link: MODEK_100k_fastqc.html
When we are trying to sequence molecules that are smaller than the sequencing reads that Illumina produces, we will naturally start to sequence part of the 3’ adapter. For example, a typical miRNA will be 22 bases long, and if the sequencing read is 50 bases, the last 28 bases of the read will contain the adapter sequence. If we are not careful with this, we can have a lot of trouble when we try to map or assemble our sequences.
Although there are programs that can try to guess the adapter sequence from a fastq file, it is always better to try to find the actual adapter sequence that was used by the sequencing facility. Throughout the years, different adapter sequences have been used, and in some cases extra barcodes have been added for different purposes. We need to find out what was done during library prep to make sure we can correctly remove any bases that do not correspond to the molecules we are interested in studying.
In the case of our experiment, a very standard protocol was used, with no additional barcode, and the standard Illumina Small RNA adapter:
TGGAATTCTCGGGTGCCAAGG
There are also several tools that can be used to remove adapter sequences, some of these are mentioned at the end of this document. In this pratical we will be using reaper which we developed specifically for dealing with sRNA-Seq data. Once you have it installed, you can have a look at the help page:
reaper -h## DON'T PANIC
##
## Required options
## -geom <mode> mode in {no-bc, 3p-bc, 5p-bc}
## -meta <fname> file with geometry-dependent format. Required columns:
## Geometry Columns:
## no-bc 3p-ad - - - tabu
## 3p-bc 3p-ad barcode 3p-si - tabu
## 5p-bc 3p-ad barcode - 5p-si tabu
## bc=barcode, ad=adaptor, si=sequence insert
## Columns 3p-si, 5p-si, 3p-ad and tabu may all be empty
## Alternatively, to express absence, a single hyphen may be used
##
## Important options
## -i <fname> input stream (gzipped file allowed) (default STDIN)
## -clean-length <int> minimum allowed clean length (default 0)
## -guard <int> protect first <int> bases in read from adapter and tabu matching
## -restrict <int> only use the first <int> bases of adapter or tabu sequence (default 18)
## This is to avoid false positive matches
## -tri <threshold> filter out reads with tri-nt score > threshold
## a reasonable <threshold> is 35
## -qqq-check <val>/<winlen> cut sequence when median quality drops below <val>
## -qqq-check <val>/<winlen>/<winofs> as above, cut at <winofs> (default 0)
## -qqq-check <val>/<winlen>/<winofs>/<readofs> as above, start at <readofs>
## -dust-suffix <threshold> dust theshold for read suffix (default 0, suggested 20)
## -nnn-check <count>/<outof> (default 0/0)
## disregard read onwards from seeing <count> N's in <outof> bases
##
## Alignment options
## Options to specify when part of an alignment triggers a match:
## -3p-global l/e[/g[/o]] (default 14/2/1/0)
## -3p-prefix l/e[/g[/o]] (default 8/2/0/2)
## -3p-barcode l/e[/g[/o]] (default 0/6/1/0)
## -5p-barcode l/e[/g[/o]] (default 0/0/0/2)
## -5p-sinsert l/e[/g[/o]] (default 0/2/1/10)
## -mr-tabu l/e[/g[/o]] (default 16/2/1/0)
## -3p-head-to-tail l minimal trailing perfect match length (default 0)
## syntax used in the above:
## l <int> minimum length required to count sub-alignment as match
## e <int> maximum allowed edit distance
## g <int> [optional, not active when set to 0] maximum allowed number of gaps
## o <int> [optional, not active when set to 0] offset:
## o= 5 requires alignment to start in the first five bases of adaptor
## o=-5 requires alignment to end in the last five bases of adaptor
## -swp M/S/G match/substitution/gap gain/cost/cost (default 4/1/3)
##
## Input/output options
## --fasta-in read FASTA input
## -record-format <format string> (record description, default @%I%A%n%R%n+%#%Q%n)
## [ -record-format syntax is output when supplying --record-format ]
## -record-format2 <format string> (simple line formats, one field per line):
## R read
## I read identifier
## Q quality scores
## D discard field
##
## -basename <pfx> pfx.lint.gz, pfx.clean.gz pfx.report etc will be constructed
## -format-clean <format string> (output for clean reads)
## -format-lint <format string> (output for filtered reads)
## -format-clean/lint specification syntax:
## %R read
## %C clean read
## %Z clean read padded with Ns if necessary
## %V reverse complement of clean read
## %I read identifier
## %Q clean or input read quality (for clean / lint file respectively)
## %X read count (only applicable if -record-format is used)
## %Y input read quality
## %q<c> clean input read quality padded with character <c>
## %A annotation field
## %L clean read length
## %M message describing cause for filtering (lint file)
## %T trinucleotide complexity score (clean/lint file)
## %U dUst sUffix complexity information
## %3 best read/3p-adaptor alignment
## %= alignment characteristics
## mt=matchtype
## sc=suffix-complexity
## ht=head-tail-match
## nn=N-match-offset
## bb=B-match-offset
## aa=Polya-offset
## qq=Quality-trim-offset
## %n newline
## %J record offset, unique for each read. Use to match paired-end reads
## %f fastq line number based on standard fastq format
## %t tab
## %% percent sign
## Anything else is copied verbatim
## -debug [acl]+ a=alignments c=clean l=lint
## -sample c/l if debug, sample every c/l clean/lint read
## --nozip do not output gzipped files
## --noqc do not output quality report files
##
## Miscellaneous options
## --bcq-early perform early 'B' quality filtering (when reading sequences)
## --bcq-late perform late 'B' quality filtering (before outputting sequences)
## --full-length only allow reads not shortened in any filter step
## --keep-all delete rather than discard reads (e.g. tabu match, missing 5p-sinsert)
## -trim-length <int> cut reads back to length <int>
## -polya <int> remove trailing A's if length exceeds <int>
## -sc-max <f> threshold for complexity of suffix following prefix match (default 0.25)
##
## Options for use when running reaper with -geom no-bc
## -3pa <three prime adaptor>
## -5psi <five prime sequence insert>
## -tabu <tabu sequence>
The minimal parameters that we need to use are the “geometry” of the reads (do they have a barcode, and where) and the sequence of adapter(s).
The following diagram shows how reaper considers the geometry of a sequencing read (called “Message” below). There is more information in reaper’s manual.
| < - - - - - - - - - - - - Read geometry - - - - - - - - - - - - - ->
|
Geometry | <---------- 5p region ---------> Message <---- 3p region ---->
---------+----------------------------------------------------------------------
no-bc | < 5p-ad > M [ 3p-ad ]
3p-bc | < 5p-ad > M ( 3p-si barcode 3p-ad )
5p-bc | < 5p-ad > ( barcode ) [( 5p-si )] M [ 3p-ad ]
We can also specify the start of the output files that reaper will generate.
reaper -geom no-bc -3pa TGGAATTCTCGGGTGCCAAGG -i MODEK_100k.fq.gz -basename MODEK_100k## adapter TGGAATTCTCGGGTGCCA
## ---
## mRpm million reads per minute
## mNpm million nucleotides per minute
## mCps million alignment cells per second
## lint total removed reads (per 10K), sum of columns to the left
## 25K reads per dot, 1M reads per line seconds mr mRpm mNpm mCps {error qc low len NNN tabu nobc cflr cfl lint OK} per 10K
## ....
## [reaper] check 0 errors, 0 reads truncated, 100000 clean, 0 lint, 100000 total
If everything worked fine, there will be a series of files in your folder, starting with the basename you specified above:
ls MODEK_100k.*## MODEK_100k.fq.gz
## MODEK_100k.lane.clean.gz
## MODEK_100k.lane.report.clean.len
## MODEK_100k.lane.report.clean.nt
## MODEK_100k.lane.report.input.nt
## MODEK_100k.lane.report.input.q
## MODEK_100k.lint.gz
## MODEK_100k.sumstat
This file in particular has useful information:
cat MODEK_100k.lane.report.clean.len## length count
## 0 0
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
## 11 0
## 12 0
## 13 0
## 14 0
## 15 0
## 16 0
## 17 0
## 18 2647
## 19 2722
## 20 3163
## 21 12221
## 22 59150
## 23 6127
## 24 1837
## 25 1028
## 26 893
## 27 881
## 28 886
## 29 1431
## 30 654
## 31 561
## 32 1421
## 33 1079
## 34 1066
## 35 634
## 36 336
## 37 234
## 38 198
## 39 237
## 40 149
## 41 89
## 42 88
## 43 0
## 44 0
## 45 0
## 46 0
## 47 0
## 48 0
## 49 0
## 50 268
And the final output file, with the trimmed sequences, is:
gzip -dc MODEK_100k.lane.clean.gz | head## @D00261:386:CA8VYANXX:5:1312:6310:25644
## GTCTACGGCCATACCACCCTGAACGCGCCCGATCTCGT
## +
## CCCCCCGBGGGGGGGGGGGGGGGGGGGGGGG<=<EFGG
## @D00261:386:CA8VYANXX:5:1213:16997:8428
## CTAGACTGAGGCTCCTTGAGGA
## +
## BBBBBGGGGGGGGGGGGGGGGG
## @D00261:386:CA8VYANXX:5:1116:8133:95825
## TGAGGTAGTAGGTTGTATGGTT
By default, the output format is also fastq, although any output format can be obtained by using the -format-clean parameter.
For instace, to obtain a standard fasta format (discarding the quality lines), we can use the following:
reaper -geom no-bc -3pa TGGAATTCTCGGGTGCCAAGG -i MODEK_100k.fq.gz -basename MODEK_100k -format-clean '>%I%n%C%n'## adapter TGGAATTCTCGGGTGCCA
## ---
## mRpm million reads per minute
## mNpm million nucleotides per minute
## mCps million alignment cells per second
## lint total removed reads (per 10K), sum of columns to the left
## 25K reads per dot, 1M reads per line seconds mr mRpm mNpm mCps {error qc low len NNN tabu nobc cflr cfl lint OK} per 10K
## ....
## [reaper] check 0 errors, 0 reads truncated, 100000 clean, 0 lint, 100000 total
gzip -dc MODEK_100k.lane.clean.gz | head## >D00261:386:CA8VYANXX:5:1312:6310:25644
## GTCTACGGCCATACCACCCTGAACGCGCCCGATCTCGT
## >D00261:386:CA8VYANXX:5:1213:16997:8428
## CTAGACTGAGGCTCCTTGAGGA
## >D00261:386:CA8VYANXX:5:1116:8133:95825
## TGAGGTAGTAGGTTGTATGGTT
## >D00261:386:CA8VYANXX:5:2101:4546:85660
## CGCAATGAAGGTGAAGGGCC
## >D00261:386:CA8VYANXX:5:2205:12879:71398
## GCTCTCTGTCCCGCCTGGTCCTG
Biostrings package to import the clean sequences in fasta format MODEK_100k.lane.clean.gz. From these data try to generate a plot that is equivalent to the one you generated above.wget http://full_copied_address/data/sRNA_data_100k.tartar -xvf sRNA_data_100k.tarRun fastqc and reaper on each of the fastq file. If you already know how to write a script you can do so.
Modify your R program to generate the length distributions as one figure for each file.
Basically any sequencing project will include multiple samples, and it becomes tedious to have to go through each fastqc output individually. Fortunately someone has already thought of this problem, and designed a program called MultiQC. It is very easy to use, and it generates a single report from multiple fastqc reports. It can also generate combined reports for many other sequencing-related tools.
To run this program, we simply need to specify the folder that contains the individual reports, and a folder for our new output. Something like this:
multiqc folder_with_reports -o folder_for_combined_reportYou can have a look at what a MultiQC report should look like for our files, here: multiqc_report_100k.html