Linux: Mapping sRNA-Seq sequences with bowtie


Introduction

In this practical we will learn how to use bowtie to align short sequences to a genome.

Bowtie is the first version of a series of programs designed to align short sequences to genomes. These include: bowtie, bowtie2, tophat, tophat2, hisat y hisat2.

More recent programs have focused on allowing more flexibility during mapping, such as allowing an increased number of mismatches, better handling of gaps in the alignment, and properly dealing with introns and paired-end reads. If you are planning on analysing regular RNA-Seq data (sequencing mRNA) I recommend you try out hisat2, and definitely do not use bowtie.

Nevertheless, bowtie still remains one of the most practical options to align very short sequences, such as small RNAs, since we usually do not want to allow a large number of mismatches. In fact, in some cases we will be interested in focusing on perfectly-mapping reads, and bowtie is perfectly suited for this.

Bowtie and other aligners use the Burrows-Wheeler transform to compress the genome sequence, allowing even the human genome to fit on about 3-4 GB of RAM. The Wikipedia has a basic description of this algorith: Burrows-Wheeler_transform.

Any aligner, including bowtie, will require quite a bit of compute time and resources when working with a large genome such as that of the mouse. For this reason, we will be using a smaller part of the genome to make this practical more efficient.

You will need the following files:


Generating a genome index with bowtie

Remember we need at least one file with short sequences and a reference sequence (genome, transcriptome, chromosomes, etc). Make sure you already have these two files in your working directory.

Bowtie has two two programs we will be using: bowtie and bowtie-build. Using the latter we need to pre-process any reference sequence we want to use for alignments.

You can find a detailed explanation of all parameters in the manual.

To see the parameters for one of these programs, we ask for the help:

bowtie-build -h
## Usage: bowtie-build [options]* <reference_in> <ebwt_outfile_base>
##     reference_in            comma-separated list of files with ref sequences
##     ebwt_outfile_base       write Ebwt data to files with this dir/basename
## Options:
##     -f                      reference files are Fasta (default)
##     -c                      reference sequences given on cmd line (as <seq_in>)
##     --large-index           force generated index to be 'large', even if ref
##                             has fewer than 4 billion nucleotides
##     -C/--color              build a colorspace index
##     -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting
##     -p/--packed             use packed strings internally; slower, uses less mem
##     --bmax <int>            max bucket sz for blockwise suffix-array builder
##     --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)
##     --dcv <int>             diff-cover period for blockwise (default: 1024)
##     --nodc                  disable diff-cover (algorithm becomes quadratic)
##     -r/--noref              don't build .3/.4.ebwt (packed reference) portion
##     -3/--justref            just build .3/.4.ebwt (packed reference) portion
##     -o/--offrate <int>      SA is sampled every 2^offRate BWT chars (default: 5)
##     -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)
##     --threads <int>         # of threads
##     --ntoa                  convert Ns in reference to As
##     --seed <int>            seed for random number generator
##     -q/--quiet              verbose output (for debugging)
##     -h/--help               print detailed description of tool and its options
##     --usage                 print this usage message
##     --version               print version information and quit

There are a few advanced options that you might need to tweak at some point. For instance, if your genome has a large fraction of repeat sequences, and you want bowtie to be faster, at the expense of more memory, you can decrease the parameter --offrate to 3 or 2. But in general, we can use the defaults.

To generate our reference index, ready for mapping:

bowtie-build Heligmosomoides_bakeri.fa.gz hbak
Settings:
  Output files: "hbak.*.ebwt"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 5 (one in 32)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  Heligmosomoides_bakeri.fa.gz

This might take a couple of minutes, depending on your computer.

Once the index is finished, we can inspect the information it contains with the command bowtie-inspect:

bowtie-inspect -s hbak
## Colorspace   0
## SA-Sample    1 in 32
## FTab-Chars   10
## Sequence-1   nHp.2.0.scaf00001   1245440
## Sequence-2   nHp.2.0.scaf00002   1224629
## Sequence-3   nHp.2.0.scaf00033   674008
## Sequence-4   nHp.2.0.scaf00040   655413
## Sequence-5   nHp.2.0.scaf00052   603233
## Sequence-6   nHp.2.0.scaf00064   564909
## Sequence-7   nHp.2.0.scaf00071   550192
## Sequence-8   nHp.2.0.scaf00121   466330
## Sequence-9   nHp.2.0.scaf00149   429856
## Sequence-10  nHp.2.0.scaf00169   411674
## Sequence-11  nHp.2.0.scaf00238   372760
## Sequence-12  nHp.2.0.scaf00397   310747
## Sequence-13  nHp.2.0.scaf00728   234490
## Sequence-14  nHp.2.0.scaf00765   228632
## Sequence-15  nHp.2.0.scaf01123   180686
## Sequence-16  nHp.2.0.scaf02062   107606
## Sequence-17  nHp.2.0.scaf02588   82772
## Sequence-18  nHp.2.0.scaf03606   50104
## Sequence-19  nHp.2.0.scaf03626   49654
## Sequence-20  nHp.2.0.scaf04380   33258

This allows us to see that we have 20 sequences, with names like nHp.2.0.scaf00052 and lengths ranging from 33,258 bases to 1,245,440. This instruction can be quite useful for genomes with many sequences (usually scaffolds or contigs) or even for measuring the length of chromosomes.


Mapping short reads with bowtie

When we have our index ready, we can use the bowtie command to map our reads. This command, similarly to other mapping programs, has many different options so we can control how the alignment will be performed, and which results we would like to save. You can simply type bowtie to see a brief description of these options. For a more complete description, you can have a look at the manual.

Here I show you some of the most important options:

bowtie
Usage:
bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | --interleaved <i> | <s>} [<hit>]

  <s>     Comma-separated list of files containing unpaired reads, or the
          sequences themselves, if -c is set.  Specify "-" for stdin.
  <hit>   File to write hits to (default: stdout)

Input:
  -q                 query input files are FASTQ .fq/.fastq (default)
  -f                 query input files are (multi-)FASTA .fa/.mfa
  -5/--trim5 <int>   trim <int> bases from 5' (left) end of reads
  -3/--trim3 <int>   trim <int> bases from 3' (right) end of reads

Alignment:
  -v <int>           report end-to-end hits w/ <=v mismatches; ignore qualities
    or
  -n/--seedmms <int> max mismatches in seed (can be 0-3, default: -n 2)
  -m <int>           suppress all alignments if > <int> exist (def: no limit)

Reporting:
  -k <int>           report up to <int> good alignments per read (default: 1)
  -a/--all           report all alignments per read (much slower than low -k)
  --best             hits guaranteed best stratum; ties broken by quality

Output:
  -t/--time          print wall-clock time taken by search phases
  --al <fname>       write aligned reads/pairs to file(s) <fname>
  --un <fname>       write unaligned reads/pairs to file(s) <fname>
  --suppress <cols>  suppresses given columns (comma-delim'ed) in default output

SAM:
  -S/--sam           write hits in SAM format

Performance:
  -p/--threads <int> number of alignment threads to launch (default: 1)

The next command will map all the sequences contained in the file Pure_EV_r1_100k.fa.gz using the index hbak and will save the results in Pure_EV_r1.bwt. The other parameters specify that we only want full-length alignments (the whole read must align, not just part of it) with at most 1 mismatch (-v 1), that we only want to save one alignment for each sequence (-k 1) and that the sequences we want to align are in fasta format (-f) since the default is fastq. Finally, we are specifying that the name of the output file is Pure_EV_r1_100k_aligned.bwt

bowtie -v 1 -k 1 -f hbak Pure_EV_r1_100k.fa.gz Pure_EV_r1_100k_aligned.bwt
## # reads processed: 100000
## # reads with at least one reported alignment: 98938 (98.94%)
## # reads that failed to align: 1062 (1.06%)
## Reported 98938 alignments

We can see that the great majority of sequences did align to our index, which could be expected since our Pure_EV libraries come from purified vesicles from H. bakeri and should ideally contain no material from other species.

Let us have a look at the alignment output:

head Pure_EV_r1_100k_aligned.bwt
## D00261:444:CBEUGANXX:3:1304:8310:28415   +   nHp.2.0.scaf00169   375533  GAACAAGGGAAGGTGTAGTTCTC IIIIIIIIIIIIIIIIIIIIIII 0   
## D00261:444:CBEUGANXX:3:2306:10972:76318  -   nHp.2.0.scaf00728   177217  AGGCGAGTTCCCGGTAATGAGTC IIIIIIIIIIIIIIIIIIIIIII 0   
## D00261:444:CBEUGANXX:3:2106:8278:5384    +   nHp.2.0.scaf00052   308667  GATCTGCACTCTGAAGATTGTC  IIIIIIIIIIIIIIIIIIIIII  1   
## D00261:444:CBEUGANXX:3:2112:4253:27475   +   nHp.2.0.scaf00071   108471  GGGTCGGACACGAACCGAGGAG  IIIIIIIIIIIIIIIIIIIIII  0   
## D00261:444:CBEUGANXX:3:2201:1709:43406   +   nHp.2.0.scaf00071   418960  GATGGTAATGCACTATGATGAGACAAG IIIIIIIIIIIIIIIIIIIIIIIIIII 0   
## D00261:444:CBEUGANXX:3:1112:5665:98050   +   nHp.2.0.scaf00033   486721  GTGATTCATGCCTCTCTGATTTATG   IIIIIIIIIIIIIIIIIIIIIIIII   0   
## D00261:444:CBEUGANXX:3:1310:19455:10530  +   nHp.2.0.scaf00033   485367  GTGTGGGGAGCTGAATGAAAAAG IIIIIIIIIIIIIIIIIIIIIII 0   
## D00261:444:CBEUGANXX:3:2103:8441:92270   +   nHp.2.0.scaf00238   127002  CTCGAACTCCGCTACCGTAAT   IIIIIIIIIIIIIIIIIIIII   0   
## D00261:444:CBEUGANXX:3:2215:6088:78554   +   nHp.2.0.scaf00033   495976  GTTGGTTATTGATTGTTGGTGTAT    IIIIIIIIIIIIIIIIIIIIIIII    0   
## D00261:444:CBEUGANXX:3:1305:9448:100465  +   nHp.2.0.scaf00002   970355  GGAGAAGACAGTCGGACGTGA   IIIIIIIIIIIIIIIIIIIII   0   

The columns are:

  1. read name
  2. strand (+/-)
  3. reference name
  4. reference position (starting at 0)
  5. read sequence (rev-comp reported when strand is ‘-’)
  6. read quality, in Phred33 (rev reported when strand is ‘-’)
  7. total number of identical positions in the reference
  8. mismatches (starting at 0 from the 5’ end)


Exercises

  • Try to align the other files to the same H. bakeri index.
  • Remember that the MODEK_ctrl libraries represent pure mouse cells, MODEK_EV are mouse cells treated with H. bakeri vesicles, and the Pure_EV libraries are from purified vesicles.
  • What would you expect to find when you align these different kinds of libraries against the H. bakeri reference? What is your result? How do you interpret this?
  • Advanced: you can download the mouse genome sequences provided at the top of this practical, and see if by using this as a reference you can better understand what is going on.


SAM/BAM format and samtools

Due to the great variety of alignment or mapping tools for short reads against genomes, there was a need to design a standard format that all tools could follow. This has the advantage that downstream tools (such as those designed to find SNPs, detect structural variants, or quantify gene expression) can be designed to expect one single input format. This is the reason why the SAM (Sequence Alignment/Map) format was designed, and samtools was developed to manipulate this format (www.htslib.org).

The SAM format is designed to be simple, lightweight and easy to process by a computer. It also allows flexibility to include a lot of information about the alignments, that other programs might require. The format is described in the following document: SAM1.pdf.

To get a better idea of how to work with this format, we will use bowtie again to repeat the previous alignment, but now saving the output in SAM format:

bowtie -S -v 1 -k 1 -f hbak Pure_EV_r1_100k.fa.gz Pure_EV_r1_100k_aligned.sam
## # reads processed: 100000
## # reads with at least one reported alignment: 98938 (98.94%)
## # reads that failed to align: 1062 (1.06%)
## Reported 98938 alignments

Let’s have a look at the SAM output:

head -n 25 Pure_EV_r1_100k_aligned.sam
## @HD  VN:1.0  SO:unsorted
## @SQ  SN:nHp.2.0.scaf00001    LN:1245440
## @SQ  SN:nHp.2.0.scaf00002    LN:1224629
## @SQ  SN:nHp.2.0.scaf00033    LN:674008
## @SQ  SN:nHp.2.0.scaf00040    LN:655413
## @SQ  SN:nHp.2.0.scaf00052    LN:603233
## @SQ  SN:nHp.2.0.scaf00064    LN:564909
## @SQ  SN:nHp.2.0.scaf00071    LN:550192
## @SQ  SN:nHp.2.0.scaf00121    LN:466330
## @SQ  SN:nHp.2.0.scaf00149    LN:429856
## @SQ  SN:nHp.2.0.scaf00169    LN:411674
## @SQ  SN:nHp.2.0.scaf00238    LN:372760
## @SQ  SN:nHp.2.0.scaf00397    LN:310747
## @SQ  SN:nHp.2.0.scaf00728    LN:234490
## @SQ  SN:nHp.2.0.scaf00765    LN:228632
## @SQ  SN:nHp.2.0.scaf01123    LN:180686
## @SQ  SN:nHp.2.0.scaf02062    LN:107606
## @SQ  SN:nHp.2.0.scaf02588    LN:82772
## @SQ  SN:nHp.2.0.scaf03606    LN:50104
## @SQ  SN:nHp.2.0.scaf03626    LN:49654
## @SQ  SN:nHp.2.0.scaf04380    LN:33258
## @PG  ID:Bowtie   VN:1.2.3    CL:"/Users/cei/src/bowtie-1.2.3-macos-x86_64/bowtie-align-s --wrapper basic-0 -S -v 1 -k 1 -f hbak Pure_EV_r1_100k.fa.gz Pure_EV_r1_100k_aligned.sam"
## D00261:444:CBEUGANXX:3:1304:8310:28415   0   nHp.2.0.scaf00169   375534  255 23M *   0   0   GAACAAGGGAAGGTGTAGTTCTC IIIIIIIIIIIIIIIIIIIIIII XA:i:0  MD:Z:23 NM:i:0  XM:i:2
## D00261:444:CBEUGANXX:3:2306:10972:76318  16  nHp.2.0.scaf00728   177218  255 23M *   0   0   AGGCGAGTTCCCGGTAATGAGTC IIIIIIIIIIIIIIIIIIIIIII XA:i:0  MD:Z:23 NM:i:0  XM:i:2
## D00261:444:CBEUGANXX:3:2106:8278:5384    0   nHp.2.0.scaf00052   308668  255 22M *   0   0   GATCTGCACTCTGAAGATTGTC  IIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:22 NM:i:0  XM:i:2

The lines that begin with a @ are the headers that include meta-data about the alignment. In this case we can see information about the version of the program, if the file is sorted, the name and length of the reference sequences, and information about how bowtie (in this case) was executed.

Each remaining line corresponds to an alignment between an input sequence and a reference. The first 11 columns are required (followed by optional columns) and contain the following information:

  1. qname - read name
  2. flag - see explain-flags
  3. rname - reference name
  4. pos - reference position (starting at 1)
  5. mapq - mapping quality, in Phred, probability of having found the correct mapping to the reference
  6. cigar - encoded representation of the alignment
  7. rnext - reference name where the other pair aligned (only for ‘paired-end’)
  8. pnext - reference position where the other pair aligned (only for ‘paired-end’)
  9. tlen - length of the full fragment (only for ‘paired-end’)
  10. seq - read sequence
  11. qual - read quality

SAM files are stored in plain text, and we can view their contents directly. But, they take up a lot of space. The equivalent binary format is called BAM, and has the advantage of being compressed and is better suited for further processing. But, we need programs like samtools to view the contents of a BAM file.

To get an idea of the kinds of operations you might want to do with samtools I am showing you a list of commands. Try them out, they should give you a general idea of what can be done with samtools.

samtools

samtools view

samtools view -b reads.sam > reads.bam

samtools sort reads.bam > reads_sort.bam

samtools view reads.bam | cut -f 3,4 | head

samtools view reads_sort.bam | cut -f 3,4 | head

samtools index reads_sort.bam

Several genome browsers can directly import alignments in sorted BAM format, so we can see where our aligned reads are piling up across the chromosomes. One example of such a browser is (IGV), which you can install and play around with.