In this practical we will learn how to use bowtie to align short sequences to a genome.
Bowtie is the first version of a series of programs designed to align short sequences to genomes. These include: bowtie, bowtie2, tophat, tophat2, hisat y hisat2.
More recent programs have focused on allowing more flexibility during mapping, such as allowing an increased number of mismatches, better handling of gaps in the alignment, and properly dealing with introns and paired-end reads. If you are planning on analysing regular RNA-Seq data (sequencing mRNA) I recommend you try out hisat2, and definitely do not use bowtie.
Nevertheless, bowtie still remains one of the most practical options to align very short sequences, such as small RNAs, since we usually do not want to allow a large number of mismatches. In fact, in some cases we will be interested in focusing on perfectly-mapping reads, and bowtie is perfectly suited for this.
Bowtie and other aligners use the Burrows-Wheeler transform to compress the genome sequence, allowing even the human genome to fit on about 3-4 GB of RAM. The Wikipedia has a basic description of this algorith: Burrows-Wheeler_transform.
Any aligner, including bowtie, will require quite a bit of compute time and resources when working with a large genome such as that of the mouse. For this reason, we will be using a smaller part of the genome to make this practical more efficient.
You will need the following files:
Remember we need at least one file with short sequences and a reference sequence (genome, transcriptome, chromosomes, etc). Make sure you already have these two files in your working directory.
Bowtie has two two programs we will be using: bowtie and bowtie-build. Using the latter we need to pre-process any reference sequence we want to use for alignments.
You can find a detailed explanation of all parameters in the manual.
To see the parameters for one of these programs, we ask for the help:
bowtie-build -h## Usage: bowtie-build [options]* <reference_in> <ebwt_outfile_base>
## reference_in comma-separated list of files with ref sequences
## ebwt_outfile_base write Ebwt data to files with this dir/basename
## Options:
## -f reference files are Fasta (default)
## -c reference sequences given on cmd line (as <seq_in>)
## --large-index force generated index to be 'large', even if ref
## has fewer than 4 billion nucleotides
## -C/--color build a colorspace index
## -a/--noauto disable automatic -p/--bmax/--dcv memory-fitting
## -p/--packed use packed strings internally; slower, uses less mem
## --bmax <int> max bucket sz for blockwise suffix-array builder
## --bmaxdivn <int> max bucket sz as divisor of ref len (default: 4)
## --dcv <int> diff-cover period for blockwise (default: 1024)
## --nodc disable diff-cover (algorithm becomes quadratic)
## -r/--noref don't build .3/.4.ebwt (packed reference) portion
## -3/--justref just build .3/.4.ebwt (packed reference) portion
## -o/--offrate <int> SA is sampled every 2^offRate BWT chars (default: 5)
## -t/--ftabchars <int> # of chars consumed in initial lookup (default: 10)
## --threads <int> # of threads
## --ntoa convert Ns in reference to As
## --seed <int> seed for random number generator
## -q/--quiet verbose output (for debugging)
## -h/--help print detailed description of tool and its options
## --usage print this usage message
## --version print version information and quit
There are a few advanced options that you might need to tweak at some point. For instance, if your genome has a large fraction of repeat sequences, and you want bowtie to be faster, at the expense of more memory, you can decrease the parameter --offrate to 3 or 2. But in general, we can use the defaults.
To generate our reference index, ready for mapping:
bowtie-build Heligmosomoides_bakeri.fa.gz hbakSettings:
Output files: "hbak.*.ebwt"
Line rate: 6 (line is 64 bytes)
Lines per side: 1 (side is 64 bytes)
Offset rate: 5 (one in 32)
FTable chars: 10
Strings: unpacked
Max bucket size: default
Max bucket size, sqrt multiplier: default
Max bucket size, len divisor: 4
Difference-cover sample period: 1024
Endianness: little
Actual local endianness: little
Sanity checking: disabled
Assertions: disabled
Random seed: 0
Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
Heligmosomoides_bakeri.fa.gz
This might take a couple of minutes, depending on your computer.
Once the index is finished, we can inspect the information it contains with the command bowtie-inspect:
bowtie-inspect -s hbak## Colorspace 0
## SA-Sample 1 in 32
## FTab-Chars 10
## Sequence-1 nHp.2.0.scaf00001 1245440
## Sequence-2 nHp.2.0.scaf00002 1224629
## Sequence-3 nHp.2.0.scaf00033 674008
## Sequence-4 nHp.2.0.scaf00040 655413
## Sequence-5 nHp.2.0.scaf00052 603233
## Sequence-6 nHp.2.0.scaf00064 564909
## Sequence-7 nHp.2.0.scaf00071 550192
## Sequence-8 nHp.2.0.scaf00121 466330
## Sequence-9 nHp.2.0.scaf00149 429856
## Sequence-10 nHp.2.0.scaf00169 411674
## Sequence-11 nHp.2.0.scaf00238 372760
## Sequence-12 nHp.2.0.scaf00397 310747
## Sequence-13 nHp.2.0.scaf00728 234490
## Sequence-14 nHp.2.0.scaf00765 228632
## Sequence-15 nHp.2.0.scaf01123 180686
## Sequence-16 nHp.2.0.scaf02062 107606
## Sequence-17 nHp.2.0.scaf02588 82772
## Sequence-18 nHp.2.0.scaf03606 50104
## Sequence-19 nHp.2.0.scaf03626 49654
## Sequence-20 nHp.2.0.scaf04380 33258
This allows us to see that we have 20 sequences, with names like nHp.2.0.scaf00052 and lengths ranging from 33,258 bases to 1,245,440. This instruction can be quite useful for genomes with many sequences (usually scaffolds or contigs) or even for measuring the length of chromosomes.
When we have our index ready, we can use the bowtie command to map our reads. This command, similarly to other mapping programs, has many different options so we can control how the alignment will be performed, and which results we would like to save. You can simply type bowtie to see a brief description of these options. For a more complete description, you can have a look at the manual.
Here I show you some of the most important options:
bowtieUsage:
bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | --interleaved <i> | <s>} [<hit>]
<s> Comma-separated list of files containing unpaired reads, or the
sequences themselves, if -c is set. Specify "-" for stdin.
<hit> File to write hits to (default: stdout)
Input:
-q query input files are FASTQ .fq/.fastq (default)
-f query input files are (multi-)FASTA .fa/.mfa
-5/--trim5 <int> trim <int> bases from 5' (left) end of reads
-3/--trim3 <int> trim <int> bases from 3' (right) end of reads
Alignment:
-v <int> report end-to-end hits w/ <=v mismatches; ignore qualities
or
-n/--seedmms <int> max mismatches in seed (can be 0-3, default: -n 2)
-m <int> suppress all alignments if > <int> exist (def: no limit)
Reporting:
-k <int> report up to <int> good alignments per read (default: 1)
-a/--all report all alignments per read (much slower than low -k)
--best hits guaranteed best stratum; ties broken by quality
Output:
-t/--time print wall-clock time taken by search phases
--al <fname> write aligned reads/pairs to file(s) <fname>
--un <fname> write unaligned reads/pairs to file(s) <fname>
--suppress <cols> suppresses given columns (comma-delim'ed) in default output
SAM:
-S/--sam write hits in SAM format
Performance:
-p/--threads <int> number of alignment threads to launch (default: 1)
The next command will map all the sequences contained in the file Pure_EV_r1_100k.fa.gz using the index hbak and will save the results in Pure_EV_r1.bwt. The other parameters specify that we only want full-length alignments (the whole read must align, not just part of it) with at most 1 mismatch (-v 1), that we only want to save one alignment for each sequence (-k 1) and that the sequences we want to align are in fasta format (-f) since the default is fastq. Finally, we are specifying that the name of the output file is Pure_EV_r1_100k_aligned.bwt
bowtie -v 1 -k 1 -f hbak Pure_EV_r1_100k.fa.gz Pure_EV_r1_100k_aligned.bwt## # reads processed: 100000
## # reads with at least one reported alignment: 98938 (98.94%)
## # reads that failed to align: 1062 (1.06%)
## Reported 98938 alignments
We can see that the great majority of sequences did align to our index, which could be expected since our Pure_EV libraries come from purified vesicles from H. bakeri and should ideally contain no material from other species.
Let us have a look at the alignment output:
head Pure_EV_r1_100k_aligned.bwt## D00261:444:CBEUGANXX:3:1304:8310:28415 + nHp.2.0.scaf00169 375533 GAACAAGGGAAGGTGTAGTTCTC IIIIIIIIIIIIIIIIIIIIIII 0
## D00261:444:CBEUGANXX:3:2306:10972:76318 - nHp.2.0.scaf00728 177217 AGGCGAGTTCCCGGTAATGAGTC IIIIIIIIIIIIIIIIIIIIIII 0
## D00261:444:CBEUGANXX:3:2106:8278:5384 + nHp.2.0.scaf00052 308667 GATCTGCACTCTGAAGATTGTC IIIIIIIIIIIIIIIIIIIIII 1
## D00261:444:CBEUGANXX:3:2112:4253:27475 + nHp.2.0.scaf00071 108471 GGGTCGGACACGAACCGAGGAG IIIIIIIIIIIIIIIIIIIIII 0
## D00261:444:CBEUGANXX:3:2201:1709:43406 + nHp.2.0.scaf00071 418960 GATGGTAATGCACTATGATGAGACAAG IIIIIIIIIIIIIIIIIIIIIIIIIII 0
## D00261:444:CBEUGANXX:3:1112:5665:98050 + nHp.2.0.scaf00033 486721 GTGATTCATGCCTCTCTGATTTATG IIIIIIIIIIIIIIIIIIIIIIIII 0
## D00261:444:CBEUGANXX:3:1310:19455:10530 + nHp.2.0.scaf00033 485367 GTGTGGGGAGCTGAATGAAAAAG IIIIIIIIIIIIIIIIIIIIIII 0
## D00261:444:CBEUGANXX:3:2103:8441:92270 + nHp.2.0.scaf00238 127002 CTCGAACTCCGCTACCGTAAT IIIIIIIIIIIIIIIIIIIII 0
## D00261:444:CBEUGANXX:3:2215:6088:78554 + nHp.2.0.scaf00033 495976 GTTGGTTATTGATTGTTGGTGTAT IIIIIIIIIIIIIIIIIIIIIIII 0
## D00261:444:CBEUGANXX:3:1305:9448:100465 + nHp.2.0.scaf00002 970355 GGAGAAGACAGTCGGACGTGA IIIIIIIIIIIIIIIIIIIII 0
The columns are:
MODEK_ctrl libraries represent pure mouse cells, MODEK_EV are mouse cells treated with H. bakeri vesicles, and the Pure_EV libraries are from purified vesicles.Due to the great variety of alignment or mapping tools for short reads against genomes, there was a need to design a standard format that all tools could follow. This has the advantage that downstream tools (such as those designed to find SNPs, detect structural variants, or quantify gene expression) can be designed to expect one single input format. This is the reason why the SAM (Sequence Alignment/Map) format was designed, and samtools was developed to manipulate this format (www.htslib.org).
The SAM format is designed to be simple, lightweight and easy to process by a computer. It also allows flexibility to include a lot of information about the alignments, that other programs might require. The format is described in the following document: SAM1.pdf.
To get a better idea of how to work with this format, we will use bowtie again to repeat the previous alignment, but now saving the output in SAM format:
bowtie -S -v 1 -k 1 -f hbak Pure_EV_r1_100k.fa.gz Pure_EV_r1_100k_aligned.sam## # reads processed: 100000
## # reads with at least one reported alignment: 98938 (98.94%)
## # reads that failed to align: 1062 (1.06%)
## Reported 98938 alignments
Let’s have a look at the SAM output:
head -n 25 Pure_EV_r1_100k_aligned.sam## @HD VN:1.0 SO:unsorted
## @SQ SN:nHp.2.0.scaf00001 LN:1245440
## @SQ SN:nHp.2.0.scaf00002 LN:1224629
## @SQ SN:nHp.2.0.scaf00033 LN:674008
## @SQ SN:nHp.2.0.scaf00040 LN:655413
## @SQ SN:nHp.2.0.scaf00052 LN:603233
## @SQ SN:nHp.2.0.scaf00064 LN:564909
## @SQ SN:nHp.2.0.scaf00071 LN:550192
## @SQ SN:nHp.2.0.scaf00121 LN:466330
## @SQ SN:nHp.2.0.scaf00149 LN:429856
## @SQ SN:nHp.2.0.scaf00169 LN:411674
## @SQ SN:nHp.2.0.scaf00238 LN:372760
## @SQ SN:nHp.2.0.scaf00397 LN:310747
## @SQ SN:nHp.2.0.scaf00728 LN:234490
## @SQ SN:nHp.2.0.scaf00765 LN:228632
## @SQ SN:nHp.2.0.scaf01123 LN:180686
## @SQ SN:nHp.2.0.scaf02062 LN:107606
## @SQ SN:nHp.2.0.scaf02588 LN:82772
## @SQ SN:nHp.2.0.scaf03606 LN:50104
## @SQ SN:nHp.2.0.scaf03626 LN:49654
## @SQ SN:nHp.2.0.scaf04380 LN:33258
## @PG ID:Bowtie VN:1.2.3 CL:"/Users/cei/src/bowtie-1.2.3-macos-x86_64/bowtie-align-s --wrapper basic-0 -S -v 1 -k 1 -f hbak Pure_EV_r1_100k.fa.gz Pure_EV_r1_100k_aligned.sam"
## D00261:444:CBEUGANXX:3:1304:8310:28415 0 nHp.2.0.scaf00169 375534 255 23M * 0 0 GAACAAGGGAAGGTGTAGTTCTC IIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:23 NM:i:0 XM:i:2
## D00261:444:CBEUGANXX:3:2306:10972:76318 16 nHp.2.0.scaf00728 177218 255 23M * 0 0 AGGCGAGTTCCCGGTAATGAGTC IIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:23 NM:i:0 XM:i:2
## D00261:444:CBEUGANXX:3:2106:8278:5384 0 nHp.2.0.scaf00052 308668 255 22M * 0 0 GATCTGCACTCTGAAGATTGTC IIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:22 NM:i:0 XM:i:2
The lines that begin with a @ are the headers that include meta-data about the alignment. In this case we can see information about the version of the program, if the file is sorted, the name and length of the reference sequences, and information about how bowtie (in this case) was executed.
Each remaining line corresponds to an alignment between an input sequence and a reference. The first 11 columns are required (followed by optional columns) and contain the following information:
SAM files are stored in plain text, and we can view their contents directly. But, they take up a lot of space. The equivalent binary format is called BAM, and has the advantage of being compressed and is better suited for further processing. But, we need programs like samtools to view the contents of a BAM file.
To get an idea of the kinds of operations you might want to do with samtools I am showing you a list of commands. Try them out, they should give you a general idea of what can be done with samtools.
samtools
samtools view
samtools view -b reads.sam > reads.bam
samtools sort reads.bam > reads_sort.bam
samtools view reads.bam | cut -f 3,4 | head
samtools view reads_sort.bam | cut -f 3,4 | head
samtools index reads_sort.bamSeveral genome browsers can directly import alignments in sorted BAM format, so we can see where our aligned reads are piling up across the chromosomes. One example of such a browser is (IGV), which you can install and play around with.