Linux: Mapping sRNA-Seq sequences with bowtie


Introduction

In this practical we will learn how to use bowtie to align short sequences to a genome.

Bowtie is the first version of a series of programs designed to align short sequences to genomes. These include: bowtie, bowtie2, tophat, tophat2, hisat y hisat2.

More recent programs have focused on allowing more flexibility during mapping, such as allowing an increased number of mismatches, better handling of gaps in the alignment, and properly dealing with introns and paired-end reads. If you are planning on analysing regular RNA-Seq data (sequencing mRNA) I recommend you try out hisat2, and definitely do not use bowtie.

Nevertheless, bowtie still remains one of the most practical options to align very short sequences, such as small RNAs, since we usually do not want to allow a large number of mismatches. In fact, in some cases we will be interested in focusing on perfectly-mapping reads, and bowtie is perfectly suited for this.

Bowtie and other aligners use the Burrows-Wheeler transform to compress the genome sequence, allowing even the human genome to fit on about 3-4 GB of RAM. The Wikipedia has a basic description of this algorith: Burrows-Wheeler_transform.

Any aligner, including bowtie, will require quite a bit of compute time and resources when working with a large genome. For this reason, we will be using a smaller part of the genome to make this practical more efficient. The file I have prepared contains the first 10,000,000 nt of the first three chromosomes of Heligmosomoides bakeri. In addition, I have included sequences from NCBI’s UniVec Database, which is a common resource to identify potential contaminants resulting from vectors, linkers and primers commonly used during cloning experiments.

You will need to download the following two files:

These last files are essentially the same ones as you produced at the end of the last practical. Remember that to extract all the files from a tar file you need to execute: tar -xvf sRNA_cutadapt_100k.tar.


Generating a genome index with bowtie

Remember we need at least one file with short sequences and a reference sequence (genome, transcriptome, chromosomes, etc). Make sure you already have these two files in your working directory.

Bowtie has two programs we will be using: bowtie and bowtie-build. We first need to use bowtie-build to pre-process any reference sequence we want to use for alignments.

You can find a detailed explanation of all parameters in the manual.

To see the parameters for one of these programs, we ask for the help:

bowtie-build -h
## Usage: bowtie-build [options]* <reference_in> <ebwt_outfile_base>
##     reference_in            comma-separated list of files with ref sequences
##     ebwt_outfile_base       write Ebwt data to files with this dir/basename
## Options:
##     -f                      reference files are Fasta (default)
##     -c                      reference sequences given on cmd line (as <seq_in>)
##     --large-index           force generated index to be 'large', even if ref
##                             has fewer than 4 billion nucleotides
##     -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting
##     -p/--packed             use packed strings internally; slower, uses less mem
##     --bmax <int>            max bucket sz for blockwise suffix-array builder
##     --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)
##     --dcv <int>             diff-cover period for blockwise (default: 1024)
##     --nodc                  disable diff-cover (algorithm becomes quadratic)
##     -r/--noref              don't build .3/.4.ebwt (packed reference) portion
##     -3/--justref            just build .3/.4.ebwt (packed reference) portion
##     -o/--offrate <int>      SA is sampled every 2^offRate BWT chars (default: 5)
##     -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)
##     --threads <int>         # of threads
##     --ntoa                  convert Ns in reference to As
##     --seed <int>            seed for random number generator
##     -q/--quiet              verbose output (for debugging)
##     -h/--help               print detailed description of tool and its options
##     --usage                 print this usage message
##     --version               print version information and quit

There are a few advanced options that you might need to tweak at some point. For instance, if your genome has a large fraction of repeat sequences, and you want bowtie to be faster, at the expense of more memory, you can decrease the parameter --offrate to 3 or 2. But in general, we can use the defaults.

To generate our reference index, ready for mapping:

bowtie-build Hbakeri_small_UniVec.fa.gz hbak
Settings:
  Output files: "hbak.*.ebwt"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 5 (one in 32)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  Hbakeri_small_UniVec.fa.gz

This might take a couple of minutes, depending on your computer.

Once the index is finished, we can inspect the information it contains with the command bowtie-inspect:

bowtie-inspect -s hbak | head

bowtie-inspect -s hbak | tail
## SA-Sample    1 in 32
## FTab-Chars   10
## Sequence-1   I   10000000
## Sequence-2   II  10000000
## Sequence-3   III 10000000
## Sequence-4   univec|X66730.1:1-2687-49 B.bronchiseptica plasmid pBBR1 genes for mobilization and replication 2736
## Sequence-5   univec|J01749.1:1-4361-49 Cloning vector pBR322 4410
## Sequence-6   univec|J02400.1:1-5243-49 Simian virus 40   5292
## Sequence-7   univec|J02482.1:1-5386-49 Coliphage phi-X174    5435
## Sequence-8   univec|V00604.2:1-6407-49 Phage M13 genome  6456
## Sequence-398 univec|AF310136.1:4127-5198 Plasposon NKBOR 1072
## Sequence-399 univec|L19900.1:1-667 Cloning vector cosmid svPHEP DNA sequence encoding beta-lactamase and HSV thymidine kinase genes  667
## Sequence-400 univec|L19900.1:803-1010 Cloning vector cosmid svPHEP DNA sequence encoding beta-lactamase and HSV thymidine kinase genes   208
## Sequence-401 univec|L19900.1:2493-3111 Cloning vector cosmid svPHEP DNA sequence encoding beta-lactamase and HSV thymidine kinase genes  619
## Sequence-402 univec|M99566.1:573-1230 sCos cloning vector SfiI containing bacteriophage promoters and flanking restriction sites in sCos vectors 658
## Sequence-403 univec|M99566.1:1654-2277 sCos cloning vector SfiI containing bacteriophage promoters and flanking restriction sites in sCos vectors    624
## Sequence-404 univec|D16587.1:1-258 Unidentified cloning vector pNTT104 DNA, vector for E. coli, polylinker sequence  258
## Sequence-405 univec|NGB00603.1:1-319 Cloning vector pET-28a cloning and expression region    319
## Sequence-406 univec|NGB00075.1:11-234 pFLC-I polylinker for RH clones    224
## Sequence-407 univec|NGB00153.1:1-234 pUCm-T multiple cloning site    234

This allows us to see that we have 407 sequences. The first three sequences are much longer (10,000,000 nt) and are named I, II, III. These are the first part of the H. bakeri chromosomes. The rest are the UniVec sequences I mentioned above.

The bowtie-inspect command can be useful to list the contents of genomes with many sequences (usually scaffolds or contigs) or even for measuring the length of chromosomes.


Mapping short reads with bowtie

When we have our index ready, we can use the bowtie command to map our reads. This command, similarly to other mapping programs, has many different options so we can control how the alignment will be performed, and which results we would like to save. You can simply type bowtie to see a brief description of these options. For a more complete description, you can have a look at the manual.

Here I show you some of the most important options:

bowtie
No index, query, or output file specified!
Usage: 
bowtie [options]* -x <ebwt> {-1 <m1> -2 <m2> | --12 <r> | --interleaved <i> | <s>} [<hit>]

  <ebwt>  Index filename prefix (minus trailing .X.ebwt).
  <m1>    Comma-separated list of files containing upstream mates (or the
          sequences themselves, if -c is set) paired with mates in <m2>
  <m2>    Comma-separated list of files containing downstream mates (or the
          sequences themselves if -c is set) paired with mates in <m1>
  <r>     Comma-separated list of files containing Crossbow-style reads.  Can be
          a mixture of paired and unpaired.  Specify "-" for stdin.
  <i>     Files with interleaved paired-end FASTQ reads.
  <s>     Comma-separated list of files containing unpaired reads, or the
          sequences themselves, if -c is set.  Specify "-" for stdin.
  <hit>   File to write hits to (default: stdout)
          
Input:
  -q                 query input files are FASTQ .fq/.fastq (default)
  -f                 query input files are (multi-)FASTA .fa/.mfa
  -5/--trim5 <int>   trim <int> bases from 5' (left) end of reads
  -3/--trim3 <int>   trim <int> bases from 3' (right) end of reads

Alignment:
  -v <int>           report end-to-end hits w/ <=v mismatches; ignore qualities
    or
  -n/--seedmms <int> max mismatches in seed (can be 0-3, default: -n 2)
  -m <int>           suppress all alignments if > <int> exist (def: no limit)

Reporting:
  -k <int>           report up to <int> good alignments per read (default: 1)
  -a/--all           report all alignments per read (much slower than low -k)
  --best             hits guaranteed best stratum; ties broken by quality

Output:
  -t/--time          print wall-clock time taken by search phases
  --al <fname>       write aligned reads/pairs to file(s) <fname>
  --un <fname>       write unaligned reads/pairs to file(s) <fname>
  --suppress <cols>  suppresses given columns (comma-delim'ed) in default output

SAM:
  -S/--sam           write hits in SAM format

Performance:
  -p/--threads <int> number of alignment threads to launch (default: 1)

The next command will use the index (-x) hbak to map all the sequences contained in the file Pure_EV_r1_100k.trimmed.fa.gz and will save the alignment results in Pure_EV_r1.aligned.bwt. The other parameters specify that we only want full-length alignments (the whole read must align, not just part of it) with at most 1 mismatch (-v 1), that we only want to save one alignment for each sequence (-k 1) and that the sequences we want to align are in fasta format (-f) since the default is fastq.

bowtie -v 1 -k 1 -f -x hbak Pure_EV_r1_100k.trimmed.fa.gz Pure_EV_r1_100k.aligned.bwt
## # reads processed: 99975
## # reads with at least one alignment: 90810 (90.83%)
## # reads that failed to align: 9165 (9.17%)
## Reported 90810 alignments

We can see that the great majority of sequences did align to our index, which could be expected since our Pure_EV libraries come from purified vesicles from H. bakeri and should ideally contain no material from other species.

Let us have a look at the alignment output:

head Pure_EV_r1_100k.aligned.bwt
## D00261:444:CBEUGANXX:3:1105:9313:2248 1:N:0:AGTCAA   -   II  2384331 TTTACGACCACGACGCTTCCAC  IIIIIIIIIIIIIIIIIIIIII  0   16:A>G
## D00261:444:CBEUGANXX:3:1208:12895:95475 1:N:0:AGTCAA -   I   553644  ACCCGACAGAGTCACATATCTC  IIIIIIIIIIIIIIIIIIIIII  0   
## D00261:444:CBEUGANXX:3:1203:1482:96832 1:N:0:AGTCAA  -   I   464568  CCAGCTATTCCAATCCCAACCTTC    IIIIIIIIIIIIIIIIIIIIIIII    0   
## D00261:444:CBEUGANXX:3:2314:8667:62576 1:N:0:AGTCAA  +   I   9467610 GTACCGACTTGCAAAGAATCGCC IIIIIIIIIIIIIIIIIIIIIII 9   
## D00261:444:CBEUGANXX:3:1203:14699:72177 1:N:0:AGTCAA +   I   5014505 GGCGATCGTGGATTTGGCGG    IIIIIIIIIIIIIIIIIIII    0   
## D00261:444:CBEUGANXX:3:2315:17872:58338 1:N:0:AGTCAA +   II  5699454 GTACTTTGAAGATCGCTCCTCCGA    IIIIIIIIIIIIIIIIIIIIIIII    2   
## D00261:444:CBEUGANXX:3:1105:14256:2373 1:N:0:AGTCAA  +   I   9467128 GAAGATCGTGCCGTATGAGACGG IIIIIIIIIIIIIIIIIIIIIII 14  
## D00261:444:CBEUGANXX:3:2312:3892:98535 1:N:0:AGTCAA  -   II  5328543 CTACACTTCAGCCGCCGGG IIIIIIIIIIIIIIIIIII 0   2:T>G
## D00261:444:CBEUGANXX:3:1109:1994:86472 1:N:0:AGTCAA  -   I   554745  TCGTTCCAAGCAATACCACAAAC IIIIIIIIIIIIIIIIIIIIIII 0   
## D00261:444:CBEUGANXX:3:1105:20944:2331 1:N:0:AGTCAA  +   II  4221840 GGTATCAGACGTTGTGGAGGG   IIIIIIIIIIIIIIIIIIIII   3   

The columns are:

  1. read name
  2. strand (+/-)
  3. reference name
  4. reference position (starting at 0)
  5. read sequence (rev-comp reported when strand is ‘-’)
  6. read quality, in Phred33 (rev reported when strand is ‘-’)
  7. total number of identical positions in the reference
  8. mismatches (starting at 0 from the 5’ end)


Knowing a bit of Linux allows us to explore some more questions, taking advantage of this simple output format.

  • How many reads aligned?
wc -l Pure_EV_r1_100k.aligned.bwt
##    90810 Pure_EV_r1_100k.aligned.bwt
  • How many to each chromosome (or UniVec sequence)?
cut -f 3 Pure_EV_r1_100k.aligned.bwt | sort | uniq -c
## 51414 I
## 19807 II
## 19587 III
##    2 univec|AF327712.1:16794-18581

So there are some reads that appear to be contaminants! In this case the number is probably too small to cause any problems in our analyses.

Exercises

  • Try to align other files to the same H. bakeri index.
  • Remember that the MODEK_ctrl libraries represent pure mouse cells, MODEK_EV are mouse cells treated with H. bakeri vesicles, and the Pure_EV libraries are from purified vesicles.
  • What would you expect to find when you align these different kinds of libraries against the H. bakeri reference? What is your result? How do you interpret this?


SAM format and samtools

Due to the great variety of alignment or mapping tools for short reads against genomes, there was a need to design a standard format that all tools could follow. This has the advantage that downstream tools (such as those designed to find SNPs, detect structural variants, or quantify gene expression) can be designed to expect one single input format. This is the reason why the SAM (Sequence Alignment/Map) format was designed, and samtools was developed to manipulate this format (www.htslib.org).

The SAM format is designed to be simple, lightweight and easy to process by a computer. It also allows flexibility to include a lot of information about the alignments, that other programs might require. The format is described in the following document: SAM1.pdf.

To get a better idea of how to work with this format, we will use bowtie again to repeat the previous alignment, but now saving the output in SAM format (-S):

bowtie -S -v 1 -k 1 -f -x hbak Pure_EV_r1_100k.trimmed.fa.gz Pure_EV_r1_100k.aligned.sam
## # reads processed: 99975
## # reads with at least one alignment: 90810 (90.83%)
## # reads that failed to align: 9165 (9.17%)
## Reported 90810 alignments

Let’s have a look at the SAM output:

head -n 10 Pure_EV_r1_100k.aligned.sam
## @HD  VN:1.0  SO:unsorted
## @SQ  SN:I    LN:10000000
## @SQ  SN:II   LN:10000000
## @SQ  SN:III  LN:10000000
## @SQ  SN:univec|X66730.1:1-2687-49    LN:2736
## @SQ  SN:univec|J01749.1:1-4361-49    LN:4410
## @SQ  SN:univec|J02400.1:1-5243-49    LN:5292
## @SQ  SN:univec|J02482.1:1-5386-49    LN:5435
## @SQ  SN:univec|V00604.2:1-6407-49    LN:6456
## @SQ  SN:univec|J02459.1:1-48502-49   LN:48551

The lines that begin with a @ are the headers that include meta-data about the alignment. In this case we can see information about the version of the program, if the file is sorted, the name and length of the reference sequences.

tail -n 10 Pure_EV_r1_100k.aligned.sam
## D00261:444:CBEUGANXX:3:2303:12416:92715  0   I   7373145 255 26M *   0   0   TCCGTTGTGGTCTAGTGGTTAGGATT  IIIIIIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:26 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:2214:4626:10034   16  I   553550  255 23M *   0   0   TCTTCTCCGATCGAAGGATGTGC IIIIIIIIIIIIIIIIIIIIIII XA:i:0  MD:Z:23 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:2312:3644:36731   0   II  7491950 255 23M *   0   0   GTACCTTGGATCAGCAGTCGCGT IIIIIIIIIIIIIIIIIIIIIII XA:i:1  MD:Z:6G16   NM:i:1  XM:i:1
## D00261:444:CBEUGANXX:3:2102:10304:18281  0   III 1836230 255 23M *   0   0   GATACTCTGACAGTCGAAGCACC IIIIIIIIIIIIIIIIIIIIIII XA:i:1  MD:Z:3T19   NM:i:1  XM:i:1
## D00261:444:CBEUGANXX:3:1308:19487:56966  0   I   9228741 255 23M *   0   0   GTCACAGACGCGAAGATCGTGCC IIIIIIIIIIIIIIIIIIIIIII XA:i:0  MD:Z:23 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:2109:11156:62889  0   I   2590445 255 23M *   0   0   GAAGATCGTGCCGTATGAGACGG IIIIIIIIIIIIIIIIIIIIIII XA:i:0  MD:Z:23 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:2303:6343:92983   16  I   556397  255 30M *   0   0   CGTCCAACAGTCCGTGCTCGGTAAAGGAAT  IIIIIIIIIIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:30 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:1209:9868:35563   0   I   6512832 255 21M *   0   0   GCTCAAAGATCTTCATGCTGT   IIIIIIIIIIIIIIIIIIIII   XA:i:0  MD:Z:21 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:2214:17713:20815  4   *   0   0   *   *   0   0   GTTCACTAATCCGCGGAGACTCTT    IIIIIIIIIIIIIIIIIIIIIIII    XM:i:0
## D00261:444:CBEUGANXX:3:2210:17753:50480  0   I   5012294 255 26M *   0   0   GTTCTAATTCGAATGTAGGGTTCTTT  IIIIIIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:26 NM:i:0  XM:i:1

Each remaining line (we’re just looking at the last 10 here) corresponds to an alignment between an input sequence and a reference. The first 11 columns are required (followed by optional columns) and contain the following information:

  1. qname - read name
  2. flag - see explain-flags website, inserted below
  3. rname - reference name
  4. pos - reference position (starting at 1)
  5. mapq - mapping quality, in Phred, probability of having found the correct mapping to the reference
  6. cigar - encoded representation of the alignment
  7. rnext - reference name where the other pair aligned (only for ‘paired-end’)
  8. pnext - reference position where the other pair aligned (only for ‘paired-end’)
  9. tlen - length of the full fragment (only for ‘paired-end’)
  10. seq - read sequence
  11. qual - read quality

Let’s have a look at all the flags that were found in our sam file:

grep -v '^@' Pure_EV_r1_100k.aligned.sam | cut -f 2 | sort | uniq
## 0
## 16
## 4
  • Can you paste these values into the explain-flags website above to find out what they mean?


The binary version of SAM: BAM

SAM files are stored in plain text, and we can view their contents directly. But, they take up a lot of space. The equivalent binary format is called BAM, and has the advantage of being compressed and is better suited for further processing. But, we need programs like samtools to view the contents of a BAM file.

The program samtools has a series of internal commands, each allowing a variety of processes on SAM (or BAM) files.

samtools

Typing samtools shows us the available commands, and choosing one gives us the options available.

## 
## Program: samtools (Tools for alignments in the SAM format)
## Version: 1.21 (using htslib 1.21)
## 
## Usage:   samtools <command> [options]
## 
## Commands:
##   -- Indexing
##      dict           create a sequence dictionary file
##      faidx          index/extract FASTA
##      fqidx          index/extract FASTQ
##      index          index alignment
## 
##   -- Editing
##      calmd          recalculate MD/NM tags and '=' bases
##      fixmate        fix mate information
##      reheader       replace BAM header
##      targetcut      cut fosmid regions (for fosmid pool only)
##      addreplacerg   adds or replaces RG tags
##      markdup        mark duplicates
##      ampliconclip   clip oligos from the end of reads
## 
##   -- File operations
##      collate        shuffle and group alignments by name
##      cat            concatenate BAMs
##      consensus      produce a consensus Pileup/FASTA/FASTQ
##      merge          merge sorted alignments
##      mpileup        multi-way pileup
##      sort           sort alignment file
##      split          splits a file by read group
##      quickcheck     quickly check if SAM/BAM/CRAM file appears intact
##      fastq          converts a BAM to a FASTQ
##      fasta          converts a BAM to a FASTA
##      import         Converts FASTA or FASTQ files to SAM/BAM/CRAM
##      reference      Generates a reference from aligned data
##      reset          Reverts aligner changes in reads
## 
##   -- Statistics
##      bedcov         read depth per BED region
##      coverage       alignment depth and percent coverage
##      depth          compute the depth
##      flagstat       simple stats
##      idxstats       BAM index stats
##      cram-size      list CRAM Content-ID and Data-Series sizes
##      phase          phase heterozygotes
##      stats          generate stats (former bamcheck)
##      ampliconstats  generate amplicon specific stats
## 
##   -- Viewing
##      flags          explain BAM flags
##      head           header viewer
##      tview          text alignment viewer
##      view           SAM<->BAM<->CRAM conversion
##      depad          convert padded BAM to unpadded BAM
##      samples        list the samples in a set of SAM/BAM/CRAM files
## 
##   -- Misc
##      help [cmd]     display this help message or help for [cmd]
##      version        detailed version information

To start out, let’s look at one of the most widely used options, viewing the contents of a file:

samtools view Pure_EV_r1_100k.aligned.sam | head
## D00261:444:CBEUGANXX:3:1306:11355:85767  4   *   0   0   *   *   0   0   TTCTGCCCAGTGCTCTGAATGTCAAAG IIIIIIIIIIIIIIIIIIIIIIIIIII XM:i:0
## D00261:444:CBEUGANXX:3:1105:9313:2248    16  II  2384332 255 22M *   0   0   TTTACGACCACGACGCTTCCAC  IIIIIIIIIIIIIIIIIIIIII  XA:i:1  MD:Z:5A16   NM:i:1  XM:i:1
## D00261:444:CBEUGANXX:3:1208:12895:95475  16  I   553645  255 22M *   0   0   ACCCGACAGAGTCACATATCTC  IIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:22 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:1203:1482:96832   16  I   464569  255 24M *   0   0   CCAGCTATTCCAATCCCAACCTTC    IIIIIIIIIIIIIIIIIIIIIIII    XA:i:0  MD:Z:24 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:1214:9191:67596   4   *   0   0   *   *   0   0   AGAAGACCCTGTTGAGCTTGACTC    IIIIIIIIIIIIIIIIIIIIIIII    XM:i:0
## D00261:444:CBEUGANXX:3:2314:8667:62576   0   I   9467611 255 23M *   0   0   GTACCGACTTGCAAAGAATCGCC IIIIIIIIIIIIIIIIIIIIIII XA:i:0  MD:Z:23 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:1203:14699:72177  0   I   5014506 255 20M *   0   0   GGCGATCGTGGATTTGGCGG    IIIIIIIIIIIIIIIIIIII    XA:i:0  MD:Z:20 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:2315:17872:58338  0   II  5699455 255 24M *   0   0   GTACTTTGAAGATCGCTCCTCCGA    IIIIIIIIIIIIIIIIIIIIIIII    XA:i:0  MD:Z:24 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:1105:14256:2373   0   I   9467129 255 23M *   0   0   GAAGATCGTGCCGTATGAGACGG IIIIIIIIIIIIIIIIIIIIIII XA:i:0  MD:Z:23 NM:i:0  XM:i:1
## D00261:444:CBEUGANXX:3:2312:3892:98535   16  II  5328544 255 19M *   0   0   CTACACTTCAGCCGCCGGG IIIIIIIIIIIIIIIIIII XA:i:1  MD:Z:16T2   NM:i:1  XM:i:1
  • The last command is sightly different than simply using the standard Linux command head Pure_EV_r1_100k.aligned.sam. Can you see why?

Although samtools view has many options, lets just use it to convert to the compressed binary BAM format. We can then see that the same view command still works.

samtools view -b Pure_EV_r1_100k.aligned.sam > Pure_EV_r1_100k.aligned.bam

ls -l Pure_EV_r1_100k.aligned.*
## -rw-r--r--@ 1 cei  staff   2993644 May 18 18:23 Pure_EV_r1_100k.aligned.bam
## -rw-r--r--@ 1 cei  staff  10574594 May 18 18:23 Pure_EV_r1_100k.aligned.bwt
## -rw-r--r--@ 1 cei  staff  14004776 May 18 18:23 Pure_EV_r1_100k.aligned.sam

As you can see above, the same file takes much less space in BAM format. In general you should try to avoid keeping SAM files: always convert them to BAM.


A Genome Browser example: IGV

If we want to view the alignment on a genome browser such as IGV (see below) we first need to sort our alignment. This means sorting the reads according to their mapping position, and not according to their position in the raw sequencing file.

samtools sort Pure_EV_r1_100k.aligned.bam > Pure_EV_r1_100k.sorted.bam

samtools view Pure_EV_r1_100k.aligned.bam | cut -f 2-4 | head
## 4    *   0
## 16   II  2384332
## 16   I   553645
## 16   I   464569
## 4    *   0
## 0    I   9467611
## 0    I   5014506
## 0    II  5699455
## 0    I   9467129
## 16   II  5328544
samtools view Pure_EV_r1_100k.sorted.bam | cut -f 2-4 | head
## 0    I   26488
## 0    I   26488
## 0    I   26634
## 0    I   26635
## 0    I   26665
## 0    I   26665
## 0    I   26665
## 0    I   26665
## 0    I   26665
## 0    I   26665
  • Can you see what the sorting has achieved?

Some programs will need you to further index these sorted BAM files, which is achieved as follows:

samtools index Pure_EV_r1_100k.sorted.bam

ls -l Pure_EV_r1_100k.sorted.*
## -rw-r--r--@ 1 cei  staff  1810811 May 18 18:23 Pure_EV_r1_100k.sorted.bam
## -rw-r--r--@ 1 cei  staff    52080 May 18 18:23 Pure_EV_r1_100k.sorted.bam.bai

Several genome browsers can directly import alignments in sorted BAM format, so we can see where our aligned reads are piling up across the chromosomes. One example of such a browser is (IGV), which you can download and run locally for best performance and flexibility.

There is also a web version of the IGV app. I have inserted an instance of this below, having loaded the following files (which are just small variations of the files we have been using in this practical):

If you want to use your own browser, with everything pre-loaded, you can also use this link



Other tools

bedtools: a powerful toolset for genome arithmetic, including converting bam to different formats, generating coverage tracks, finding overlaps between reads and annotation, etc. https://bedtools.readthedocs.io/en/latest/index.html

I’ve used bedtools to produce separate coverage tracks for reads that map to positive or negative strand. The below is part of a bash script:

bedtools genomecov -ibam $bamFile -bg -strand '+' > $outBase.us
bedtools genomecov -ibam $bamFile -bg -strand '-' | sed -r 's/(\S+)$/-\1/' >> $outBase.us
bedtools sort -i $outBase.us | gzip -c > $outBase.bedGraph.gz
rm $outBase.us

More help on the genomecov command: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html