Introduction to Bioconductor with Biostrings

Bioconductor is an international initiative that provides a series of libraries or packages that allow us to perform biologically relevant analyses within R. You can read more about this project at: www.bioconductor.org

In this practical we will start working with a Bioconductor package designed for manipulating biological sequences. Biostrings is designed to be able to quickly and efficiently work with very long strings of characters (DNA or RNA sequences). One advantage is that the syntax will be the same as with regular R functions, so what you have already learned can immediately be put to use with new functionality.

Installing Bioconductor and Biostrings

Remember to check out the brief installation tutorial

First check that you do in fact have Biostrings installed:

library(Biostrings)

If you get an error, please stop and make sure you install it first.

Very briefly, you would do the following:

BiocManager::install()

BiocManager::install(c("Biostrings","Biobase"))

The complete list of packages that can be installed through Bioconductor is available at: packages

If you explore them a bit, you will quickly realise that they are way too many to try to install and learn about all of them. More realistically, when you have a particular problem or type of data, that is the time to see what package is available that could help you out.

It is also important to mention that Bioconductor versions are closely tied to certain versions of R. So, if you want to have the most recent version of Bioconductor you need to have the most recent version of R. New versions of Bioconductor are released every six months.

Manipulating sequences with Biostrings

Let’s see what Biostrings can offer us. Let’s load the Biostrings package (library), and also load Biobase (adds some basic functions):

library(Biobase)
library(Biostrings)

Biostrings defines a new class of object, called XString. These can be divided into BString (generic strings), DNAString, RNAString or AAString. Let’s see how to work with a DNAString.

dnaSeq <- DNAString("TTCAGATCTAGTTCGTGTGTGACTGATGATCTGTCACACGTTTTTCTGATCTTCTGACTAGTCGAT")
dnaSeq

## 66-letter DNAString object
## seq: TTCAGATCTAGTTCGTGTGTGACTGATGATCTGTCACACGTTTTTCTGATCTTCTGACTAGTCGAT

class(dnaSeq)

## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"

As you might expect from working with R, we can access part of the sequence using [ ]:

dnaSeq[4]

## 1-letter DNAString object
## seq: A

dnaSeq[4:10]

## 7-letter DNAString object
## seq: AGATCTA

There are many functions that work with XString objects, some are new, and some are old friends:

length(dnaSeq)

## [1] 66

head(dnaSeq)

## 6-letter DNAString object
## seq: TTCAGA

tail(dnaSeq)

## 6-letter DNAString object
## seq: GTCGAT

reverseComplement(dnaSeq)

## 66-letter DNAString object
## seq: ATCGACTAGTCAGAAGATCAGAAAAACGTGTGACAGATCATCAGTCACACACGAACTAGATCTGAA

alphabetFrequency(dnaSeq)

##  A  C  G  T  M  R  W  S  Y  K  V  H  D  B  N  -  +  . 
## 12 13 14 27  0  0  0  0  0  0  0  0  0  0  0  0  0  0

As before, we really do not want to have to type our sequences into R. Biostrings provides functions for reading and writing to FASTA files (one of the most common file format to store sequences).

For this exercise, you can download a file that has some sequences from NCBI’s UniVec Database, which is a common resource to identify potential contaminants resulting from vectors, linkers and primers commonly used during cloning experiments: UniVec_subset.fa

In many high-throughput sequencing experiments we will want to check that we don’t have contamination of sequences matching to the UniVec database.

Make sure to save the file in your working directory.

dnaSeqs <- readDNAStringSet("UniVec_subset.fa")
dnaSeqs

## DNAStringSet object of length 404:
##       width seq                                                                                     names               
##   [1]  2736 CTCGGGCCGTCTCTTGGGCTTGATCGGCCTTCTTGCGCATCT...CGTCTCTTGGGCTTGATCGGCCTTCTTGCGCATCTCACGCGC univec|X66730.1:1...
##   [2]  4410 TTCTCATGTTTGACAGCTTATCATCGATAAGCTTTAATGCGG...GTTTGACAGCTTATCATCGATAAGCTTTAATGCGGTAGTTTA univec|J01749.1:1...
##   [3]  5292 GCCTCGGCCTCTGCATAAATAAAAAAAATTAGTCAGCCATGG...CCTCTGCATAAATAAAAAAAATTAGTCAGCCATGGGGCGGAG univec|J02400.1:1...
##   [4]  5435 GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGA...ATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCT univec|J02482.1:1...
##   [5]  6456 AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCT...CTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCC univec|V00604.2:1...
##   ...   ... ...
## [400]   624 GAACTTATCGGCACTGACGGTTACCTTGTTCTGCGCTGGCTC...CGATGCCCTTGAGAGCCTTCAACCCAGTCAGCTCCTTCCGGT univec|M99566.1:1...
## [401]   258 GTTTTCCCAGTCACGACGTTGTAAAACGACGGCCAGTATGAG...CGCCTGCAGGCATGCCGTAATCATGGTCATAGCTGTTTCCTG univec|D16587.1:1...
## [402]   319 AGATCTCGATCCCGCGAAATTAATACGACTCACTATAGGGGA...CTGCTAACAAAGCCCGAAAGGAAGCTGAGTTGGCTGCTGCCA univec|NGB00603.1...
## [403]   224 ACGGCCAGTGAATTGTAATACGACTCACTATAGGGCGAATTG...GCCATAAGGGCCTGATCCTTCGAGGGGGGGCCCGGTACCAGC univec|NGB00075.1...
## [404]   234 GTTGTAAAACGACGGCCAGTGAATTCGAGCTCGGTACCGTAA...TCGACCTGCAGGCATGCAAGCTTGGCGTAATCATGGTCATAG univec|NGB00153.1...

A DNAStringSet, as the name implies, is a set (list in this case) of DNAString objects. These are some useful functions that allow us to manipulate them. You can look up the help page for any of them to learn more.

length(dnaSeqs)

## [1] 404

head(names(dnaSeqs))

## [1] "univec|X66730.1:1-2687-49 B.bronchiseptica plasmid pBBR1 genes for mobilization and replication"
## [2] "univec|J01749.1:1-4361-49 Cloning vector pBR322"                                                
## [3] "univec|J02400.1:1-5243-49 Simian virus 40"                                                      
## [4] "univec|J02482.1:1-5386-49 Coliphage phi-X174"                                                   
## [5] "univec|V00604.2:1-6407-49 Phage M13 genome"                                                     
## [6] "univec|J02459.1:1-48502-49 Enterobacteria phage lambda"

range(width(dnaSeqs))

## [1]   200 48551

summary(width(dnaSeqs))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   200.0   274.0   591.5  1119.4  1027.8 48551.0

subseq(dnaSeqs,11,30)

## DNAStringSet object of length 404:
##       width seq                                                                                     names               
##   [1]    20 CTCTTGGGCTTGATCGGCCT                                                                    univec|X66730.1:1...
##   [2]    20 TGACAGCTTATCATCGATAA                                                                    univec|J01749.1:1...
##   [3]    20 CTGCATAAATAAAAAAAATT                                                                    univec|J02400.1:1...
##   [4]    20 GCTTCCATGACGCAGAAGTT                                                                    univec|J02482.1:1...
##   [5]    20 CTATTAGTAGAATTGATGCC                                                                    univec|V00604.2:1...
##   ...   ... ...
## [400]    20 GCACTGACGGTTACCTTGTT                                                                    univec|M99566.1:1...
## [401]    20 TCACGACGTTGTAAAACGAC                                                                    univec|D16587.1:1...
## [402]    20 CCCGCGAAATTAATACGACT                                                                    univec|NGB00603.1...
## [403]    20 AATTGTAATACGACTCACTA                                                                    univec|NGB00075.1...
## [404]    20 GACGGCCAGTGAATTCGAGC                                                                    univec|NGB00153.1...

head(alphabetFrequency(dnaSeqs, as.prob=TRUE))

##              A         C         G         T M R W S Y K V H D B N - + .
## [1,] 0.1619152 0.3333333 0.3132310 0.1915205 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0.2256236 0.2761905 0.2591837 0.2390023 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0.2898715 0.2099395 0.1987906 0.3013983 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [4,] 0.2397424 0.2147194 0.2323827 0.3131555 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [5,] 0.2459727 0.2030669 0.2047708 0.3461896 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [6,] 0.2542069 0.2342279 0.2643406 0.2472246 0 0 0 0 0 0 0 0 0 0 0 0 0 0

head(dinucleotideFrequency(dnaSeqs, as.prob=TRUE))

##              AA         AC         AG         AT         CA         CC          CG         CT         GA         GC
## [1,] 0.03912249 0.04753199 0.04387569 0.03144424 0.06069470 0.09872029 0.106398537 0.06727605 0.04643510 0.13199269
## [2,] 0.05828986 0.05284645 0.05511454 0.05919710 0.06917668 0.06645498 0.075073713 0.06554774 0.05398049 0.08664096
## [3,] 0.10224910 0.05518806 0.06577207 0.06671707 0.08089208 0.04725005 0.005481005 0.07635608 0.04536005 0.05121905
## [4,] 0.07305852 0.04858300 0.04692676 0.07121826 0.04784689 0.04232609 0.049687155 0.07489879 0.06091277 0.06422525
## [5,] 0.08350116 0.04213788 0.03888459 0.08148722 0.04198296 0.04151820 0.041673122 0.07776917 0.04492641 0.04972889
## [6,] 0.07610711 0.05301751 0.05627188 0.06879506 0.06624099 0.05147271 0.064243048 0.05227600 0.06710608 0.07454171
##              GG         GT         TA         TC         TG         TT
## [1,] 0.08994516 0.04497258 0.01572212 0.05484461 0.07312614 0.04789762
## [2,] 0.06486732 0.05375369 0.04422772 0.07031073 0.06418689 0.06033114
## [3,] 0.05329805 0.04876205 0.06142506 0.05632206 0.07408807 0.10962011
## [4,] 0.04711078 0.06017667 0.05796835 0.05962459 0.08851675 0.10691940
## [5,] 0.05050349 0.05964369 0.07544539 0.06971340 0.07374129 0.12734314
## [6,] 0.06564367 0.05705458 0.04475798 0.05520082 0.07816684 0.06910402

reverseComplement(dnaSeqs)

## DNAStringSet object of length 404:
##       width seq                                                                                     names               
##   [1]  2736 GCGCGTGAGATGCGCAAGAAGGCCGATCAAGCCCAAGAGACG...AGATGCGCAAGAAGGCCGATCAAGCCCAAGAGACGGCCCGAG univec|X66730.1:1...
##   [2]  4410 TAAACTACCGCATTAAAGCTTATCGATGATAAGCTGTCAAAC...CCGCATTAAAGCTTATCGATGATAAGCTGTCAAACATGAGAA univec|J01749.1:1...
##   [3]  5292 CTCCGCCCCATGGCTGACTAATTTTTTTTATTTATGCAGAGG...CCATGGCTGACTAATTTTTTTTATTTATGCAGAGGCCGAGGC univec|J02400.1:1...
##   [4]  5435 AGAAATATCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGAT...TCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGATAAAACTC univec|J02482.1:1...
##   [5]  6456 GGGCGCGAGCTGAAAAGGTGGCATCAATTCTACTAATAGTAG...AGCTGAAAAGGTGGCATCAATTCTACTAATAGTAGTAGCGTT univec|V00604.2:1...
##   ...   ... ...
## [400]   624 ACCGGAAGGAGCTGACTGGGTTGAAGGCTCTCAAGGGCATCG...GAGCCAGCGCAGAACAAGGTAACCGTCAGTGCCGATAAGTTC univec|M99566.1:1...
## [401]   258 CAGGAAACAGCTATGACCATGATTACGGCATGCCTGCAGGCG...CTCATACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAAC univec|D16587.1:1...
## [402]   319 TGGCAGCAGCCAACTCAGCTTCCTTTCGGGCTTTGTTAGCAG...TCCCCTATAGTGAGTCGTATTAATTTCGCGGGATCGAGATCT univec|NGB00603.1...
## [403]   224 GCTGGTACCGGGCCCCCCCTCGAAGGATCAGGCCCTTATGGC...CAATTCGCCCTATAGTGAGTCGTATTACAATTCACTGGCCGT univec|NGB00075.1...
## [404]   234 CTATGACCATGATTACGCCAAGCTTGCATGCCTGCAGGTCGA...TTACGGTACCGAGCTCGAATTCACTGGCCGTCGTTTTACAAC univec|NGB00153.1...

If you really want to (you don’t really really want to), you can convert a DNAString back into normal text:

as.character(dnaSeqs[[1]][1:20])

## [1] "CTCGGGCCGTCTCTTGGGCT"

Finally, all Bioconductor packages have one or more manuals called vignettes that are installed alongside. To access a list of available vignettes (corresponding to all currently loaded packages), use the function:

openVignette()

Then you can choose the number of the manual you would like to read, which should open as a PDF. For instance, you should see one called “Biostrings - Biostrings Quick Overview” that gives you a brief list of the functionality that Biostrings provides you with.

Basic exercises

Using the dnaSeqs object:

What is the length of the shortest sequence?
And the longest sequence?
How many sequences are longer than 1000 nt?
How many sequences are longer than 300 but shorter than 800 nt?
- hint: you can combine two logical vectors with & (‘and’) or | (‘or’)
Plot a histogram of the length of all UniVec sequences (using hist).

Additional exercises

Note: all these exercises can be solved using exclusively functions covered in these practicals.

What is the percent of Gs or Cs of the shortest sequence (hints: ?alphabetFrequency, order or which.min)
Get the first 3 nucleotides of each sequence. Which is the most frequent trinucleotide among them? (hints: table and sort)
Now get the last 3 nucleotides of each sequence. Try to do this in more than one way.
Which sequence has the most CG dinucleotides? What if you take length into account (i.e. sequence with highest frequency of CG dinucleotides)?
Show the nucleotide frequence of the first 100 sequences with a plot (barplot)

Advanced exercise

Repeat the same barplot as above, but now ordering those first 100 sequences by C+G content (hint: order).