Bioconductor is an international initiative that provides a series of libraries or packages that allow us to perform biologically relevant analyses within R. You can read more about this project at: www.bioconductor.org
In this practical we will start working with a Bioconductor package
designed for manipulating biological sequences. Biostrings
is designed to be able to quickly and efficiently work with very long
strings of characters (DNA or RNA sequences). One advantage is that the
syntax will be the same as with regular R functions, so what you have
already learned can immediately be put to use with new
functionality.
Remember to check out the brief installation tutorial
First check that you do in fact have Biostrings installed:
If you get an error, please stop and make sure you install it first.
Very briefly, you would do the following:
The complete list of packages that can be installed through Bioconductor is available at: packages
If you explore them a bit, you will quickly realise that they are way too many to try to install and learn about all of them. More realistically, when you have a particular problem or type of data, that is the time to see what package is available that could help you out.
It is also important to mention that Bioconductor versions are closely tied to certain versions of R. So, if you want to have the most recent version of Bioconductor you need to have the most recent version of R. New versions of Bioconductor are released every six months.
Let’s see what Biostrings can offer us. Let’s load the Biostrings package (library), and also load Biobase (adds some basic functions):
Biostrings defines a new class of object, called XString. These can be divided into BString (generic strings), DNAString, RNAString or AAString. Let’s see how to work with a DNAString.
## 66-letter DNAString object
## seq: TTCAGATCTAGTTCGTGTGTGACTGATGATCTGTCACACGTTTTTCTGATCTTCTGACTAGTCGAT
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
As you might expect from working with R, we can access part of the sequence using [ ]:
## 1-letter DNAString object
## seq: A
## 7-letter DNAString object
## seq: AGATCTA
There are many functions that work with XString objects, some are new, and some are old friends:
## [1] 66
## 6-letter DNAString object
## seq: TTCAGA
## 6-letter DNAString object
## seq: GTCGAT
## 66-letter DNAString object
## seq: ATCGACTAGTCAGAAGATCAGAAAAACGTGTGACAGATCATCAGTCACACACGAACTAGATCTGAA
## A C G T M R W S Y K V H D B N - + .
## 12 13 14 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0
As before, we really do not want to have to type our sequences into R. Biostrings provides functions for reading and writing to FASTA files (one of the most common file format to store sequences).
For this exercise, you can download a file that has some sequences from NCBI’s UniVec Database, which is a common resource to identify potential contaminants resulting from vectors, linkers and primers commonly used during cloning experiments: UniVec_subset.fa
In many high-throughput sequencing experiments we will want to check that we don’t have contamination of sequences matching to the UniVec database.
Make sure to save the file in your working directory.
## DNAStringSet object of length 404:
## width seq names
## [1] 2736 CTCGGGCCGTCTCTTGGGCTTGATCGGCCTTCTTGCGCATCT...CGTCTCTTGGGCTTGATCGGCCTTCTTGCGCATCTCACGCGC univec|X66730.1:1...
## [2] 4410 TTCTCATGTTTGACAGCTTATCATCGATAAGCTTTAATGCGG...GTTTGACAGCTTATCATCGATAAGCTTTAATGCGGTAGTTTA univec|J01749.1:1...
## [3] 5292 GCCTCGGCCTCTGCATAAATAAAAAAAATTAGTCAGCCATGG...CCTCTGCATAAATAAAAAAAATTAGTCAGCCATGGGGCGGAG univec|J02400.1:1...
## [4] 5435 GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGA...ATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCT univec|J02482.1:1...
## [5] 6456 AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCT...CTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCC univec|V00604.2:1...
## ... ... ...
## [400] 624 GAACTTATCGGCACTGACGGTTACCTTGTTCTGCGCTGGCTC...CGATGCCCTTGAGAGCCTTCAACCCAGTCAGCTCCTTCCGGT univec|M99566.1:1...
## [401] 258 GTTTTCCCAGTCACGACGTTGTAAAACGACGGCCAGTATGAG...CGCCTGCAGGCATGCCGTAATCATGGTCATAGCTGTTTCCTG univec|D16587.1:1...
## [402] 319 AGATCTCGATCCCGCGAAATTAATACGACTCACTATAGGGGA...CTGCTAACAAAGCCCGAAAGGAAGCTGAGTTGGCTGCTGCCA univec|NGB00603.1...
## [403] 224 ACGGCCAGTGAATTGTAATACGACTCACTATAGGGCGAATTG...GCCATAAGGGCCTGATCCTTCGAGGGGGGGCCCGGTACCAGC univec|NGB00075.1...
## [404] 234 GTTGTAAAACGACGGCCAGTGAATTCGAGCTCGGTACCGTAA...TCGACCTGCAGGCATGCAAGCTTGGCGTAATCATGGTCATAG univec|NGB00153.1...
A DNAStringSet, as the name implies, is a set (list in this case) of DNAString objects. These are some useful functions that allow us to manipulate them. You can look up the help page for any of them to learn more.
## [1] 404
## [1] "univec|X66730.1:1-2687-49 B.bronchiseptica plasmid pBBR1 genes for mobilization and replication"
## [2] "univec|J01749.1:1-4361-49 Cloning vector pBR322"
## [3] "univec|J02400.1:1-5243-49 Simian virus 40"
## [4] "univec|J02482.1:1-5386-49 Coliphage phi-X174"
## [5] "univec|V00604.2:1-6407-49 Phage M13 genome"
## [6] "univec|J02459.1:1-48502-49 Enterobacteria phage lambda"
## [1] 200 48551
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 200.0 274.0 591.5 1119.4 1027.8 48551.0
## DNAStringSet object of length 404:
## width seq names
## [1] 20 CTCTTGGGCTTGATCGGCCT univec|X66730.1:1...
## [2] 20 TGACAGCTTATCATCGATAA univec|J01749.1:1...
## [3] 20 CTGCATAAATAAAAAAAATT univec|J02400.1:1...
## [4] 20 GCTTCCATGACGCAGAAGTT univec|J02482.1:1...
## [5] 20 CTATTAGTAGAATTGATGCC univec|V00604.2:1...
## ... ... ...
## [400] 20 GCACTGACGGTTACCTTGTT univec|M99566.1:1...
## [401] 20 TCACGACGTTGTAAAACGAC univec|D16587.1:1...
## [402] 20 CCCGCGAAATTAATACGACT univec|NGB00603.1...
## [403] 20 AATTGTAATACGACTCACTA univec|NGB00075.1...
## [404] 20 GACGGCCAGTGAATTCGAGC univec|NGB00153.1...
## A C G T M R W S Y K V H D B N - + .
## [1,] 0.1619152 0.3333333 0.3132310 0.1915205 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 0.2256236 0.2761905 0.2591837 0.2390023 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 0.2898715 0.2099395 0.1987906 0.3013983 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [4,] 0.2397424 0.2147194 0.2323827 0.3131555 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [5,] 0.2459727 0.2030669 0.2047708 0.3461896 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [6,] 0.2542069 0.2342279 0.2643406 0.2472246 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## AA AC AG AT CA CC CG CT GA GC
## [1,] 0.03912249 0.04753199 0.04387569 0.03144424 0.06069470 0.09872029 0.106398537 0.06727605 0.04643510 0.13199269
## [2,] 0.05828986 0.05284645 0.05511454 0.05919710 0.06917668 0.06645498 0.075073713 0.06554774 0.05398049 0.08664096
## [3,] 0.10224910 0.05518806 0.06577207 0.06671707 0.08089208 0.04725005 0.005481005 0.07635608 0.04536005 0.05121905
## [4,] 0.07305852 0.04858300 0.04692676 0.07121826 0.04784689 0.04232609 0.049687155 0.07489879 0.06091277 0.06422525
## [5,] 0.08350116 0.04213788 0.03888459 0.08148722 0.04198296 0.04151820 0.041673122 0.07776917 0.04492641 0.04972889
## [6,] 0.07610711 0.05301751 0.05627188 0.06879506 0.06624099 0.05147271 0.064243048 0.05227600 0.06710608 0.07454171
## GG GT TA TC TG TT
## [1,] 0.08994516 0.04497258 0.01572212 0.05484461 0.07312614 0.04789762
## [2,] 0.06486732 0.05375369 0.04422772 0.07031073 0.06418689 0.06033114
## [3,] 0.05329805 0.04876205 0.06142506 0.05632206 0.07408807 0.10962011
## [4,] 0.04711078 0.06017667 0.05796835 0.05962459 0.08851675 0.10691940
## [5,] 0.05050349 0.05964369 0.07544539 0.06971340 0.07374129 0.12734314
## [6,] 0.06564367 0.05705458 0.04475798 0.05520082 0.07816684 0.06910402
## DNAStringSet object of length 404:
## width seq names
## [1] 2736 GCGCGTGAGATGCGCAAGAAGGCCGATCAAGCCCAAGAGACG...AGATGCGCAAGAAGGCCGATCAAGCCCAAGAGACGGCCCGAG univec|X66730.1:1...
## [2] 4410 TAAACTACCGCATTAAAGCTTATCGATGATAAGCTGTCAAAC...CCGCATTAAAGCTTATCGATGATAAGCTGTCAAACATGAGAA univec|J01749.1:1...
## [3] 5292 CTCCGCCCCATGGCTGACTAATTTTTTTTATTTATGCAGAGG...CCATGGCTGACTAATTTTTTTTATTTATGCAGAGGCCGAGGC univec|J02400.1:1...
## [4] 5435 AGAAATATCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGAT...TCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGATAAAACTC univec|J02482.1:1...
## [5] 6456 GGGCGCGAGCTGAAAAGGTGGCATCAATTCTACTAATAGTAG...AGCTGAAAAGGTGGCATCAATTCTACTAATAGTAGTAGCGTT univec|V00604.2:1...
## ... ... ...
## [400] 624 ACCGGAAGGAGCTGACTGGGTTGAAGGCTCTCAAGGGCATCG...GAGCCAGCGCAGAACAAGGTAACCGTCAGTGCCGATAAGTTC univec|M99566.1:1...
## [401] 258 CAGGAAACAGCTATGACCATGATTACGGCATGCCTGCAGGCG...CTCATACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAAC univec|D16587.1:1...
## [402] 319 TGGCAGCAGCCAACTCAGCTTCCTTTCGGGCTTTGTTAGCAG...TCCCCTATAGTGAGTCGTATTAATTTCGCGGGATCGAGATCT univec|NGB00603.1...
## [403] 224 GCTGGTACCGGGCCCCCCCTCGAAGGATCAGGCCCTTATGGC...CAATTCGCCCTATAGTGAGTCGTATTACAATTCACTGGCCGT univec|NGB00075.1...
## [404] 234 CTATGACCATGATTACGCCAAGCTTGCATGCCTGCAGGTCGA...TTACGGTACCGAGCTCGAATTCACTGGCCGTCGTTTTACAAC univec|NGB00153.1...
If you really want to (you don’t really really want to), you can convert a DNAString back into normal text:
## [1] "CTCGGGCCGTCTCTTGGGCT"
Finally, all Bioconductor packages have one or more manuals called vignettes that are installed alongside. To access a list of available vignettes (corresponding to all currently loaded packages), use the function:
Then you can choose the number of the manual you would like to read, which should open as a PDF. For instance, you should see one called “Biostrings - Biostrings Quick Overview” that gives you a brief list of the functionality that Biostrings provides you with.
Using the dnaSeqs object:
hist).Note: all these exercises can be solved using exclusively functions covered in these practicals.
?alphabetFrequency, order or
which.min)table
and sort)barplot)order).