Bioconductor is an international initiative that provides a series of libraries or packages that allow us to perform biologically relevant analyses within R. You can read more about this project at: www.bioconductor.org
In this practical we will start working with a Bioconductor package designed for manipulating biological sequences. Biostrings is designed to be able to quickly and efficiently work with very long strings of characters (DNA or RNA sequences). One advantage is that the synthax will be the same as with regular R functions, so what you have already learned can immediately be put to use with new functionality.
Remember to check out the brief installation tutorial: practical00
First check that you do in fact have Biostrings installed:
library(Biostrings)If you get an error, please stop and make sure you install it first.
Very briefly, you would do the following:
BiocManager::install()
BiocManager::install("Biostrings")The complete list of packages that can be installed through Bioconductor is available at: packages
If you explore them a bit, you will quickly realise that they are way too many to try to install and learn about all of them. More realistically, when you have a particular problem or type of data, that is the time to see what package is available that could help you out.
It is also important to mention that Bioconductor versions are closely tied to certain versions of R. So, if you want to have the most recent version of Bioconductor you need to have the most recent version of R. New versions of Bioconductor are prepared every six months.
Let’s see what Biostrings can offer us. Let’s load the Biostrings package (library), and also load Biobase (adds some basic functions):
library(Biobase)
library(Biostrings)Biostrings defines a new class of object, called XString. These can be divided into BString (generic strings), DNAString, RNAString or AAString. Let’s see how to work with a DNAString.
dnaSeq <- DNAString("TTCAGATCTAGTTCGTGTGTGACTGATGATCTGTCACACGTTTTTCTGATCTTCTGACTAGTCGAT")
dnaSeq## 66-letter DNAString object
## seq: TTCAGATCTAGTTCGTGTGTGACTGATGATCTGTCACACGTTTTTCTGATCTTCTGACTAGTCGAT
class(dnaSeq)## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
As you might expect from working with R, we can access part of the sequence using [ ]:
dnaSeq[4]## 1-letter DNAString object
## seq: A
dnaSeq[4:10]## 7-letter DNAString object
## seq: AGATCTA
There are many functions that work with XString objects, some are new, and some are old friends:
length(dnaSeq)## [1] 66
head(dnaSeq)## 6-letter DNAString object
## seq: TTCAGA
tail(dnaSeq)## 6-letter DNAString object
## seq: GTCGAT
reverseComplement(dnaSeq)## 66-letter DNAString object
## seq: ATCGACTAGTCAGAAGATCAGAAAAACGTGTGACAGATCATCAGTCACACACGAACTAGATCTGAA
alphabetFrequency(dnaSeq)## A C G T M R W S Y K V H D B N - + .
## 12 13 14 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0
As before, we really do not want to have to type our sequences into R. Biostrings provides functions for reading and writing to FASTA files (one of the most common file format to store sequences). For this exercise, you can download a file that has some gene sequences from the bacterium E. coli: ecoliORFs.fa
Make sure to save the file in your working directory.
dnaSeqs <- readDNAStringSet("ecoliORFs.fa")
dnaSeqs## DNAStringSet object of length 11:
## width seq names
## [1] 66 ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA thrL
## [2] 2463 ATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAG...GCTGATCTGCTACGTACCCTCTCATGGAAGTTAGGAGTCTGA thrA
## [3] 933 ATGGTTAAAGTTTATGCCCCGGCTTCCAGTGCCAATATGAGCG...TGCCGGCTGGATACGGCGGGCGCACGAGTACTGGAAAACTAA thrB
## [4] 1287 ATGAAACTCTACAATCTGAAAGATCACAACGAGCAGGTCAGCT...GATTTTGCTGCGTTGCGTAAATTGATGATGAATCATCAGTAA thrC
## [5] 504 ATGCCGGGCAACAGCCCGCATTATGGGCGTTGGCCTCAACACG...AAAGTCATCGGGCATTATCTGAACATAAAACACTATCAATAA insB
## ... ... ...
## [7] 606 ATGGCAGAGAAATTTATCAAACACACAGGCCTGGTGGTTCCGC...ATTGCCGCTTATGAAGCAAAACAACCTGCGTTTATGAATTAA leuD
## [8] 1401 ATGGCTAAGACGTTATACGAAAAATTGTTCGACGCTCACGTTG...GCTGTGACCGGACATTTCGCCGACATTCGCAACATTAAATAA leuC
## [9] 1092 ATGTCGAAGAATTACCATATTGCCGTATTGCCGGGGGACGGTA...ATGGGCGATATCATTGCCCGCTATGTAGCAGAAGGGGTGTAA leuB
## [10] 1572 ATGAGCCAGCAAGTCATTATTTTCGATACCACATTGCGCGACG...CGCAAAGCTCAACACAACGAAAACAACAAGGAAACCGTGTGA leuA
## [11] 87 ATGACTCACATCGTTCGCTTTATCGGTCTACTACTACTAAACGCATCTTCTTTGCGCGGTAGACGAGTGAGCGGCATCCAGCATTAA leuL
A DNAStringSet, as the name implies, is a set (list in this case) of DNAString objects. These are some useful functions that allow us to manipulate them. You can look up the help page for any of them to learn more.
length(dnaSeqs)## [1] 11
names(dnaSeqs)## [1] "thrL" "thrA" "thrB" "thrC" "insB" "insA" "leuD" "leuC" "leuB" "leuA" "leuL"
width(dnaSeqs)## [1] 66 2463 933 1287 504 276 606 1401 1092 1572 87
subseq(dnaSeqs,11,30)## DNAStringSet object of length 11:
## width seq names
## [1] 20 TTAGCACCACCATTACCACC thrL
## [2] 20 TGAAGTTCGGCGGTACATCA thrA
## [3] 20 TTTATGCCCCGGCTTCCAGT thrB
## [4] 20 ACAATCTGAAAGATCACAAC thrC
## [5] 20 ACAGCCCGCATTATGGGCGT insB
## ... ... ...
## [7] 20 AATTTATCAAACACACAGGC leuD
## [8] 20 CGTTATACGAAAAATTGTTC leuC
## [9] 20 ATTACCATATTGCCGTATTG leuB
## [10] 20 AAGTCATTATTTTCGATACC leuA
## [11] 20 TCGTTCGCTTTATCGGTCTA leuL
dinucleotideFrequency(dnaSeqs)## AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
## [1,] 3 10 2 5 11 7 3 1 2 4 4 2 4 1 3 3
## [2,] 176 113 102 161 130 143 199 143 160 220 170 142 86 139 221 157
## [3,] 52 40 52 50 52 50 80 49 59 94 85 56 31 47 77 58
## [4,] 103 66 65 70 74 64 105 73 93 119 79 73 34 67 115 86
## [5,] 36 31 21 33 34 18 45 29 29 50 43 27 22 27 40 18
## [6,] 21 20 9 14 25 16 18 21 7 25 19 15 12 19 19 15
## [7,] 50 37 29 33 36 34 45 35 44 55 37 26 19 24 51 50
## [8,] 131 86 54 74 90 101 133 62 80 133 110 72 44 66 98 66
## [9,] 69 60 50 73 75 73 100 50 70 119 67 52 38 46 91 58
## [10,] 138 91 78 103 111 97 138 65 100 136 97 89 61 87 109 71
## [11,] 3 7 4 6 6 1 8 8 4 7 3 4 7 8 3 7
reverseComplement(dnaSeqs)## DNAStringSet object of length 11:
## width seq names
## [1] 66 TCAGCCCGCACCGTTACCTGTGGTAATGGTGATGGTGGTGGTAATGGTGGTGCTAATGCGTTTCAT thrL
## [2] 2463 TCAGACTCCTAACTTCCATGAGAGGGTACGTAGCAGATCAGCA...TGCATTTGCCACTGATGTACCGCCGAACTTCAACACTCGCAT thrA
## [3] 933 TTAGTTTTCCAGTACTCGTGCGCCCGCCGTATCCAGCCGGCAA...GCTCATATTGGCACTGGAAGCCGGGGCATAAACTTTAACCAT thrB
## [4] 1287 TTACTGATGATTCATCATCAATTTACGCAACGCAGCAAAATCG...GCTGACCTGCTCGTTGTGATCTTTCAGATTGTAGAGTTTCAT thrC
## [5] 504 TTATTGATAGTGTTTTATGTTCAGATAATGCCCGATGACTTTG...GTGTTGAGGCCAACGCCCATAATGCGGGCTGTTGCCCGGCAT insB
## ... ... ...
## [7] 606 TTAATTCATAAACGCAGGTTGTTTTGCTTCATAAGCGGCAATG...CGGAACCACCAGGCCTGTGTGTTTGATAAATTTCTCTGCCAT leuD
## [8] 1401 TTATTTAATGTTGCGAATGTCGGCGAAATGTCCGGTCACAGCA...AACGTGAGCGTCGAACAATTTTTCGTATAACGTCTTAGCCAT leuC
## [9] 1092 TTACACCCCTTCTGCTACATAGCGGGCAATGATATCGCCCATT...ACCGTCCCCCGGCAATACGGCAATATGGTAATTCTTCGACAT leuB
## [10] 1572 TCACACGGTTTCCTTGTTGTTTTCGTTGTGTTGAGCTTTGCGT...GTCGCGCAATGTGGTATCGAAAATAATGACTTGCTGGCTCAT leuA
## [11] 87 TTAATGCTGGATGCCGCTCACTCGTCTACCGCGCAAAGAAGATGCGTTTAGTAGTAGTAGACCGATAAAGCGAACGATGTGAGTCAT leuL
translate(dnaSeqs)## AAStringSet object of length 11:
## width seq names
## [1] 22 MKRISTTITTTITITTGNGAG* thrL
## [2] 821 MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKI...FYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV* thrA
## [3] 311 MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAET...CDKPETAQRVADWLGKNYLQNQEGFVHICRLDTAGARVLEN* thrB
## [4] 429 MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEI...AILGETLDLPKELAERADLPLLSHNLPADFAALRKLMMNHQ* thrC
## [5] 168 MPGNSPHYGRWPQHDFTSLKKLRPQSVTSRIQPGSDVIVCAEM...RHNLNLRQHLARLGRKSLSFSKSVELHDKVIGHYLNIKHYQ* insB
## ... ... ...
## [7] 202 MAEKFIKHTGLVVPLDAANVDTDAIIPKQFLQKVTRTGFGAHL...RFTIDAFRRHCMMNGLDSIGLTLQHDDAIAAYEAKQPAFMN* leuD
## [8] 467 MAKTLYEKLFDAHVVYEAENETPLLYIDRHLVHEVTSPQAFDG...STSNRNFEGRQGRGGRTHLVSPAMAAAAAVTGHFADIRNIK* leuC
## [9] 364 MSKNYHIAVLPGDGIGPEVMTQALKVLDAVRNRFAMRITTSHY...ERAINRALEEGIRTGDLARGAAAVSTDEMGDIIARYVAEGV* leuB
## [10] 524 MSQQVIIFDTTLRDGEQALQASLSVKEKLQIALALERMGVDVM...DIVESSAKAMVHVLNNIWRAAEVEKELQRKAQHNENNKETV* leuA
## [11] 29 MTHIVRFIGLLLLNASSLRGRRVSGIQH* leuL
If you really want to (you don’t really really want to), you can convert a DNAString back into normal text:
as.character(dnaSeqs[[1]][1:20])## [1] "ATGAAACGCATTAGCACCAC"
Finally, all Bioconductor packages have one or more manuals called vignettes that are installed alongside. To access a list of available vignettes (corresponding to all currently loaded packages), use the function:
openVignette()Then you can choose the number of the manual you would like to read, which should open as a PDF. For instance, you should see one called “Biostrings - Biostrings Quick Overview” that gives you a brief list of the functionality that Biostrings provides you with.
Note: all these exercises can be solved using exclusively functions covered in these practicals.
Using the dnaSeqs object:
?alphabetFrequency)table())barplot())To study the rpoA gene sequence in E. coli, I extracted the genomic region encoding rpoA but including an extra base at each end. The sequence is included in the following R code, ready to be copied into a script.
library(Biostrings)
rpoA <- DNAString("ATTACTCGTCAGCGATGCTTGCCGGTGGCCAGTTTTCCAGGCGCATGCCCAGAGACAGTCCACGGGAAGCCAGCACGTCTTTAATCTCAGTAAGAGATTTTTTACCAAGGTTAGGCGTTTTAAGGAGCTCAACCTCGGTACGCTGTACCAGATCACCGATATAGTGGATAGCTTCTGCTTTAAGGCAGTTAGCAGAGCGGACAGTCAATTCCAGATCGTCAACAGGGCGCAGCAGGATCGGATCGAACTCTGGTTTCTCTTCTTTCACTTCAGGCTGACGTACATCACGTAAGTCAACGAAAGCTTCCAGTTGTTCAGCCAGAATGGTTGCCGCACGACGAATCGCCTCTTCAGGATCGATTGTGCCGTTGGTTTCCATTTCGATGACCAGCTTGTCCAGGTCGGTACGCTGTTCTACACGCGCTGCTTCAACATTGTAGGCAATACGCTCCACAGGGCTGTAGCATGCGTCGACCAGCAGACGGCCGATTGGGCGCTCATCTTCTTCCGAATGAATTCGGGTAGAAGCCGGCACATAACCACGACCGCGCTGAACTTTGATACGCATGCTAATAGACGCGTTCTCATCGGTCAGGTGGCAGATCACGTGCTGCGGCTTGACGATTTCGACATCACCGTCGTGGGTGATATCGGCTGCAGTCACAGGGCCAATGCCAGATTTATTCAAGGTAAGAATAACTTCATCTTTGCCCTGAACTCTCACCGCCAGCCCTTTCAGGTTGAGCAGGATTTCCAGGATATCTTCCTGAACGCCTTCTTTGGTGCTGTACTCATGTAGTACACCATCAATCTCAACCTCGGTCACCGCGCAACCCGGCATCGATGAGAGCAGAATACGGCGCAGTGCGTTACCCAGAGTATGGCCAAAGCCACGCTCTAAAGGCTCAAGGGTCACCTTGGCGTGCGTCGAACTCACTTGCTCGATATCAACCAGGCGCGGTTTTAGAAACTCTGTCACAGAACCCTGCATA")Questions:
pie that shows the nucleotide fraction in the messenger RNA of rpoA