Sunday, 24 July 2016

Transcripts

Get transcripts data from Bioconductor

In this example, I will extract the 3’UTR length of every mouse RefSeq transcript.

I will use the GenomicFeatures package from Bioconductor to extract the 3’UTR information, and the dplyr package to handle the data.

library(GenomicFeatures)
library(dplyr)

Now, we need to load the data.

#refseq             <- makeTxDbFromUCSC(genome="mm10", tablename="refGene")

Since this function does not work at the moment (apparently something changed in the UCSC table), I will load the data from a file instead. You can download the data for the example by clicking here.

refseq             <- loadDb("mm10_refseq.sqlite")

Now we get the 3’UTRs of each transcript and the width of every range they are made of:

threeUTRs          <- threeUTRsByTranscript(refseq, use.names=TRUE)
length_threeUTRs   <- width(ranges(threeUTRs))
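To see what this step returns, here is a minimal sketch on a hand-made GRangesList. The transcript names (NM_A, NM_B) and coordinates are invented for illustration and do not come from the mm10 data. It shows that width() gives one value per range inside each transcript, so a transcript whose 3’UTR spans two exons has two widths that still need to be added up.

```r
library(GenomicRanges)

# Hypothetical toy data: NM_A has a 3'UTR split over two exons, NM_B has one.
toy_utrs <- GRangesList(
  NM_A = GRanges("chr1", IRanges(start = c(101, 501), width = c(100, 250)), strand = "+"),
  NM_B = GRanges("chr2", IRanges(start = 1001, width = 384), strand = "-")
)

# One width per range, grouped by transcript (an IntegerList)
toy_widths <- width(ranges(toy_utrs))
print(toy_widths)

# Summing within each list element gives the total 3'UTR length per transcript
toy_totals <- sum(toy_widths)
print(toy_totals)
```

Calling sum() on the list of widths is a shortcut for the per-transcript total; the data frame route used in this post reaches the same numbers and keeps the transcript names as a column.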

We put it all together in a data frame, adding up the widths that belong to the same transcript:

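# One row per 3'UTR range: columns group, group_name (transcript name) and value (width)
the_lengths        <- as.data.frame(length_threeUTRs)
# Sum the widths of each transcript
the_lengths        <- the_lengths %>% group_by(group, group_name) %>% summarise(utr_length = sum(value)) %>% ungroup()
# Keep one row per distinct transcript name and length
the_lengths        <- unique(the_lengths[,c("group_name", "utr_length")])
colnames(the_lengths) <- c("RefSeq Transcript", "3' UTR Length")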
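The grouping step can be tried on a small made-up table first. The sketch below assumes the same three columns that as.data.frame() produces for a named list of widths (group, group_name, value); the transcript names and numbers are hypothetical.

```r
library(dplyr)

# Hypothetical toy table: one row per 3'UTR range, NM_A has two ranges
toy <- data.frame(
  group      = c(1L, 1L, 2L, 3L),
  group_name = c("NM_A", "NM_A", "NM_B", "NM_C"),
  value      = c(100L, 250L, 384L, 1545L),
  stringsAsFactors = FALSE
)

# Add up the widths within each transcript, then drop the grouping
toy_lengths <- toy %>%
  group_by(group, group_name) %>%
  summarise(utr_length = sum(value)) %>%
  ungroup()

print(toy_lengths)
```

Grouping by both group and group_name keeps transcripts apart even when the same name appears more than once in the list, since group is the position in the list and not the name.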

The data is now in the the_lengths data frame. Here are 10 rows of it:

## # A tibble: 10 x 2
##    RefSeq Transcript 3' UTR Length
##                <chr>         <int>
## 1          NM_008866          1719
## 2       NM_001159750          1545
## 3          NM_011541          1545
## 4       NM_001159751          1545
## 5       NM_001310442           384
## 6          NM_133826           384
## 7       NM_001204371          3349
## 8       NM_001318735          3262
## 9          NM_011011          3349
## 10         NM_009826          1829

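We can save the data for later (without the row names, which would otherwise come back as an extra column):

write.csv(the_lengths, "the_lengths.csv", row.names=FALSE)

And we can get it back, keeping the column names exactly as we wrote them:

again <- read.csv("the_lengths.csv", check.names=FALSE)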
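Column names with spaces or a leading digit, like the ones used here, are rewritten by read.csv() unless told otherwise. The following base R sketch uses a tiny made-up table and a temporary file (both are assumptions for illustration) to compare reading the file back with and without check.names=FALSE.

```r
# Hypothetical two-row table with the same column names as the_lengths
toy <- data.frame(a = c("NM_A", "NM_B"), b = c(350L, 384L),
                  stringsAsFactors = FALSE)
colnames(toy) <- c("RefSeq Transcript", "3' UTR Length")

# Write to a temporary file, leaving out the row names
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Read back with the original column names preserved
back <- read.csv(tmp, check.names = FALSE, stringsAsFactors = FALSE)

# Read back with the defaults: names are converted to valid R identifiers
mangled <- read.csv(tmp, stringsAsFactors = FALSE)

print(colnames(back))
print(colnames(mangled))
```

Either form holds the same values; the difference is only in how the columns have to be addressed afterwards.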

Done!