Thursday, July 24, 2014

Big Data solutions in genomics

Are big data solutions for genomics all about increasing processing speed / handling it's size? Perhaps the most immediate application of big data solutions (or in most of these cases, Hadoop) is optimization or distributed computing. Examples of these are:

  • Hadoop-BAM library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework
  • CloudDOE a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce
  • CloudBurst parallel read mapping 
  • DistMap short read mapping on a Hadoop cluster
  • SeqPig scalable scripting for large sequencing data sets in Hadoop
  • Crossbow  implementation of exisiting tools (Bowtie, SoapSNP) to run on a parallel cloud cluster
  • SeqAlto read alignment and resequencing
  • Myrna Cloud-scale differential gene expression for RNA-seq
These are great examples of processing NGS data quicker and over a cluster/cloud/parallel system.

Can it do more? I would really like to see (and also develop myself if I can!) genomics solutions using these principles for:
  • Statistical analysis
  • Network inference
  • Associations
  • Metagenomics
Perhaps it is a good time to also talk about the application of non-relational databases (NoSQL) solutions. Genomic data is sparsely distributed (at least to me, in terms of storing metadata as well). For example, if you look at chromatin-bound entities (ChIP-sequencing), not all have the same bound sites. Similarly, a gene expression experiment will store data differently than a DNA mutation one. They do however share similar data, and more often than not, we have to combine them.

 I leave you with the videos of the Big Data in Biomedicine Conference 2014, which talks a lot on what kind of big data is being generated, what kind of computer technology is available to provide solutions, and a little bit at the end on the statistics and machine learning (best part!).

Sunday, July 20, 2014

R: sequence logo from FASTA

# suppose you have some sequences in a string
fasta<-c("CAGGTCAGC","TGGGTGAGG","GCCGTGAGT","TTGGTAATT","AAGGTAGGA")
[1] "CAGGTCAGC" "TGGGTGAGG" "GCCGTGAGT" "TTGGTAATT" "AAGGTAGGA"
# if you happen to come from a DNAStringSet object,
#
## A DNAStringSet instance of length 49
## width seq
## [1] 9 CAGGTCAGC
## [2] 9 TGGGTGAGG
## [3] 9 GCCGTGAGT
## [4] 9 TTGGTAATT
## [5] 9 AAGGTAGGA
#
# You can do this to obtain a string vector representation:
# fasta<-as.character(dna)
#plot seq logo
plot_seqlogo <-function(fasta_string){
require(seqLogo)
require(Biostrings)
freq<-consensusMatrix(fasta_string,as.prob=T)[1:4,]
freq<-data.frame(freq)
seqLogo(makePWM(freq),ic.scale=FALSE) #ic.scale determines either frequency or bits
}
plot_seqlogo(fasta)
view raw seqlogo.R hosted with ❤ by GitHub

Wednesday, July 9, 2014

Extracting fasta from bed in R

Extracting fasta from bed (specifying chromosome location)

Currently, bedtools is not readily available in R, but if you have bedtools installed (Mac/Unix) and it is in your path (export your path if not), you can write a simple function in R to replicate the commandline task.
bedTools<-function(fstring="fastaFromBed",bedframe,fasta_file,opt="-s -tab"){
#create temp files
bed.file= tempfile()
out = tempfile()
# writing temporary files
write.table(bedframe,file=bed.file,quote=F,sep="\t",col.names=F,row.names=F)
# call the function
command=paste(fstring,opt,"-fi",fasta_file,"-bed",bed.file,"-fo",out,sep=" ")
cat(command,"\n")
try(system(command))
res=read.table(out,header=F,stringsAsFactors =FALSE)
unlink(bed.file);unlink(out)
return(res)
}
view raw bedtools.R hosted with ❤ by GitHub