Thursday, July 24, 2014

Big Data solutions in genomics

Are big data solutions for genomics all about increasing processing speed / handling it's size? Perhaps the most immediate application of big data solutions (or in most of these cases, Hadoop) is optimization or distributed computing. Examples of these are:

  • Hadoop-BAM library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework
  • CloudDOE a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce
  • CloudBurst parallel read mapping 
  • DistMap short read mapping on a Hadoop cluster
  • SeqPig scalable scripting for large sequencing data sets in Hadoop
  • Crossbow  implementation of exisiting tools (Bowtie, SoapSNP) to run on a parallel cloud cluster
  • SeqAlto read alignment and resequencing
  • Myrna Cloud-scale differential gene expression for RNA-seq
These are great examples of processing NGS data quicker and over a cluster/cloud/parallel system.

Can it do more? I would really like to see (and also develop myself if I can!) genomics solutions using these principles for:
  • Statistical analysis
  • Network inference
  • Associations
  • Metagenomics
Perhaps it is a good time to also talk about the application of non-relational databases (NoSQL) solutions. Genomic data is sparsely distributed (at least to me, in terms of storing metadata as well). For example, if you look at chromatin-bound entities (ChIP-sequencing), not all have the same bound sites. Similarly, a gene expression experiment will store data differently than a DNA mutation one. They do however share similar data, and more often than not, we have to combine them.

 I leave you with the videos of the Big Data in Biomedicine Conference 2014, which talks a lot on what kind of big data is being generated, what kind of computer technology is available to provide solutions, and a little bit at the end on the statistics and machine learning (best part!).

Wednesday, July 9, 2014

Extracting fasta from bed in R

Extracting fasta from bed (specifying chromosome location)

Currently, bedtools is not readily available in R, but if you have bedtools installed (Mac/Unix) and it is in your path (export your path if not), you can write a simple function in R to replicate the commandline task.