onezerobio: July 2014

Thursday, July 24, 2014

Big Data solutions in genomics

Are big data solutions for genomics all about increasing processing speed / handling it's size? Perhaps the most immediate application of big data solutions (or in most of these cases, Hadoop) is optimization or distributed computing. Examples of these are:

Hadoop-BAM library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework
CloudDOE a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce
CloudBurst parallel read mapping
DistMap short read mapping on a Hadoop cluster
SeqPig scalable scripting for large sequencing data sets in Hadoop
Crossbow implementation of exisiting tools (Bowtie, SoapSNP) to run on a parallel cloud cluster
SeqAlto read alignment and resequencing
Myrna Cloud-scale differential gene expression for RNA-seq

These are great examples of processing NGS data quicker and over a cluster/cloud/parallel system.

Can it do more? I would really like to see (and also develop myself if I can!) genomics solutions using these principles for:

Statistical analysis
Network inference
Associations
Metagenomics

Perhaps it is a good time to also talk about the application of non-relational databases (NoSQL) solutions. Genomic data is sparsely distributed (at least to me, in terms of storing metadata as well). For example, if you look at chromatin-bound entities (ChIP-sequencing), not all have the same bound sites. Similarly, a gene expression experiment will store data differently than a DNA mutation one. They do however share similar data, and more often than not, we have to combine them.

I leave you with the videos of the Big Data in Biomedicine Conference 2014, which talks a lot on what kind of big data is being generated, what kind of computer technology is available to provide solutions, and a little bit at the end on the statistics and machine learning (best part!).

Wednesday, July 9, 2014

Extracting fasta from bed in R

Extracting fasta from bed (specifying chromosome location)

Currently, bedtools is not readily available in R, but if you have bedtools installed (Mac/Unix) and it is in your path (export your path if not), you can write a simple function in R to replicate the commandline task.

onezerobio

Thursday, July 24, 2014

Big Data solutions in genomics

Sunday, July 20, 2014

R: sequence logo from FASTA

Wednesday, July 9, 2014

Extracting fasta from bed in R