- Hadoop-BAM: a library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework
- CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce
- CloudBurst: parallel read mapping
- DistMap: short read mapping on a Hadoop cluster
- SeqPig: scalable scripting for large sequencing data sets in Hadoop
- Crossbow: an implementation of existing tools (Bowtie, SOAPsnp) to run on a parallel cloud cluster
- SeqAlto: read alignment and resequencing
- Myrna: cloud-scale differential gene expression for RNA-seq
Can it do more? I would really like to see (and also develop myself if I can!) genomics solutions using these principles for:
- Statistical analysis
- Network inference
- Associations
- Metagenomics
I leave you with the videos of the Big Data in Biomedicine Conference 2014. They spend a lot of time on what kinds of big data are being generated and what computing technology is available to provide solutions, with a little bit at the end on statistics and machine learning (the best part!).