Friday, October 24, 2014

Migrating from Unfuddle (Subversion) to Bitbucket (Git)

I tried the migration script provided by Atlassian, but it failed because the default Unfuddle setup does not use the standard trunk/branches/tags layout.

I used svn2git instead, with the following steps:

1. Create an authors.txt file that maps Subversion usernames to Git authors (see https://github.com/nirvdrum/svn2git for instructions)

2. Run svn2git
svn2git -v http://svn.example.com/path/to/repo --username dlow --trunk / --nobranches --notags --authors authors.txt

3. Create repo on Bitbucket

4. Sync your repo
git remote add origin https://user@bitbucket.org/user/your_git_repo.git
git push -u origin --all


This should carry all your previous Subversion commits over to Bitbucket (or to Git in general).
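
For step 1, here is a rough sketch of how a starting authors.txt could be generated from the Subversion log. The function name authors_template and the @example.com addresses are placeholders of mine, not part of svn2git. The "user = Name <email>" line format is the one described in the svn2git README. The sketch assumes the usual "svn log --quiet" output, where each revision line looks like "r12 | dlow | 2014-10-20 ...".

```shell
#!/bin/sh
# Sketch: turn `svn log --quiet` output (read from stdin) into an
# authors.txt template with one line per distinct Subversion username.
# The name and email on the right-hand side are placeholders to edit by hand.
authors_template() {
    awk -F'|' '/^r[0-9]+ \|/ {
        user = $2
        gsub(/^ +| +$/, "", user)   # trim the spaces around the username
        print user
    }' | sort -u | while IFS= read -r user; do
        printf '%s = %s <%s@example.com>\n' "$user" "$user" "$user"
    done
}
```

Something like "svn log --quiet http://svn.example.com/path/to/repo | authors_template > authors.txt" would then give a file to edit, replacing each placeholder with the real name and email before running step 2.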

If you use Eclipse as your IDE, the following video is useful for setting up EGit and going through some basics of Git.


Wednesday, October 8, 2014

Choosing colors

While not specific to R, ColorBrewer is an excellent reference for color ranges across different numbers of bins. It is fantastic for heatmaps.

Wednesday, August 27, 2014

Machine learning (feature selection) in R

I have been taking a couple of machine learning classes on Coursera (Johns Hopkins - Practical Machine Learning, WashU - Introduction to Data Science; I am waiting for Stanford's course to be offered again!). I found the following post very useful for explaining the mechanics of feature selection in detail.

http://topepo.github.io/caret/featureselection.html

It explains various methods and algorithms for feature selection in R. The caret package is particularly useful as it conveniently unifies different methods under a single function wrapper, train.

Thursday, July 24, 2014

Big Data solutions in genomics

Are big data solutions for genomics all about increasing processing speed and handling data size? Perhaps the most immediate application of big data solutions (in most of these cases, Hadoop) is optimization or distributed computing. Examples include:

  • Hadoop-BAM: library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework
  • CloudDOE: a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce
  • CloudBurst: parallel read mapping
  • DistMap: short read mapping on a Hadoop cluster
  • SeqPig: scalable scripting for large sequencing data sets in Hadoop
  • Crossbow: implementation of existing tools (Bowtie, SoapSNP) to run on a parallel cloud cluster
  • SeqAlto: read alignment and resequencing
  • Myrna: cloud-scale differential gene expression for RNA-seq
These are great examples of processing NGS data faster by spreading the work over a cluster, cloud, or other parallel system.

Can it do more? I would really like to see (and also develop myself if I can!) genomics solutions using these principles for:
  • Statistical analysis
  • Network inference
  • Associations
  • Metagenomics
Perhaps it is also a good time to talk about applying non-relational database (NoSQL) solutions. Genomic data is sparsely distributed (at least to me, and in terms of storing metadata as well). For example, if you look at chromatin-bound entities (ChIP-sequencing), not all of them have the same bound sites. Similarly, a gene expression experiment will store data differently from a DNA mutation one. They do, however, share similar data, and more often than not we have to combine them.

I leave you with the videos of the Big Data in Biomedicine Conference 2014, which cover at length what kind of big data is being generated and what kind of computer technology is available to provide solutions, with a little at the end on statistics and machine learning (the best part!).

Wednesday, July 9, 2014

Extracting fasta from bed in R

Extracting fasta from bed (specifying chromosome location)

Currently, bedtools is not readily available in R, but if you have bedtools installed (Mac/Unix) and it is on your PATH (export your PATH if not), you can write a simple function in R to replicate the command-line task.
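
As a rough sketch, the command-line task being replicated looks something like the shell function below. The function name bed_to_fasta and the three file arguments are placeholders of mine. The bedtools call itself uses the getfasta subcommand with its -fi (reference FASTA), -bed (intervals) and -fo (output file) options.

```shell
#!/bin/sh
# Sketch: extract FASTA sequences for the intervals listed in a BED file.
# Usage: bed_to_fasta genome.fa regions.bed out.fa   (placeholder file names)
bed_to_fasta() {
    fasta=$1
    bed=$2
    out=$3

    # bedtools must be on PATH, as the post assumes
    if ! command -v bedtools >/dev/null 2>&1; then
        echo "bedtools not found on PATH" >&2
        return 127
    fi

    # Both input files must exist and be readable
    for f in "$fasta" "$bed"; do
        if [ ! -r "$f" ]; then
            echo "cannot read input file: $f" >&2
            return 1
        fi
    done

    # -fi: reference FASTA, -bed: intervals to extract, -fo: output FASTA
    bedtools getfasta -fi "$fasta" -bed "$bed" -fo "$out"
}
```

From R, the same bedtools command line could be assembled with paste() and handed to system(), which is presumably all a wrapper function needs to do.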

Tuesday, June 17, 2014

Data Analysis in R : PCA

A good article on principal component analysis (PCA) in R, how to do feature selection, and related topics. "Big data" is a hot topic at the moment, and its relevance in bioinformatics cannot be ignored.

http://www.r-bloggers.com/introduction-to-feature-selection-for-bioinformaticians-using-r-correlation-matrix-filters-pca-backward-selection/