Friday, March 28, 2014

Version control systems

The choice of version control system (Git vs SVN vs CVS vs Mercurial) is unlikely to have much impact when working within a single lab environment (with one or two coders). I started thinking more about this when moving some of my more "public" code from an SVN-based Unfuddle over to GitHub. Git is better suited to working in a larger group, and the concept of forking/pulling is great. The only thing stopping me from moving wholesale is the pricing for private repos (Unfuddle is free). I also like that Bioconductor can bridge its SVN repo with GitHub (so I can share my code easily as well), and I enjoy the whole user experience of the site.

For now, I'm on both. I'm thinking of moving my private repos to Bitbucket, which supports Git and Mercurial. The ability to work with a local repo (Git) while on the road is very useful.

Some further reading: http://stackoverflow.com/questions/871/why-is-git-better-than-subversion

Tuesday, March 18, 2014

Adding the optional IH tag to SAM files

One of the major complaints about the two most commonly used aligners, BWA and Bowtie, is their failure to report the NH or IH tag.

The IH tag indicates the number of alignments stored in the SAM file that contain the current query (i.e. the read). This is useful for multi-mapped reads if you want to know how many locations the same read has been mapped to (e.g. assuming your Bowtie parameter "k" has been set to more than 1).

I've written an awk one-liner that adds this tag to your SAM file. It iterates over the file twice: first to tabulate the counts, and then to write the extra tag.
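The two-pass idea can also be sketched in Python. This is not the awk one-liner itself, only a minimal illustration under some assumptions: the function names count_alignments and add_ih_tag are made up for this example, every non-header line is counted by its read name (column 1), existing IH tags are not checked for, and unmapped reads or paired-end mates sharing a read name are not treated specially.

```python
from collections import Counter


def count_alignments(sam_path):
    """Pass 1: count alignment records per read name (QNAME, column 1)."""
    counts = Counter()
    with open(sam_path) as handle:
        for line in handle:
            # Skip header lines (starting with '@') and blank lines.
            if line.startswith("@") or not line.strip():
                continue
            counts[line.split("\t", 1)[0]] += 1
    return counts


def add_ih_tag(sam_path, out_path):
    """Pass 2: append IH:i:<count> to every alignment record."""
    counts = count_alignments(sam_path)
    with open(sam_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = line.rstrip("\n")
            if record and not record.startswith("@"):
                qname = record.split("\t", 1)[0]
                record = f"{record}\tIH:i:{counts[qname]}"
            dst.write(record + "\n")
```

Calling add_ih_tag("in.sam", "out.sam") (hypothetical file names) reads the input twice and leaves header lines untouched.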

Saturday, March 15, 2014

Fixing svn in RStudio (Mac OS)

After updating RStudio and R to version 3.0.3, I lost the "svn" option under version control.

RStudio started with these messages (a clue to fixing the problem!):
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
The internationalization settings of R.app were causing this problem. Running
system("defaults write org.R-project.R force.LANG en_US.UTF-8")
on the R command line and restarting RStudio was all that was needed.

Thursday, March 6, 2014

Filtering FASTQ files for unique reads

Filtering out duplicate reads in FASTQ files may be important if your application needs to count only unique entries.

Brent Pedersen wrote a very quick script that uses Bloom filters for this purpose (read more at: http://hackmap.blogspot.sg/2010/10/bloom-filter-ing-repeated-reads.html). The installation process might not be clear for those not familiar with code, so I'll try to explain it step by step here.

To run the fastq_unique.py script, you'll need three things:
  1. Perl module Bloom Faster
    • either install through cpan or manual download
  2. Python module nose (pybloomfaster tests)
    • installation directions on the nose page
  3. Brent's wrapper pybloomfaster
    • download the master zip
    • sudo python setup.py install
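To make the idea behind the script concrete, here is a minimal pure-Python sketch of Bloom-filter de-duplication. It is not Brent's fastq_unique.py and does not use the pybloomfaster API; the BloomFilter class, its default sizes, and unique_fastq_records are hypothetical names chosen for illustration, and the input is assumed to be plain 4-line FASTQ records. A Bloom filter never reports a seen sequence as new, but it can occasionally report a new sequence as seen, so a small fraction of unique reads may be dropped.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: several hash positions over a fixed bit array."""

    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting a SHA-1 digest with an index.
        for i in range(self.n_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        """Insert item; return True if it was (possibly) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def unique_fastq_records(lines, bloom=None):
    """Yield 4-line FASTQ records whose sequence line has not been seen yet."""
    bloom = bloom if bloom is not None else BloomFilter()
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # record[1] is the sequence line of the FASTQ record.
            if not bloom.add(record[1]):
                yield tuple(record)
            record = []
```

The filter uses a fixed amount of memory regardless of how many reads pass through it, which is the reason to prefer it over keeping every sequence in a set.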

Installing python modules with setuptools or pip

Remember to set your http (and https) proxy!


You may run into errors like this:
sudo pip install nose
Cannot fetch index base URL http://pypi.python.org/simple/

Or this:
sudo easy_install nose
Scanning index of all packages (this may take a while)
Reading http://pypi.python.org/simple/
Download error: [Errno -2] Name or service not known -- Some packages may not be found!

Fixing these is simply a matter of setting your HTTP proxy, because PyPI redirects to HTTPS. Check your environment with:
env | grep -i http

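If it returns nothing, no proxy has been set. In a Bourne-style shell such as bash, set them using (replace the address with your own proxy):
export http_proxy=http://localhost:8080
export https_proxy=http://localhost:8080

If the install runs under sudo, the variables may need to be passed through with sudo -E, since sudo can reset the environment.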

And it should now work.
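The same environment check can be done from Python, which is handy when the install is driven by a script. This is only a sketch: missing_proxy_vars is a made-up helper name, and it looks only at the two variables discussed above, ignoring case in the same way that grep -i does.

```python
import os

# The two proxy variables discussed above.
PROXY_VARS = ("http_proxy", "https_proxy")


def missing_proxy_vars(environ=None):
    """Return the proxy variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    # Compare names case-insensitively, as HTTP_PROXY is also commonly used.
    lowered = {key.lower(): value for key, value in environ.items()}
    return [name for name in PROXY_VARS if not lowered.get(name)]
```

An empty list from missing_proxy_vars() means both variables are visible to the current process.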