Friday, October 24, 2014

Migrating from Unfuddle (Subversion) to BitBucket (Git)

Tried to use the script provided by Atlassian, but it failed as the default Unfuddle setup does not have the trunk/branches/tags structure.

Used svn2git instead with the following steps:

1. Create an authors.txt to map svn to git usernames (check https://github.com/nirvdrum/svn2git for instructions)

2. Run svn2git
svn2git -v http://svn.example.com/path/to/repo --username dlow --trunk / --nobranches --notags --authors authors.txt

3. Create repo on Bitbucket

4. Sync your repo
git remote add origin https://user@bitbucket.org/user/your_git_repo.git
git push -u origin --all
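For step 1, authors.txt uses the same one-mapping-per-line format as git-svn; the usernames and emails below are placeholders, so substitute your own:

```shell
# Example authors.txt for svn2git (placeholder identities):
# left side = svn username, right side = git author name and email.
cat > authors.txt <<'EOF'
svnuser1 = Jane Smith <jsmith@example.com>
svnuser2 = John Doe <jdoe@example.com>
EOF
```

Any svn username missing from the file will make the migration stop, so list every committer in the repo's history.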


This should preserve all your previous commits from svn over to BitBucket (or Git in general).

If you use Eclipse as your IDE, the following video is useful for setting up EGit and going through some basics of Git.


Wednesday, October 8, 2014

Choosing colors

While not specifically for R, ColorBrewer gives you an excellent reference for color ranges across different numbers of bins as well. Fantastic for heatmaps.

Wednesday, August 27, 2014

Machine learning (feature selection) in R

I have been taking a couple of machine learning classes on Coursera (Johns Hopkins - Practical Machine Learning, WashU - Introduction to Data Science, waiting for Stanford's course to be offered again!) and I find the following post to be very useful in explaining in detail the mechanics of feature selection.

http://topepo.github.io/caret/featureselection.html

It explains various methods and algorithms for feature selection in R. The caret package is particularly useful as it conveniently unifies different methods under a single function wrapper, train.

Thursday, July 24, 2014

Big Data solutions in genomics

Are big data solutions for genomics all about increasing processing speed and handling its size? Perhaps the most immediate application of big data solutions (or in most of these cases, Hadoop) is optimization through distributed computing. Examples of these are:

  • Hadoop-BAM library for the scalable manipulation of aligned next-generation sequencing data in the Hadoop distributed computing framework
  • CloudDOE a user-friendly tool for deploying Hadoop clouds and analyzing high-throughput sequencing data with MapReduce
  • CloudBurst parallel read mapping 
  • DistMap short read mapping on a Hadoop cluster
  • SeqPig scalable scripting for large sequencing data sets in Hadoop
  • Crossbow implementation of existing tools (Bowtie, SoapSNP) to run on a parallel cloud cluster
  • SeqAlto read alignment and resequencing
  • Myrna Cloud-scale differential gene expression for RNA-seq
These are great examples of processing NGS data more quickly over a cluster/cloud/parallel system.

Can it do more? I would really like to see (and also develop myself if I can!) genomics solutions using these principles for:
  • Statistical analysis
  • Network inference
  • Associations
  • Metagenomics
Perhaps it is a good time to also talk about the application of non-relational database (NoSQL) solutions. Genomic data is sparse (at least to me, in terms of storing metadata as well). For example, if you look at chromatin-bound entities (ChIP-sequencing), not all have the same bound sites. Similarly, a gene expression experiment will store data differently from a DNA mutation one. They do, however, share similar data, and more often than not we have to combine them.

I leave you with the videos of the Big Data in Biomedicine Conference 2014, which cover a lot about what kind of big data is being generated, what computing technology is available to provide solutions, and, at the end, a little on the statistics and machine learning (best part!).

Wednesday, July 9, 2014

Extracting fasta from bed in R

Extracting fasta from bed (specifying chromosome location)

Currently, bedtools is not readily available in R, but if you have bedtools installed (Mac/Unix) and it is in your path (export your PATH if not), you can write a simple function in R to replicate the command-line task.
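For what it's worth, the main thing any such wrapper has to get right is that BED intervals are 0-based and half-open. Here is a toy shell sketch of just that coordinate arithmetic, assuming a single-record, single-line FASTA (the real bedtools getfasta handles multi-record files, line wrapping, strandedness, etc.):

```shell
# Toy illustration of BED coordinate arithmetic (not a bedtools replacement).
# BED is 0-based, half-open: substr() starts at start+1, length is end-start.
extract_interval() {
  fasta=$1; start=$2; end=$3
  awk -v s="$start" -v e="$end" 'NR==2 { print substr($0, s+1, e-s) }' "$fasta"
}

printf '>chr1\nACGTACGTAC\n' > toy.fa
extract_interval toy.fa 2 6    # prints GTAC
```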

Tuesday, June 17, 2014

Data Analysis in R : PCA

Good article on principal component analysis in R, how to do feature selection, etc. "Big data" is a hot topic at the moment, and its relevance in bioinformatics cannot be ignored.

http://www.r-bloggers.com/introduction-to-feature-selection-for-bioinformaticians-using-r-correlation-matrix-filters-pca-backward-selection/

Monday, June 2, 2014

Coursera course review : Statistical Inference

Statistical Inference

This is a concise 4 week introductory course to basic statistics concepts, and how to perform statistical inference, offered by Johns Hopkins University. You will learn about sets, distributions, confidence intervals and multiple testing.

You do go through a fair bit of material in such a short time, and you would probably benefit from extra reading outside the course, especially if you have not taken any statistics class before. The topics covered in this class are approximately a mixture of Biostatistics Bootcamp 1 and a small chunk of Biostatistics Bootcamp 2, both of which are 8-week courses offered by the same department. Having taken them before this class, I felt it was a nice summary and refresher course for me, and the extended time spent on the longer courses was very helpful.

Having said that, the Statistical Inference class I took was its first run, and it will very likely improve in subsequent offerings.

Verdict: If you have not taken any stats courses before, perhaps it would be beneficial to spend the extra time on the two Bootcamp courses.

Monday, May 26, 2014

Coursera course review : Interactive Python Programming

An Introduction to Interactive Programming in Python

Although "introduction" is in the title, do not be misled into thinking that it is an introduction to Python programming; rather, it introduces *interactive* Python programming. Reading through the course forum, many beginners found it hard to grasp concepts fast enough to complete the weekly coding mini-projects (my background: I have used Python every day at work for years; I would say that if you are not completely clueless about programming and have had at least some encounter with coding concepts, this course is manageable). Having said that, the forum is a great resource, and there are lots of people willing to help you out with coding problems and point you in the right direction.

You write and execute your code on CodeSkulptor (made by Scott, one of the course instructors). It is a great tool that eliminates the need for installs, and there are custom modules created to take away the nitty-gritty parts of graphical programming - which helps lessen the frustration of potentially dealing with more technical issues.

What attracted me most to this course is the application of the concepts learned. Each week, students will code a game. You start text-based, and at the end you code a graphical game with sound and animation!

This course is a good example of applied coding skills, and the course instructors were very good in presenting concepts and helping with framing the projects, and kick-starting the enthusiasm to complete them every week.

Verdict: Highly recommended, and lots of fun.

Wednesday, May 21, 2014

Useful Javascript functions!

Personal list of useful functions:

Converting a Javascript array to a function arguments list 
Very important for writing short code when passing values to another function (exploding the list).

Tuesday, May 6, 2014

Adding column entries to DataFrame objects

The DataFrame class comes from the IRanges package and is used by several other packages (eg. Rsamtools) to read in bam files.

Simply cbind-ing a column to the DataFrame object will change it to the more conventional data.frame class.

So, instead of 
cbind(object1,newcolumn)

do:
cbind(object1,DataFrame(newcolumn))

to ensure that the result is still in the DataFrame class.

Monday, May 5, 2014

Subsetting/ extracting reads from bam files

1. Regions (make an index first, then select your regions)
samtools index file.bam
samtools view -b file.bam chr1 > file_chr1.bam 
samtools view -b file.bam chr1:1-10000 > file_chr1_1_10000.bam 

2. Regions given in bed file
samtools view -bL regions.bed file.bam > file_regions.bam 

3. First n reads
samtools view -h file.bam | head -n 10000 | samtools view -bS - > file_n10000.bam

Monday, April 21, 2014

Book review : Programming the Semantic Web


I picked up this book wanting to learn more about NoSQL, which I had to implement in developing some packages. This book serves well as a beginner's guide: why we need semantics, and it compares the current ideas and structure of SQL databases to the more flexible and abstract forms it is trying to promote.

For me it was written in a very logical progression, and concepts were introduced at a good pace. The book goes on to talk about RDF and OWL, together with how they can be implemented (O'Reilly has another book, Learning SPARQL, which may be a logical progression after this one). No doubt it is not going to make you a pro on the subject, but it was sufficient for me to understand the basics of semantic data structures.

Few things you might want to take note:
  1. this book uses Python, so a basic understanding of it is very helpful for following the implementations
  2. the website semprog.com is not available anymore, but you can get the code package at http://examples.oreilly.com/9780596153823/Programming-the-Semantic-Web.zip
Overall, a good introductory book.

PS: If you don't like to read, YouTube now has a lot of channels talking about NoSQL and semantics, including O'Reilly's own, so you might want to check those out.

Friday, March 28, 2014

Version control systems

The type of version control system (Git vs SVN vs CVS vs Mercurial) used would likely not have much impact while working within a single lab environment (with one or two coders). I started thinking more about this when moving some of my more "public" code from an SVN-based Unfuddle over to GitHub. Git is better suited to working in a larger group, and the concepts of forking and pull requests are great. The only thing stopping me from moving wholesale is the pricing for private repos (Unfuddle is free). I also like that Bioconductor can bridge its SVN repo with GitHub (so I can share my code easily as well), and I enjoy the whole user experience of the site.

As for now, I'm on both. I'm thinking of moving my private repos to BitBucket, which supports Git and Mercurial. The ability to work with a local repo (Git) while on the road is very useful.

Some further self-reading : http://stackoverflow.com/questions/871/why-is-git-better-than-subversion

Tuesday, March 18, 2014

Adding the optional IH tag to SAM files

One of the major complaints about the two most often used aligners, BWA and Bowtie, is their failure to report the NH or IH tag.

The IH tag indicates the number of stored alignments in the SAM file that contain the current query (i.e. the read). This is meaningful for multi-mapped reads if you want to know to how many locations the same read has been mapped (e.g. when Bowtie's "k" parameter is set to more than 1).

I've written an awk one-liner that will add this tag to your SAM file. It iterates over the file twice: first to tabulate counts, then to write the extra tag.
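A two-pass version of that idea can be sketched as follows (my own reconstruction, not necessarily the original one-liner; it keys on QNAME in column 1 and leaves header lines untouched):

```shell
# Tiny example SAM: read r1 aligns twice, r2 once.
printf '@HD\tVN:1.4\nr1\t0\tchr1\t1\t30\t4M\t*\t0\t0\tACGT\tIIII\nr1\t256\tchr1\t9\t30\t4M\t*\t0\t0\tACGT\tIIII\nr2\t0\tchr1\t5\t30\t4M\t*\t0\t0\tACGT\tIIII\n' > file.sam

# Pass 1 (NR==FNR) tabulates alignments per QNAME; pass 2 appends IH:i:<count>.
awk 'NR == FNR { if ($0 !~ /^@/) count[$1]++; next }
     /^@/      { print; next }
               { print $0 "\tIH:i:" count[$1] }' file.sam file.sam > file.IH.sam

grep -c 'IH:i:2' file.IH.sam   # the two r1 alignments get IH:i:2
```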

Saturday, March 15, 2014

Fixing svn in RStudio (Mac OS)

After updating RStudio and R to version 3.0.3, I lost the "svn" option under version control.

RStudio started with these messages (a clue to fixing the problem!):
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
The internationalization settings of R.app were causing this problem, and a simple
system("defaults write org.R-project.R force.LANG en_US.UTF-8")
on the R command line, followed by a restart of RStudio, was all that was needed.

Thursday, March 6, 2014

Filtering FASTQ files for unique reads

Filtering out duplicate reads in fastq files may be important if your application requires considering only unique entries for counting, etc.

Brent Pedersen wrote a very quick script utilizing Bloom filters for this purpose (read more at: http://hackmap.blogspot.sg/2010/10/bloom-filter-ing-repeated-reads.html). The installation process might not be clear for those not familiar with building code, so I'll try to explain the process step by step here.

To run the fastq_unique.py script, you'd need three things:
  1. Perl module Bloom Faster
    • either install through cpan or manual download
  2. Python module nose (pybloomfaster tests)
    • installation directions on the nose page
  3. Brent's wrapper pybloomfaster
    • download the master zip
    • sudo python setup.py install
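As an aside: the Bloom filter is what lets Brent's script handle files too big for an exact hash. If your fastq does fit in memory, the keep-first-occurrence logic itself can be sketched in plain awk (an illustration only, deduplicating on the sequence line):

```shell
# Tiny example fastq: r1 and r2 share the same sequence, r3 is unique.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTTT\n+\nIIII\n' > reads.fastq

# Buffer each 4-line record and print it only if its sequence
# (second line of the record) has not been seen before.
awk 'NR % 4 == 1 { h = $0 }
     NR % 4 == 2 { s = $0 }
     NR % 4 == 3 { p = $0 }
     NR % 4 == 0 { if (!(s in seen)) { seen[s] = 1; print h; print s; print p; print $0 } }' \
    reads.fastq > reads.unique.fastq

grep -c '^@r' reads.unique.fastq   # 2 records survive (r1 and r3)
```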
       
       

Installing python modules with setuptools or pip

Remember to set your http (and https) proxy!


Running into errors like this:
sudo pip install nose
Cannot fetch index base URL http://pypi.python.org/simple/

Or this:
sudo easy_install nose
Scanning index of all packages (this may take a while)
Reading http://pypi.python.org/simple/
Download error: [Errno -2] Name or service not known -- Some packages may not be found!

Both are simply a matter of setting your HTTP proxy, because PyPI redirects to HTTPS. Check your environment with:
env | grep -i http

If it returns empty, nothing has been set. Set them using:
export http_proxy=http://localhost:8080
export https_proxy=http://localhost:8080

And it should now work.

Thursday, February 27, 2014

Merging RNA-seq replicates from Tophat

Sometimes, you might want to present your replicates as one consensus sequencing track on viewers like IGV or IGB.

To merge bam files : Use samtools merge.
To merge junctions.bed files : Use the tbed2juncs and combineJuncs scripts at seqscripts
cat junctions_file1.bed junction_file_2.bed junction_file_3.bed | tbed2juncs | combineJuncs > combined.junctions.bed

Ref: http://seqanswers.com/forums/showthread.php?t=3381

Wednesday, February 26, 2014

Converting BAM files to BigWig format

One of the reasons to convert bam to BigWig is to get a normalized coverage in viewers such as IGB.

Tools needed

samtools (if your bam file is not sorted by position), bedtools, and UCSC's bedGraphToBigWig from the UCSC command-line utilities

Procedure

samtools sort input.bam input.sorted
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo" > hg19.genome
bedtools genomecov -ibam input.sorted.bam -bg > input.sorted.bedGraph
bedGraphToBigWig input.sorted.bedGraph hg19.genome input.sorted.bigwig

Unfortunately we can't pipe the genomecov output directly into bedGraphToBigWig, because bedGraphToBigWig makes a few passes over its input (for improved speed) and therefore can't read from a pipe.

Other things to consider:
1) Scaling : setting a scale factor (within bedtools) would normalize your bedgraph/bigwig files (the default=1, no scaling). I generally scale all files to say, 10 million reads (get total reads from eg. samtools flagstat output) so that they can be consistently used throughout different comparisons.
2) -split option should be used for RNA-seq to properly represent spliced reads.
3) The -strand option in theory would give you strand-specific files (meaning you'll have a +ve and a -ve set), but for my paired-end directional reads it seems to only work for the first read in the pair. I've left out this option until I figure out a solution.
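On point 1, the -scale argument to genomecov is a multiplier, so scaling to 10 million reads just means the target divided by the library's mapped-read count (the count below is made up; in practice take it from the samtools flagstat output):

```shell
# Hypothetical mapped-read count (read it off `samtools flagstat` in practice).
total_reads=25000000
target=10000000

# genomecov's -scale takes a multiplier: target / total.
scale=$(awk -v t="$total_reads" -v g="$target" 'BEGIN { printf "%.6f", g / t }')
echo "$scale"   # 0.400000
# bedtools genomecov -ibam input.sorted.bam -bg -scale "$scale" > input.scaled.bedGraph
```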

Wednesday, February 19, 2014

R Shiny + qPCR calculator

The problem with R (and many other platforms that involve some level of coding/commands for execution) for biologists is: how do I run/use a package? And how do we, as developers, "package" our work for mass consumption? Several years ago, I started out creating GUI interfaces to provide a point-and-click platform for my R scripts, but this still requires the user to install R and the relevant libraries.

Shiny takes care of that, putting your scripts on the web, and most importantly, frees the user from having to install anything.

As an exercise in building apps, I've created a simple qPCR calculator and plot generator.


More good things to come.

Monday, February 17, 2014

VNC (remote desktop) on Ubuntu

Server-side setup

Official Ubuntu help on VNC here : https://help.ubuntu.com/community/VNC/Servers

There are many flavours of VNC (server-side) in Linux, and the default one used by Ubuntu's Remote Desktop is vino (get it here https://apps.ubuntu.com/cat/applications/precise/vino/ if it is not installed)




Another option would be x11vnc, which offers more customizable options, if you need any.
sudo apt-get install x11vnc
x11vnc -usepw

x11vnc will then give you the port to connect to on the client side:
e2@e2-ubuntu:~$ x11vnc -usepw
18/02/2014 08:55:41 -usepw: found /home/e2/.vnc/passwd
18/02/2014 08:55:41 x11vnc version: 0.9.12 lastmod: 2010-09-09  pid: 37014
18/02/2014 08:55:41 Using X display :0
18/02/2014 08:55:41 rootwin: 0x150 reswin: 0x5200001 dpy: 0x1641590
18/02/2014 08:55:41 
18/02/2014 08:55:41 ------------------ USEFUL INFORMATION ------------------
18/02/2014 08:55:41 X DAMAGE available on display, using it for polling hints.
18/02/2014 08:55:41   To disable this behavior use: '-noxdamage'
18/02/2014 08:55:41 
18/02/2014 08:55:41   Most compositing window managers like 'compiz' or 'beryl'
18/02/2014 08:55:41   cause X DAMAGE to fail, and so you may not see any screen
18/02/2014 08:55:41   updates via VNC.  Either disable 'compiz' (recommended) or
18/02/2014 08:55:41   supply the x11vnc '-noxdamage' command line option.
18/02/2014 08:55:41 
18/02/2014 08:55:41 Wireframing: -wireframe mode is in effect for window moves.
18/02/2014 08:55:41   If this yields undesired behavior (poor response, painting
18/02/2014 08:55:41   errors, etc) it may be disabled:
18/02/2014 08:55:41    - use '-nowf' to disable wireframing completely.
18/02/2014 08:55:41    - use '-nowcr' to disable the Copy Rectangle after the
18/02/2014 08:55:41      moved window is released in the new position.
18/02/2014 08:55:41   Also see the -help entry for tuning parameters.
18/02/2014 08:55:41   You can press 3 Alt_L's (Left "Alt" key) in a row to 
18/02/2014 08:55:41   repaint the screen, also see the -fixscreen option for
18/02/2014 08:55:41   periodic repaints.
18/02/2014 08:55:41 
18/02/2014 08:55:41 XFIXES available on display, resetting cursor mode
18/02/2014 08:55:41   to: '-cursor most'.
18/02/2014 08:55:41   to disable this behavior use: '-cursor arrow'
18/02/2014 08:55:41   or '-noxfixes'.
18/02/2014 08:55:41 using XFIXES for cursor drawing.
18/02/2014 08:55:41 GrabServer control via XTEST.
18/02/2014 08:55:41 
18/02/2014 08:55:41 Scroll Detection: -scrollcopyrect mode is in effect to
18/02/2014 08:55:41   use RECORD extension to try to detect scrolling windows
18/02/2014 08:55:41   (induced by either user keystroke or mouse input).
18/02/2014 08:55:41   If this yields undesired behavior (poor response, painting
18/02/2014 08:55:41   errors, etc) it may be disabled via: '-noscr'
18/02/2014 08:55:41   Also see the -help entry for tuning parameters.
18/02/2014 08:55:41   You can press 3 Alt_L's (Left "Alt" key) in a row to 
18/02/2014 08:55:41   repaint the screen, also see the -fixscreen option for
18/02/2014 08:55:41   periodic repaints.
18/02/2014 08:55:41 
18/02/2014 08:55:41 XKEYBOARD: number of keysyms per keycode 7 is greater
18/02/2014 08:55:41   than 4 and 51 keysyms are mapped above 4.
18/02/2014 08:55:41   Automatically switching to -xkb mode.
18/02/2014 08:55:41   If this makes the key mapping worse you can
18/02/2014 08:55:41   disable it with the "-noxkb" option.
18/02/2014 08:55:41   Also, remember "-remap DEAD" for accenting characters.
18/02/2014 08:55:41 
18/02/2014 08:55:41 X FBPM extension not supported.
18/02/2014 08:55:41 X display is capable of DPMS.
18/02/2014 08:55:41 --------------------------------------------------------
18/02/2014 08:55:41 
18/02/2014 08:55:41 Default visual ID: 0x21
18/02/2014 08:55:41 Read initial data from X display into framebuffer.
18/02/2014 08:55:41 initialize_screen: fb_depth/fb_bpp/fb_Bpl 24/32/7680
18/02/2014 08:55:41 
18/02/2014 08:55:41 X display :0 is 32bpp depth=24 true color
18/02/2014 08:55:41 
18/02/2014 08:55:41 Autoprobing TCP port 
18/02/2014 08:55:41 Autoprobing selected port 5903
18/02/2014 08:55:41 Listening also on IPv6 port 5903 (socket 11)
18/02/2014 08:55:41 
18/02/2014 08:55:41 Xinerama is present and active (e.g. multi-head).
18/02/2014 08:55:41 Xinerama: number of sub-screens: 1
18/02/2014 08:55:41 Xinerama: no blackouts needed (only one sub-screen)
18/02/2014 08:55:41 
18/02/2014 08:55:42 fb read rate: 64 MB/sec
18/02/2014 08:55:42 The X server says there are 16 mouse buttons.
18/02/2014 08:55:42 screen setup finished.
18/02/2014 08:55:42 

The VNC desktop is:      e2-ubuntu:3
PORT=5903

Client-side setup


to be continued

Thursday, February 13, 2014

Setting up GNomEx on a Synology NAS

GNomEx is a data management and repository system catered toward genomics (microarray, high-throughput sequencing, etc.) which I found very useful during my collaborations with the Huntsman Cancer Institute in Utah.

I figured it would also be useful for our lab and collaborators, albeit on a much smaller scale. We have a Synology-based NAS setup here, which is convenient since it supports easy installation of most of the needed applications.

GnomEx version installed : 5.16.9

Packages needed on Synology:
  1. Java (as of this post, Synology handles auto-installation of ver 1.6.0_38)
  2. phpMyAdmin (you'd also need to enable MySQL through Control Panel - Web Services)
The Apache Tomcat package on Synology is still on version 6, so I installed 7 manually. *update* the latest DSM upgrade now offers Tomcat 7, so you don't have to do a manual installation.

You should be able to access the Tomcat page at the default http://yourserver:8080

To ensure Tomcat starts up automatically with your Synology box, create a script for autostart according to your Synology model (this one is for DS2413+).
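I won't reproduce my exact script, but the usual shape of such an autostart script is below. The paths are assumptions, so point CATALINA_HOME and JAVA_HOME at your actual install locations (Synology runs scripts in /usr/local/etc/rc.d with "start" at boot and "stop" at shutdown):

```shell
# Sketch of an rc.d autostart script for Tomcat; written to ./tomcat7.sh
# here for illustration, but it belongs in /usr/local/etc/rc.d/.
# CATALINA_HOME and JAVA_HOME below are assumptions - adjust them.
cat > tomcat7.sh <<'EOF'
#!/bin/sh
CATALINA_HOME=/volume1/@appstore/tomcat7
JAVA_HOME=/var/packages/java/target
export JAVA_HOME

case "$1" in
  start) "$CATALINA_HOME/bin/startup.sh" ;;
  stop)  "$CATALINA_HOME/bin/shutdown.sh" ;;
  *)     echo "Usage: $0 {start|stop}" >&2; exit 1 ;;
esac
EOF
chmod +x tomcat7.sh
```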
The rest of the GnomEx instructions at http://hci-scrum.hci.utah.edu/gnomexdoc/?page_id=102 are pretty straightforward. Notes below on what I did differently.
  • JDBC driver (jar file) was copied into $CATALINA_HOME/lib/
  • All mySQL setup was done through phpMyAdmin.
    • As admin, create both users (gnomex and gnomexGuest)
    • Log out and log in as gnomex
    • Create a database called gnomex
    • Paste SQL and execute SQL code from gnomex_db_ddl.sql and then gnomex_db_populate.sql.
  • Only the GnomEx "Super Admin" seems to be able to edit dictionaries once you enter the interface, so I kept the original admin account and changed its password.