Thursday, March 6, 2014

Filtering FASTQ files for unique reads

Filtering out duplicate reads in FASTQ files can be important if your application needs to count only unique entries.

Brent Pedersen wrote a very quick script that uses Bloom filters for this purpose (read more at http://hackmap.blogspot.sg/2010/10/bloom-filter-ing-repeated-reads.html). The installation process might not be clear to those unfamiliar with code, so I'll try to explain it step by step here.

To run the fastq_unique.py script, you'll need three things:
  1. Perl module Bloom::Faster
    • either install it through cpan or download it manually
  2. Python module nose (used by the pybloomfaster tests)
    • installation directions are on the nose page
  3. Brent's wrapper pybloomfaster
    • download the master zip and unpack it
    • run sudo python setup.py install from the unpacked directory
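
If you just want to see the idea behind the filtering before installing anything, here is a minimal sketch in plain Python. It is not Brent's fastq_unique.py and does not use pybloomfaster; the BloomFilter class, the unique_fastq function and the default sizes are my own hypothetical names and values, chosen only for illustration. It assumes simple four-line FASTQ records and treats two reads as duplicates when their sequence lines match.

```python
import hashlib
import io


class BloomFilter:
    """Tiny illustrative Bloom filter (hypothetical, not the pybloomfaster API)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def unique_fastq(in_handle, out_handle, bloom):
    """Copy 4-line FASTQ records whose sequence has not been seen yet."""
    kept = dropped = 0
    while True:
        record = [in_handle.readline() for _ in range(4)]
        if not record[0]:  # end of file
            break
        if bloom.add(record[1].strip()):
            dropped += 1
        else:
            out_handle.write("".join(record))
            kept += 1
    return kept, dropped


# Small in-memory demo: r2 repeats the sequence of r1.
demo = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTGA\n+\nIIII\n"
out = io.StringIO()
kept, dropped = unique_fastq(io.StringIO(demo), out, BloomFilter())
print(kept, dropped)
```

A Bloom filter never misses a true duplicate, but it can occasionally flag a new read as already seen, so a few unique reads may be dropped. Making the bit array larger relative to the number of reads lowers that rate, at the cost of more memory.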
       
       
