Multiprocessing architectures in Python

I am currently contributing code to create a multi-threaded version of a piece of bioinformatics software I use heavily, Cutadapt (http://cutadapt.readthedocs.org/en/stable/guide.html). Cutadapt is a Python program that reads through the records in a FASTQ file (or pair of FASTQ files) and performs adapter and quality trimming, so the architecture of the program is “read sequentially from one or more files, modify the data, and write the results to one or more output files.” This raised the question of the best way to implement parallel processing in Python.

From prior experience with parallel processing, I know that the file reading and writing cannot easily be parallelized, so the general architecture is to have a reader thread that reads from the input file(s), worker threads that perform the adapter and quality trimming, and a writer thread that accumulates the results from the workers and writes them to disk.

From prior Python parallel processing experience, I know that the Global Interpreter Lock (GIL) is a major impediment, and that the easiest way around it is to use process-based parallelism, as implemented in the multiprocessing module. But multiprocessing still leaves a lot of flexibility as to how the parallelism should actually be implemented. After some reading and experimenting, I determined that the two best approaches would be:

  1. Have a queue for passing input reads between the reader and the workers, and another queue for passing trimmed reads from the workers to the writer.
  2. Use a group of shared-memory Arrays to store the raw bytes from the input file. Have a worker parse the data in an Array, perform the trimming, convert the results back to bytes, and write them back to the shared-memory Array. The writer thread would then read from the Array and write the bytes to disk. Queues are used to communicate between the threads which Arrays are in which state.

In implementing these approaches, I collected groups of reads into chunks of 1000 and passed those to the worker threads, rather than passing individual reads.
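
To make the queue-based design (option #1) concrete, here is a minimal sketch under stated assumptions: read_chunks and trim_chunk are hypothetical placeholders, not Cutadapt's actual code, and shutdown is handled with simple sentinel values rather than the shared control Value described further down.

```python
import multiprocessing as mp

CHUNK_SIZE = 1000
SENTINEL = None  # signals "no more chunks"


def read_chunks(path, chunk_size=CHUNK_SIZE):
    """Hypothetical reader: yield lists of chunk_size lines from path."""
    chunk = []
    with open(path) as fh:
        for record in fh:
            chunk.append(record)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk


def trim_chunk(chunk):
    """Hypothetical stand-in for adapter/quality trimming of one chunk."""
    return chunk


def worker(in_queue, out_queue):
    # Pull chunks from the reader, trim them, and pass results to the writer.
    while True:
        chunk = in_queue.get()
        if chunk is SENTINEL:
            break
        out_queue.put(trim_chunk(chunk))


def writer(out_queue, path):
    # A single process accumulates trimmed chunks and writes them to disk.
    with open(path, "w") as fh:
        while True:
            chunk = out_queue.get()
            if chunk is SENTINEL:
                break
            fh.writelines(chunk)


def run(input_path, output_path, n_workers=4):
    in_queue, out_queue = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(in_queue, out_queue))
               for _ in range(n_workers)]
    writer_proc = mp.Process(target=writer, args=(out_queue, output_path))
    for p in workers + [writer_proc]:
        p.start()
    for chunk in read_chunks(input_path):   # the main process acts as the reader
        in_queue.put(chunk)
    for _ in workers:
        in_queue.put(SENTINEL)              # one sentinel per worker
    for p in workers:
        p.join()
    out_queue.put(SENTINEL)                 # all workers done; stop the writer
    writer_proc.join()


if __name__ == "__main__":
    run("input.fastq", "trimmed.fastq")     # assumes input.fastq exists
```

A real implementation would also have to deal with the ordering of chunks in the output and with paired input/output files; this sketch ignores both, since they don't affect the comparison between the two architectures.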

Intuitively, it seemed like option #2 would prove to be dramatically faster; however, this was not the case. In fact, there was no significant difference in run time between the two approaches. Given that option #1 is a lot easier to write, easier to read, and less bug-prone, there is no reason to use shared-memory Arrays for this class of problem.

Another thing I did that I haven’t seen elsewhere is use a shared integer Value as a control for the sub-processes. The Value starts at 0. If the main process encounters an error or the user cancels the program (Ctrl-C), the main process sets the control value to -1, which signals the sub-processes to terminate immediately. If the main process completes successfully, it sets the control value to a positive number: the number of chunks that it has placed on the input queue. If the workers ever find the input queue empty and the control value set to a positive number, they know to terminate. The writer thread runs until it has processed a number of chunks equal to the control value.
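
Here is a rough, self-contained sketch of that control scheme, assuming the same chunked queue setup as above; the trim function is a placeholder and the shared results list stands in for writing to disk. The actual implementation is in the gist linked below.

```python
import multiprocessing as mp
import queue


def trim(chunk):
    """Hypothetical stand-in for the per-chunk trimming work."""
    return chunk


def worker(in_queue, out_queue, control):
    while True:
        if control.value < 0:
            return                    # main process hit an error or Ctrl-C
        try:
            chunk = in_queue.get(timeout=1)
        except queue.Empty:
            if control.value > 0:
                return                # reader finished and the queue is drained
            continue
        out_queue.put(trim(chunk))


def writer(out_queue, control, results):
    written = 0
    # Run until we have handled as many chunks as the reader enqueued.
    while control.value <= 0 or written < control.value:
        if control.value < 0:
            return
        try:
            results.append(out_queue.get(timeout=1))  # stand-in for writing to disk
            written += 1
        except queue.Empty:
            continue


def main(chunks, n_workers=2):
    control = mp.Value('i', 0)        # 0 = still reading, -1 = abort, >0 = total chunks
    in_q, out_q = mp.Queue(), mp.Queue()
    manager = mp.Manager()
    results = manager.list()
    procs = [mp.Process(target=worker, args=(in_q, out_q, control))
             for _ in range(n_workers)]
    procs.append(mp.Process(target=writer, args=(out_q, control, results)))
    for p in procs:
        p.start()
    try:
        n = 0
        for chunk in chunks:          # the main process acts as the reader
            in_q.put(chunk)
            n += 1
        control.value = n             # success: announce the total chunk count
    except (Exception, KeyboardInterrupt):
        control.value = -1            # signal all sub-processes to terminate now
        raise
    finally:
        for p in procs:
            p.join()
    return list(results)


if __name__ == "__main__":
    print(main([[i] for i in range(10)]))
```

Only the main process ever writes to the control Value here, so plain assignment is enough for a sketch; if multiple processes needed to update it, the assignments should be guarded with the Value's lock.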

The code for these two different implementations can be found here: https://gist.github.com/jdidion/c65905b340ed4da7e40f


Quantum Craps

It’s said that god doesn’t play dice with the universe, but what if she instead simultaneously rolls an infinite number of infinitely-sided dice and chooses the subset of dice that represent the outcome she prefers? I think most people would call this cheating; yet if she rolled only a single set of dice and got the same outcome by chance, it would instead be called luck. Is there any substantial difference between the two situations?

Let me put this philosophical discussion in context. Hypothetically, I have a relatively large number of biological samples (e.g. 1000). These samples are both quantity-limited and precious (i.e. I can’t get any more if I run out). I want to run an assay on these samples. The assay is run in batches of 92 samples. Typically, it requires a relatively large amount of input material, but my samples are so precious that I am not willing to use up the typical amount. Instead, I run a pilot project on a single batch of my samples that have large quantities of DNA available, to determine whether the assay generates good results with lower input (it does). Before I have a chance to run the assay on the rest of my samples, a new and much improved version of the assay comes out. In the meantime, I have realized that my original experimental design was a poor one, because it confounds several covariates (individual and tissue type) with assay batch. I decide that I now want to run version 2 of the assay on all of my samples using a randomized design. However, version 2 of the assay is so new that the company I have contracted to run the assay has not tested it. They claim that since the chemistry and workflow are identical to version 1, it should work identically well. And here is my skeptical face.

In any case, they’re not willing to do any validation prior to running our samples. They are also not willing to let us run a smaller batch of samples (or, rather, they would, but they would still charge us for a full batch). We could wait for someone else to be the guinea pig, but that would unacceptably delay the timeframe of the experiment.

The solution I arrived at is as follows:

  1. Randomize the samples a large number of times (100,000).
  2. Choose the randomization that maximizes the number of samples from my original pilot experiment that I can run in a single batch (a sketch of this search appears after this list).
  3. Also use a control sample (a well-studied cell line with lots of existing data from version 1 of the assay). Add two technical replicates of the control to each batch, to allow me to regress out batch effects during data processing.
  4. Run the assay on the first batch of samples, and wait for the data to perform validation against my pilot data.
  5. Assuming everything goes well, run the assay on the remaining samples.
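
For what it's worth, steps 1-2 amount to a simple brute-force search. Here is a minimal sketch, with made-up sample IDs and an assumed pilot set of 92 samples; it ignores where the two control replicates go within each batch.

```python
import random

N_SAMPLES = 1000       # total number of samples (as in the example above)
BATCH_SIZE = 92        # samples per assay batch, before adding the two controls
N_SHUFFLES = 100_000   # number of randomizations, as in step 1 (reduce for a quick test)

# Hypothetical sample IDs; assume the first 92 were the pilot batch.
samples = [f"S{i:04d}" for i in range(N_SAMPLES)]
pilot = set(samples[:BATCH_SIZE])


def batches(order, size=BATCH_SIZE):
    """Split an ordering of samples into consecutive batches."""
    return [order[i:i + size] for i in range(0, len(order), size)]


rng = random.Random(42)  # fixed seed so the chosen design is reproducible
best_order, best_score = None, -1
for _ in range(N_SHUFFLES):
    order = samples[:]
    rng.shuffle(order)
    # Score a randomization by the largest number of pilot samples that
    # end up together in any single batch.
    score = max(sum(s in pilot for s in batch) for batch in batches(order))
    if score > best_score:
        best_order, best_score = order, score

print(f"best randomization places {best_score} pilot samples in one batch")
```

Whether the best score can actually reach a full batch of pilot samples depends on how the controls and batch boundaries are laid out, which is exactly the detail this sketch glosses over.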

So, yes, I’m definitely cheating in steps 1-2. But is this form of cheating something that will compromise the integrity of my data? Is it something on which the reviewers of my paper or future grants will call me out? Should I feel bad about myself for doing this?


Mining Journal Article Recommendations from Twitter Feeds

Twitter is great for learning about exciting new journal articles that are relevant to your field. I follow many scientists who are active on Twitter and have research interests similar to mine. The problem is that keeping up with my Twitter feed, and separating the signal from the noise, can be overwhelming.

What I want is an automated way to mine my Twitter feed for journal articles. I also don’t want to have to worry about hosting it myself, and I don’t want to have to code a complicated solution if I can avoid it (I do enough of that at work).

A halfway solution

The simplest, but least elegant, solution for me was to create a Twitter list of all the scientists I want to follow for paper recommendations. I then added a column in TweetDeck dedicated to that list and filtered it to show only tweets containing links. I also tried to filter the list to show only tweets containing “paper” OR “article,” but there appears to be a bug in TweetDeck that causes Boolean operators not to work, at least in this circumstance.

A better solution

I use RSS feeds heavily, so an ideal solution would convert my Twitter list to an RSS feed, then filter it to include only tweets containing links (and maybe even filter further to only include links to specific publisher websites). Unfortunately, I cannot find a service to convert a Twitter list to an RSS feed, but there are services that will convert individual user timelines to RSS feeds. This, combined with an RSS feed aggregator, yields a similar result.

Here’s how to do it:

  1. Use TwitRSS.me to create an RSS feed for each person you want to follow. Simply type their username in the text box, click “Fetch RSS,” and copy the URL for the Twitter feed. Even easier: the URLs are always of the form “https://twitrss.me/twitter_user_to_rss/?user={username},” so you don’t even have to go to the website to do this. Just copy this template, go to step 2, and substitute each username you want to follow for {username} (see the short script after this list).
  2. Use FeedKiller to aggregate the individual RSS feeds into a single feed. Unfortunately, I cannot find a way to edit an existing aggregate feed, so if you ever want to add a new user you have to build the list again from scratch. I save all the TwitRSS.me links to a text file on my computer and cut-and-paste them all into FeedKiller each time the list changes. Note that you have to specify the number of items you want to fetch from each feed, with a maximum of 10. Give your aggregate feed a title, then click “build it!” This part is important: make note of the ID at the end of the feed. The actual URL of your aggregate RSS feed is: http://feedkiller.com/files/inc.xml.php?id={ID}.
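
Since the TwitRSS.me URLs follow a fixed template, a short script can generate the whole list to paste into FeedKiller. The usernames and output filename below are just placeholders.

```python
# Build the list of TwitRSS.me feed URLs to paste into FeedKiller.
usernames = ["user_one", "user_two", "user_three"]

template = "https://twitrss.me/twitter_user_to_rss/?user={username}"

with open("twitter_feeds.txt", "w") as out:
    for name in usernames:
        out.write(template.format(username=name) + "\n")
```

Paste the contents of the resulting text file into FeedKiller as described in step 2.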

The missing part of the solution is filtering the feed. I have heard that FeedRinse is the best service for feed filtering, but it appears not to be working at the moment. They are promising a “2.0” upgrade at some point, but who knows when that will arrive. I think FeedRinse will also take care of step 2 (i.e. it allows you to create an aggregate list and filter it). I’ve signed up to be notified when the new FeedRinse is available, so I’ll update this post if and when I get the full solution working.

Update (12/17/2015)

Well, FeedRinse is still “under construction,” and I’m not holding my breath. I have searched extensively but have not found any RSS feed filtering service that actually works. I have also tried several times to make PubMed work for me, but I’ve been unhappy with the results. So I’m back to my old solution: subscribing to the RSS feeds of all the journals I’m interested in and trying to find time to sort through the articles I’m not interested in to find the few gems. However, I have also added two new strategies that I hope can replace my reliance on RSS feeds:

  1. I added a lot more scientists to my Twitter list. I plan to keep track of all the papers I source via Twitter and the time I spend going through my Twitter feed, and compare that to my RSS feeds, to see which method is more efficient.
  2. I signed up for PubChase, which looks quite promising. I uploaded my entire Zotero library (~1000 publications), so it will be interesting to see how their automatic recommendations stack up against the papers I manually pull out of RSS and Twitter. I have also been signed up for SciReader (http://scireader.org/) for a while. While I usually find at least a few papers in the weekly e-mail digest, I’m frustrated and confused by their decision not to provide an RSS feed. Maybe if I complain enough they’ll change their minds.

Update (12/30/2015)

I recently discovered Nuzzel, which has made keeping up with publications via Twitter much easier. Essentially, Nuzzel mines your Twitter feed for links and presents them in a nicely formatted list that you can sort by popularity or recency (I prefer the latter). With Nuzzel, I find that I only need a much smaller list of people I want to interact with at a personal level. That has greatly stemmed the tide of tweets, which in turn makes me feel less anxious about checking Twitter at every free moment lest I get buried in an avalanche of unread tweets.

PubChase’s recommendations have proven pretty good, but they don’t catch anywhere close to everything that I source from the journal RSS feeds. So for now I’ll continue to use RSS as my primary means of keeping up with the literature.


Papers2 python library

I created a Python interface to the Papers2 database schema that can be used to extract information for export to other formats. Currently, a single export script is provided, which uploads your library to Zotero.


R packages

I’ve started to organize my commonly used R code into packages, and have made them available on GitHub. Note that these are not yet well commented and don’t have unit tests. Please open issues for any bugs you find.

The available packages are:

  • knitrtools: Useful tools for knitr document generation. Of particular interest is the figr.ref function, which relies on kfigr and provides nice formatting of intra-document references.
  • fancyplots: Implementations of specific plot types, mostly for genetic/genomic data, including phylogenetic trees (see radial.phylog), spatial data (see plot.genome.data), and GWAS results.
  • statTests: Implementation of various statistical tests and distributions.
  • intensities: Methods for transformation and plotting of hybridization array intensity data. Includes a single-method implementation of tQN normalization.
  • geneticsFormats: Methods for converting between different genetics formats.
  • DEUtils: Convenience methods for working with RNA-Seq differential expression R packages, including DESeq2, DEXSeq, and GO enrichment.
  • miscUtils: Some miscellaneous functions, mostly for working with IO and collection data types.