Distributed Processing with EC2 using PoolParty


PoolParty looks promising.  Tonight I’ll try using it to replace a script’s use of Sun Grid Engine (SGE) with EC2, primarily because it gives me a disk shared across instances via S3 for free.  That shared disk happens to be all I need to pull this off fairly easily.

I’m using it to parallelize FSL, a software package that analyzes fMRI data.  Specifically, I’m modifying FSL’s parallelization function to support EC2 alongside the Sun Grid Engine support it already has.  The existing process produces a file called commands.txt.  Each line needs to be run on its own EC2 instance with access to the shared data directory, where the output will be placed.  The existing shell script then takes over, which matches the behaviour SGE gives for this task.

To help with this, Ari (the PoolParty guy) wrote a nice PoolParty plugin for this very task.

We’ll ignore the actual use case (it just makes a good story) and simplify the problem to: launch a pool party that runs each command and writes its output to /data/output1, /data/output2, etc.

My commands.txt file looks like:

ls /bin > /data/output1
ls /root > /data/output2

Then it’s just a matter of running:

pool -v -i -I ami-1234 -b shared_bucket

This launches the appropriate number of instances, and at the end of the process all of the data has been written to the appropriate place.
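If you want to see the shape of the job without launching anything, here is a rough local sketch in Python.  It is not PoolParty’s API and not Ari’s plugin: read_commands and run_all are names I made up for illustration, each local shell stands in for one EC2 instance, and a plain working directory stands in for the shared data directory.  It only shows the pattern: one job per line of commands.txt, all run in parallel, and nothing else happens until every one of them has finished.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_commands(path):
    """Return the non-empty lines of a commands file, one job per line."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def run_all(commands, workdir):
    """Run every command in its own shell, in parallel, and wait for all of them.

    Assumption: each local shell stands in for one EC2 instance, and
    workdir stands in for the shared data directory.
    Returns the exit codes in the same order as the commands.
    """
    if not commands:
        return []
    # One worker per command, mirroring "one instance per line".
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = [
            pool.submit(subprocess.run, cmd, shell=True, cwd=workdir)
            for cmd in commands
        ]
        # result() blocks, so we only return once every job is done.
        return [future.result().returncode for future in futures]


if __name__ == "__main__":
    # Demo in a throwaway directory instead of a real shared /data mount.
    with tempfile.TemporaryDirectory() as shared:
        commands_file = Path(shared) / "commands.txt"
        commands_file.write_text("echo one > output1\necho two > output2\n")
        exit_codes = run_all(read_commands(commands_file), shared)
        print(exit_codes)
        print(sorted(p.name for p in Path(shared).iterdir()))
```

The part that matters is the wait at the end: the surrounding shell script can only take over safely once every line has finished, which is the same guarantee the SGE version relies on.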

I actually wrote no Ruby code for this.  Ari wrote the plugin that handled my whole use case, but it was really nice to see how tiny a plugin can be.  Ultimately, this gem looks intensely interesting for our generic use case, which is simply distributed processing of a massively parallelizable computation.  It completely replaces the need for Sun Grid Engine in this case.

Everyone join the Pool Party.

NOTE: Documentation is sparse at present, but Ari is available in #poolpartyrb on Freenode, and I would be glad to help, in whatever small way I can, anyone who wants to talk about how to use PoolParty.  Ari has said that this week is to be documentation week for the project.  And go donate to the project if you can.

