Working the storage scalability problem

The environment and our problem

As a thought experiment, let us pretend that we have a website that hosts user projects. Users are allowed to upload files to their projects. Individual files are limited to 100 MB, but there are no file type restrictions. Some time later, our users start asking for version control support. We note that the majority of files submitted by our users are either text files or source code and decide that Git is a good option for implementing a version control system.

Fast forward another couple of months. Our user base has continued to grow steadily and their projects are getting larger and larger. We realize that we have a problem: our server is going to run out of local disk space soon. What’s the best workaround for this while sticking to our shoestring budget? That is a very good question.

Is there a simple solution?

When I was asked to think of a solution, my mind immediately jumped to Amazon Web Services. Specifically, moving our server to an Elastic Compute Cloud (EC2) instance connected to an Elastic Block Store (EBS) volume for storage. Combining the power and resource scalability of an EC2 instance with an easy-to-use and fast EBS volume could solve the problem. However, after a bit of reflection, some very big issues started to arise.

EBS volumes can only be mounted by one EC2 instance at a given time.

Mounting to a single EC2 instance makes scaling our EC2 setup horizontally problematic. If we need to have several EC2 instances communicating with the EBS volume, we have to set up a middleman to act as a go-between for the EC2 instances doing the work and the EBS volume storing our Git repositories. So what, right? Well, now we have the added expense and maintenance (however minimal they may be) of another EC2 instance and we’ve introduced a single point of failure. But those aren’t our only problems.

There is no way to easily resize EBS volumes.

Let’s assume for a second we went ahead and implemented the EC2 with an EBS volume setup. We have created a 500 GB EBS volume and mounted it to our EC2 instance. All is well for another few months until our estimates show that our Git repositories will surpass 500 GB within two weeks. In order to increase the size of our EBS volume we have to create an additional EBS volume at a new set size, let’s say 1 TB, mount it to our EC2 instance, and copy everything over from the original EBS volume. Only then can we destroy the old EBS volume. Suddenly I’m having flashbacks to my professor discussing how to resize arrays in my data structures class. And don’t forget about that shoestring budget we’re on.

At $0.10 per GB we’ve just increased our price from $50/month to $100/month for space that we aren’t using. Furthermore, we don’t even know when it will be used and as soon as we approach the 1 TB mark, it will be time to start the process all over again. If we follow the same model of doubling our size, we will again double our costs for unused space. Why not increase the size of our EBS volume in smaller amounts more frequently? It doesn’t solve the problem; we are still purchasing unused space while adding complexity to our system architecture.
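
To make the arithmetic concrete, here is a minimal sketch of the doubling model. The $0.10 per GB rate and the 500 GB starting size come from the example above; the function name and the usage figures in the loop are made up for illustration, and 1 TB is treated as 1,000 GB.

```python
EBS_PRICE_PER_GB = 0.10  # USD per GB-month, the rate quoted above


def provisioned_cost(used_gb, start_gb=500):
    """Return (volume_gb, monthly_cost, unused_gb) under a doubling policy.

    The volume starts at start_gb and doubles until it can hold used_gb.
    """
    volume_gb = start_gb
    while volume_gb < used_gb:
        volume_gb *= 2
    monthly_cost = volume_gb * EBS_PRICE_PER_GB
    return volume_gb, monthly_cost, volume_gb - used_gb


# Hypothetical usage levels just before and after each resize.
for used in (450, 510, 1100):
    size, cost, unused = provisioned_cost(used)
    print("%5d GB used -> %5d GB volume, $%.2f/month, %d GB idle"
          % (used, size, cost, unused))
```

Crossing the 500 GB line by only a few gigabytes is what doubles the bill, and most of the new volume sits idle.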

Is there an alternative solution? Perhaps a solution that will alleviate our bottleneck and resizing issues? There might be. And it could be much, much cheaper.

Introducing Amazon Simple Storage Service (S3)

To quote Amazon, “S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.” Basically, S3 provides us with a way to read, write, and delete files of almost any size (up to 5 TB per file) very cheaply. At a mere $0.076 per GB for the first 49 TB, and even less if we want more storage, we see some solid savings. Before adding our input/output costs, we have just cut costs to roughly 3/4 of what an equivalent EBS volume would have run us.
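
A quick sketch of where the "3/4" figure comes from, using only the two per-GB rates quoted in this post. The 510 GB scenario is a hypothetical I picked to mirror the resize example earlier; request and transfer fees are ignored, as in the text.

```python
EBS_PRICE_PER_GB = 0.10   # USD per GB-month, billed on provisioned size
S3_PRICE_PER_GB = 0.076   # USD per GB-month, billed on data actually stored


def monthly_cost(gigabytes, price_per_gb):
    """Storage-only monthly cost; request and transfer fees are ignored."""
    return gigabytes * price_per_gb


# Same amount of data on both services: the "3/4" figure.
ratio = monthly_cost(500, S3_PRICE_PER_GB) / monthly_cost(500, EBS_PRICE_PER_GB)
print("S3 costs %.0f%% of EBS for the same 500 GB" % (ratio * 100))

# Hypothetical: 510 GB of repositories sitting on a 1,000 GB EBS volume.
ebs = monthly_cost(1000, EBS_PRICE_PER_GB)
s3 = monthly_cost(510, S3_PRICE_PER_GB)
print("EBS: $%.2f/month, S3: $%.2f/month" % (ebs, s3))
```

The second comparison is the more interesting one: with S3 we only pay for what is stored, so the gap widens right after an EBS resize.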

What about resizing?

S3 is truly scalable. There is no limit to how large our S3 account can become or how many files we can store in it. Furthermore, we never need to worry about logging into our AWS account and increasing our allotted size, or all of this nonsense regarding copying data from one volume to another; this is all done automatically allowing us to focus on what’s important, our data.

Okay, and the bottleneck?

Again, our problems have been resolved. There is no need to have a middleman connecting the S3 instance to our servers. We can access S3 from anywhere, on any server, through API calls. My language of choice is Python so I have to give a shout out to boto. The good folks that developed boto have provided us with an incredibly simple way to interact with S3. Don’t take my word for it though, Amazon wrote its own tutorial on how to use it.

What’s the catch?

There’s always a catch, right? Right. S3 isn’t exactly designed to work in the way we want it to for our problem. Unlike a storage device such as a hard drive, USB stick, or even an EBS volume, there is no file system. In the simplest terms possible, we can think of S3 as a file dump like Google Drive or Dropbox. Only, instead of uploading files through a browser or desktop application, we access it via API calls. These uploaded files are stored in “buckets,” which can be thought of as directories in an abstract sense. How is this a problem?

Well, let’s say we want to store a Git repository on S3 that looks like:


In short, we can’t. There are some solutions using file naming conventions that can be interpreted as directory structures but to be quite frank, they seem very “hackish.” I don’t like “hackish” solutions in production environments.
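
For reference, the naming-convention approach usually amounts to flattening each file's relative path into a single key string, with slashes that only look like directories. Here is a minimal sketch of that idea, assuming a local working copy. The function name and the prefix are hypothetical, and nothing here talks to S3.

```python
import os


def keys_for_tree(root, prefix=""):
    """Map every file under root to a flat, slash-separated key name."""
    keys = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            # Always use "/" so the keys look the same on every platform.
            keys.append(prefix + rel.replace(os.sep, "/"))
    return sorted(keys)
```

Every file becomes one object, and empty directories have no representation at all, which is part of why this feels like a hack for something as structure-sensitive as a Git repository.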

A possible workaround

I did some digging around and found that, as is always the case, I wasn’t the first person to try to store Git repositories on S3. Some very smart people took a very interesting approach using a Filesystem in Userspace (FUSE). S3FS, or “FUSE over Amazon,” allows us to mount S3 buckets as a local filesystem that we can read from and write to transparently. This sounds pretty cool, but just how fast can a fake file system that has been placed on S3 and mounted locally be? Let’s find out.

Benchmarking S3FS with Git

Consistency concerns with S3

S3 has an eventual consistency model. This term means that after writing, updating, or deleting a file it will eventually be viewable from all parties. However, there is no time limit regarding how long this may take. My personal experience has shown changes are generally visible within a second or two.
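
Under eventual consistency, a reader that needs to see a fresh write typically ends up polling for it. A minimal sketch of that pattern follows. `is_visible` is a hypothetical callable you would supply (for example, one that checks whether a key exists yet), and the timeout values are arbitrary.

```python
import time


def wait_until_visible(is_visible, timeout=10.0, interval=0.5):
    """Poll is_visible() until it returns True or timeout seconds pass.

    Returns True if the change became visible in time, False otherwise.
    """
    deadline = time.time() + timeout
    while True:
        if is_visible():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)
```

Since there is no upper bound on how long propagation may take, the caller still has to decide what to do when this returns False.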

In 2011, the EU and US-West S3 regions implemented read-after-write consistency. This means that new writes should be visible to all parties immediately after the write has occurred. Our use case for S3 is: create a filesystem with S3 on top of which Git repositories can be created, letting buckets represent the top-level directories of our repositories.

Git very rarely deletes something. Instead it either adds a new hash and snapshot or a pointer to the previous, unmodified version of a file. As a result, I chose to use us-west, at the cost of slightly more expensive rates. Time to benchmark.

The setup

I conducted all of my benchmarks with Ubuntu 12.04. Identical setups were installed in a VirtualBox VM running locally and on an EC2 instance. I will spare you the installation details. If you would like to have specific installation instructions, I would recommend checking out this blog post.

After installing S3FS and setting up a test bucket, I created the following file structure:


Then, each of the following actions was executed 10 times:

  1. initializing the git repository
  2. adding all files to the repository
  3. committing the new changes to the repository
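
The steps above can be timed with a small harness along these lines. This is a sketch of the general approach, not the exact script I ran: the function names are hypothetical, the commit identity is a throwaway, and `repo_dir` would be a directory on the S3FS mount.

```python
import subprocess
import time


def timed(cmd, cwd=None):
    """Run cmd to completion and return the elapsed wall-clock seconds."""
    start = time.time()
    subprocess.check_call(cmd, cwd=cwd)
    return time.time() - start


def benchmark_repo(repo_dir):
    """Time git init, add, and commit once in repo_dir.

    repo_dir must already hold the test files and must not be a
    repository yet, so each run needs a fresh copy of the files.
    """
    timings = {}
    timings["init"] = timed(["git", "init"], cwd=repo_dir)
    timings["add"] = timed(["git", "add", "."], cwd=repo_dir)
    timings["commit"] = timed(
        ["git", "-c", "user.name=bench", "-c", "user.email=bench@example.com",
         "commit", "-m", "benchmark commit"],
        cwd=repo_dir,
    )
    return timings


def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))
```

Running `benchmark_repo()` on ten fresh copies of the test files and passing each step's timings to `average()` follows the procedure described above.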

Surprisingly poor results

WITH caching FROM local machine ON Oregon
  • Git init times: 130s
  • Git add times: 15.5s
  • Git commit times: 38.4s

WITH caching FROM ec2 instance ON Oregon

  • Git init times: 103.2s
  • Git add times: 12.2s
  • Git commit times: 36.7s

WITH caching FROM local machine ON Oregon AND max_stat_cache_size=10000000

  • Git init times: 138s
  • Git add times: 16s
  • Git commit times: 39s

WITH caching FROM ec2 instance ON Oregon AND max_stat_cache_size=10000000

  • Git init times: 105.8s
  • Git add times: 12.4s
  • Git commit times: 37.7s

As you can see, initialization, add, and commit times were terribly slow. Although operations run on the EC2 instance were noticeably faster than those run on my local machine, they are still nowhere near quick enough. On the bright side, pulling files from S3 was rather fast. Every pull with the tiny repositories in my example took less than one second. But the slow write times weren’t the only things I found to be a bit troubling.
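
To put a number on "noticeably faster," here is a small calculation over the first two result sets above (caching enabled, default stat cache). The figures are copied from the lists; the helper name is my own.

```python
# Timings in seconds, copied from the "WITH caching" results above.
local = {"init": 130.0, "add": 15.5, "commit": 38.4}
ec2 = {"init": 103.2, "add": 12.2, "commit": 36.7}


def percent_faster(slow, fast):
    """How much shorter fast is than slow, as a percentage of slow."""
    return (slow - fast) / slow * 100.0


for step in ("init", "add", "commit"):
    print("git %-6s on EC2 took %.1f%% less time"
          % (step, percent_faster(local[step], ec2[step])))
```

Even the best case shaves off only about a fifth of the time, which leaves the absolute numbers far too slow.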

During this process I also learned that S3FS mounts required root access, mounts were not always stable, and they would inconsistently disconnect. Unfortunately, this idea doesn’t seem to be a workable solution either.

Other Solutions


JGit is a Java implementation of Git and is hosted by Eclipse, an open source Java IDE. I found this article describing how to use JGit to store repositories on S3. However, I have no experience with JGit or its community and this makes me hesitant to use it without a lot more research.


Remember how I said that pulling files from S3 was pretty fast? Actually, it was so strikingly fast compared to writing that it stuck in my head for a while and got me thinking about an alternative solution. What if we used the server’s local storage as a caching layer which works with our Git repositories until the project no longer needs immediate access? We could then use S3 as an external storage device for storing neatly packed repositories until they are needed?
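
As a rough sketch of the "neatly packed" half of that idea: an idle repository could be bundled into a single archive before being shipped off to S3, then unpacked again when someone needs it. This uses only the standard library, the function names are hypothetical, and the actual upload and download are left out.

```python
import os
import tarfile


def pack_repository(repo_dir, archive_path):
    """Write repo_dir (including its .git directory) to a .tar.gz file."""
    name = os.path.basename(os.path.normpath(repo_dir))
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(repo_dir, arcname=name)
    return archive_path


def unpack_repository(archive_path, dest_dir):
    """Restore a packed repository as a subdirectory of dest_dir."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
```

One large object per repository also sidesteps the many-small-writes pattern that made the S3FS benchmarks so slow.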

That sounds like an interesting, fun project; more to follow. Time to do some more brainstorming.

"If at first you don't succeed, 
Try, try, try again." - William Edward Hickson


GitHub offers free student accounts?!

Last fall I started using GitHub to help me maintain and disseminate projects I work on. Not sure what GitHub is? From its main page, GitHub offers “Powerful collaboration, code review, and code management for open source and private projects.”

Built on top of the incredibly powerful version control system Git, GitHub provides users a friendly, intuitive wrapper for Git’s somewhat confusing system on the web (or the desktop!). 

I could go on and on about how much I love Git or how GitHub has made working with peers/co-workers much less stressful, code reviews a pleasure, and handling complicated branches a breeze. It is pretty much awesome, right? Right. But it just got even MORE awesome!

One of my co-workers informed me she had signed up for a free student upgrade with GitHub. By default anyone can sign up and maintain as many public repositories as they like. However, in order to maintain private repositories one would need to sign up for at least a micro plan. Unaware of a student discount and always in search of a deal (every penny counts as a college student!) I wrote the GitHub support staff a quick email about my recent revelation.

In less than 24 hours the very friendly (and humorous) folks at GitHub had replied and upgraded my account to a free micro plan for two years! That’s a total savings of $168 (assuming prices don’t go up between now and then). If only more companies were as student-friendly. I give a tip of my hat to the folks at GitHub. I will be sure my local ACM chapter, peers, and developer friends know about this.

For more information about GitHub’s student plan you can visit:

Additionally, I highly recommend reading more about Git itself. The good people who created it were kind enough to post a detailed, easy-to-read, and free book online here. If you are a book in hand kind of person, you can grab one from Amazon too!

Thanks, GitHub!



First Steps with Celery: How to Not Trip

Recently, I was tasked with integrating a task queue into a web framework at work. For the purpose of this post, I would like to note that I am operating with Python 2.7.5, Flask 0.9, Celery 3.0.21, and RabbitMQ 3.1.3. This post was written using IPython 0.13.2 in an IPython notebook.

Now, I’ve never implemented a task queue before and boy did that ever make this difficult. A quick search result showed that Celery was the main player in the Python task queue arena.

Before diving into the code base at work I set up a virtualenv and followed Celery’s First Steps with Celery tutorial. It was easy, as was the Next Steps tutorial. I would go so far as to say they were too simple. When I went to apply my freshly earned skills to my code base I ran into a series of walls. Unfortunately, I didn’t have any luck pinging either Celery’s IRC channel #celery or their Google group.

But, eventually I figured it out. I’m writing this so that you will (hopefully) avoid similar frustrations. Enjoy!

Picking a Broker

Celery requires a message broker. This broker acts as a middleman, passing messages to Celery workers, which in turn process tasks as they receive them.
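
If the broker/worker split is new to you, this standard-library toy may help. It is only an analogy: a `Queue` stands in for the broker, a thread stands in for a Celery worker, and a plain list stands in for the result backend. None of these names come from Celery.

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2


def worker(tasks, results):
    """Pull (func, args) pairs off the queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        func, args = item
        results.append(func(*args))


tasks = queue.Queue()   # stands in for the broker
results = []            # stands in for the result backend
thread = threading.Thread(target=worker, args=(tasks, results))
thread.start()

tasks.put((pow, (2, 10)))   # the "app" queues work and moves on
tasks.put(None)             # tell the worker to stop
thread.join()
print(results)
```

The app never calls the function itself. It only drops a message on the queue, which is exactly the role RabbitMQ plays between our web framework and the Celery workers.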

Celery recommends using RabbitMQ. I opted for this as my knowledge in this area is limited, and I assumed they would likely have the most thorough and robust documentation for it.

Installing RabbitMQ in Ubuntu is easy:

    $ sudo apt-get install rabbitmq-server

Installing it on a Mac was also rather simple:

    $ brew update
    $ brew install rabbitmq

    # update your path in ~/.bash_profile or .profile with

Note: A co-worker ran into issues installing RabbitMQ via homebrew. To resolve this he followed the standalone mac installation instructions here.

Once installed, starting the server is as simple as:

    $ rabbitmq-server
    # or you can start in the background with
    $ rabbitmq-server -detached

And you can stop it with:

    $ rabbitmqctl stop

Installing Celery

Installing Celery was very simple. From within your virtualenv (you should be using virtual environments!):

    $ pip install celery

Setting up Celery config, Celery daemon, and adding ‘tasks’

The steps below are a bit more convoluted than the aforementioned tutorial provided by the Celery team. This is meant to be more of a comprehensive ‘real world’ example. If you would like something simpler please go here.

Project Structure:


Celery config —

    # config file for Celery Daemon

    # default RabbitMQ broker
    BROKER_URL = 'amqp://'

    # default RabbitMQ backend

There are a couple of things to note here. First, we are using RabbitMQ as the broker and the backend. Wait, what is the backend? The backend is the resource which returns the results of a completed task from Celery. Second, you may be wondering what amqp is. AMQP (Advanced Message Queuing Protocol) is the messaging protocol that RabbitMQ implements. More information on it can be located here.

More information on celery configuration and defaults can be found in the Celery docs.
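
For orientation, a config along the lines described above might look like the sketch below. `BROKER_URL` is taken from the snippet in this post. The `CELERY_RESULT_BACKEND` value is my assumption about how "RabbitMQ as the backend" would be spelled with Celery 3.0 setting names, so check it against the Celery docs for your version.

```python
# celeryconfig.py (sketch)

# default RabbitMQ broker, as shown above
BROKER_URL = 'amqp://'

# assumption: send task results back over AMQP as well
CELERY_RESULT_BACKEND = 'amqp'
```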

Celery daemon: Preparing our daemon —

    from __future__ import absolute_import

    from celery import Celery

    # instantiate Celery object
    celery = Celery(include=[

    # import celery config file

    if __name__ == '__main__':

The two commented portions here can be a bit confusing.

    celery = Celery(include=[

Here we are instantiating a Celery object and handing it a list containing the relative (to where you start your Celery daemon!) path to all modules containing Celery tasks.


Next, we are telling that newly instantiated Celery object to import its configuration settings from celeryconfig.

Headache Number One: Celery and relative imports

I’m sad to admit that it took me 15 minutes to figure out why I didn’t need my config file in the same directory as my daemon module. So, read this and learn from my stupid mistake.

Again, I want to emphasize everything is relative to where the Celery daemon is launched.

  • Our Celery daemon will be launched from /
  • Because the config file is located at /
  • The daemon looks for the config file in the root: celeryconfig
  • Additionally the module containing tasks is located several directories deep: /framework/email/
  • So the daemon looks for the tasks module several directories deep
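
The same rule can be seen with plain Python imports, no Celery required. This sketch builds a throwaway project in a temp directory and imports from it the way a process launched from the project root would. All file and package names here (`demo_config`, `demo_framework`, and so on) are hypothetical stand-ins, not the ones from my project.

```python
import importlib
import os
import sys
import tempfile

# Build a tiny project tree; root stands in for the launch directory.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "demo_framework", "mail")
os.makedirs(pkg)
for directory in (os.path.join(root, "demo_framework"), pkg):
    open(os.path.join(directory, "__init__.py"), "w").close()
with open(os.path.join(root, "demo_config.py"), "w") as f:
    f.write("BROKER_URL = 'amqp://'\n")
with open(os.path.join(pkg, "demo_tasks.py"), "w") as f:
    f.write("def send_email():\n    return True\n")

# Launching a process from root effectively puts root on the import path.
sys.path.insert(0, root)

# The config sits in the root, so its dotted name has no package prefix.
config = importlib.import_module("demo_config")
# The tasks module is two packages deep, so its dotted name says so.
tasks = importlib.import_module("demo_framework.mail.demo_tasks")
print(config.BROKER_URL)
print(tasks.send_email())
```

Both names are resolved from the launch directory, which is why the config needs no prefix while the tasks module needs its full dotted path.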

Creating a task: Let’s queue up some emails! —

    import smtplib
    from email.mime.text import MIMEText

    def send_email(to=None, subject=None, message=None):
        """sends email from hairycode-noreply to specified destination

        :param to: string destination address
        :param subject: subject of email
        :param message: body of message

        :return: True if successful
        """
        # placeholder sender address; substitute your own
        fro = 'YOUR_FROM_ADDRESS'

        # prep message
        msg = MIMEText(message)
        msg['Subject'] = subject
        msg['From'] = fro
        msg['To'] = to

        # send message (placeholder host and credentials)
        s = smtplib.SMTP('YOUR_SMTP_HOST')
        s.login('YOUR_USERNAME', 'YOUR_PASSWORD')
        s.sendmail(fro, [to], msg.as_string())
        return True

Making this function into a task is as simple as importing our Celery object and adding a decorator (almost).

Recall that when we instantiated our Celery daemon we handed it a list of relative paths. One of those was to this file. When Celery is started it will comb over any files in that list and look for functions decorated as tasks.


So, let’s go ahead and modify our function to meet the spec.

    from email.mime.text import MIMEText

    # import our Celery object
    from framework.celery.celery import celery

    # add the decorator so it knows send_email is a task
    @celery.task
    def send_email(to=None, subject=None, message=None):
        # code removed for brevity
        pass

If everything else is in order your app will be able to add these to the queue by calling either the .delay() or .apply_async() method. But before we can do that, let’s make sure our RabbitMQ server and Celery daemon are up and running.

Testing Our New Task

Launch RabbitMQ

Launch your RabbitMQ server in the background from the shell

    $ rabbitmq-server -detached

You can ensure it’s running in the background by inspecting your processes

    $ ps aux | grep rabbit --color

Which should yield three things

  1. A very, very long output (this is the rabbitmq-server we just launched)
  2. The RabbitMQ daemon always running silently: “hairycode 27491 0.0 0.0 599680 156 ?? S 5:24PM 0:00.33 /usr/local/Cellar/rabbitmq/3.1.3/erts-5.10.1/bin/../../erts-5.10.1/bin/epmd -daemon”
  3. And, the grep command you just executed: “hrybacki 35327 1.2 0.0 2432768 596 s000 S+ 2:25PM 0:00.00 grep rabbit --color”

Note: If you see more than one of the “long” processes running you will run into issues. If this is the case stop all RabbitMQ servers

    $ rabbitmqctl stop

and start over. I will provide an example of what can go wrong if there are multiple brokers or Celery daemons running at once.

Launch the Celery daemon

From the project/ directory launch the Celery daemon

    $ celery -A framework.celery.celery worker -l debug

which should give you a daemon monitor with output along the lines of

     -------------- celery@Harrys-MacBook-Air.local v3.0.21 (Chiastic Slide)
    ---- **** ----- 
    --- * ***  * -- Darwin-12.4.1-x86_64-i386-64bit
    -- * - **** --- 
    - ** ---------- [config]
    - ** ---------- .> broker:      amqp://guest@localhost:5672//
    - ** ---------- .> app:         __main__:0x10f5355d0
    - ** ---------- .> concurrency: 4 (processes)
    - *** --- * --- .> events:      OFF (enable -E to monitor this worker)
    -- ******* ---- 
    --- ***** ----- [queues]
     -------------- .> celery:      exchange:celery(direct) binding:celery


    [2013-07-23 15:46:55,342: DEBUG/MainProcess] consumer: Ready to accept tasks!

-A framework.celery.celery worker

informs Celery which app instance to use and that it will be creating workers. Workers take tasks from the queue, process them, and return the result to the message broker.

-l debug

tells Celery that you want it to display log level debug output for testing purposes. Normally you would execute -l info for a log level info output.

Now, let’s make sure we have some Celery workers up and running

    $ ps aux | grep celery --color

Note the concurrency number when we launched the Celery daemon. By default this matches the number of processors, and in turn the number of workers which should have been launched. The grep output from the previous command should leave you with that many lines similar to

    hairycode       37992   0.1  0.4  2495644  33448 s001  S+    3:20PM   0:00.74 /Users/hairycode/git/staging-celery/venv/bin/python /Users/hairycode/git/staging-celery/venv/bin/celery -A framework.celery.celery worker -l debug

Detailed information about launching the Celery daemon can be found here or from the shell

    $ celery --help

Testing with IPython

Note: I am using IPython from the root directory in the code segment below. You could just as easily, well maybe not easily, use the standard Python interpreter or write a test script in Python. But, IPython is awesome. I like awesome things.

Executing our Task

    # import celery
    import celery

    # import our send_email task
    from import send_email

    # call our email function
    result = send_email.delay('', 'all your smtp are belong to us', 'somebody set up us the bomb')


If you look at your Celery daemon you can see the task coming in, being processed, returning the result, and even how long it took to execute. For example the call above gave me the following output

    [2013-07-23 15:48:29,145: DEBUG/MainProcess] Task accepted:[09dad9cf-c9fa-4aee-933f-ff54dae39bdf] pid:39336
    [2013-07-23 15:48:30,600: DEBUG/MainProcess] Start from server, version: 0.9, properties: {u'information': u'Licensed under the MPL.  See', u'product': u'RabbitMQ', u'copyright': u'Copyright (C) 2007-2013 VMware, Inc.', u'capabilities': {u'exchange_exchange_bindings': True, u'consumer_cancel_notify': True, u'publisher_confirms': True, u'basic.nack': True}, u'platform': u'Erlang/OTP', u'version': u'3.1.3'}, mechanisms: [u'AMQPLAIN', u'PLAIN'], locales: [u'en_US']
    [2013-07-23 15:48:30,601: DEBUG/MainProcess] Open OK!
    [2013-07-23 15:48:30,602: DEBUG/MainProcess] using channel_id: 1
    [2013-07-23 15:48:30,604: DEBUG/MainProcess] Channel open
    [2013-07-23 15:48:30,607: INFO/MainProcess] Task[09dad9cf-c9fa-4aee-933f-ff54dae39bdf] succeeded in 1.46279215813s: True

some_task.delay() vs some_task.apply_async()

some_task.delay() is a convenient method of calling your function as it looks like a regular function. However, it is shorthand for calling some_task.apply_async(); apply_async() is a more powerful and flexible method for calling your tasks. Detailed information on both can be located here.

Executing our task — more realistically

The AsyncResult is the Celery object you use to look up the result that the backend (RabbitMQ) holds once the worker (Celery) completes the task. The long string following it is the task_id. More often you won’t hold on to the result and wait for it. Blocking on it (for example with result.get()) would hold up our app until the task had completed. That wouldn’t make much sense, would it? Rather, you will simply call the delay or apply_async function and let your code continue on like this

    # import celery
    import celery

    # import our send_email task
    from import send_email

    # call our email function
    send_email.delay('', 'all your smtp are belong to us', 'somebody set up us the bomb')

Remember, we still have the task id. If you want to check the status or result of what we just submitted you can do so by asking the task queue

    # grab the AsyncResult 
    result = celery.result.AsyncResult('09dad9cf-c9fa-4aee-933f-ff54dae39bdf')

    # print the task id
    print result.task_id

    # print the AsyncResult's status
    print result.status

    # print the result returned 
    print result.result

This is a very basic rundown. If you want much more detailed information on this I would recommend checking out the Calling Tasks section of Celery’s documentation.

Headache Number Two: My Celery daemon is only receiving every other task? Wat.

This little bug took me entirely too long to solve. At some point I started noticing that exactly half of the .delay() calls I was making were permanently in a state of PENDING.

For example, running this

    ###IPython output

    from import send_email

    send_email.delay('', 'all your smtp are belong to us', 'somebody set up us the bomb')

    send_email.delay('', 'all your smtp are belong to us', 'somebody set up us the bomb')

Gave the following output from the Celery daemon

    [2013-07-22 18:18:44,576: DEBUG/MainProcess] Task accepted: tasks.test[0e55bfed-1f05-4700-90fe-af3dba34ced5] pid:7663
    [2013-07-22 18:18:44,583: DEBUG/MainProcess] Start from server, version: 0.9, properties: {u'information': u'Licensed under the MPL.  See', u'product': u'RabbitMQ', u'copyright': u'Copyright (C) 2007-2012 VMware, Inc.', u'capabilities': {u'exchange_exchange_bindings': True, u'consumer_cancel_notify': True, u'publisher_confirms': True, u'basic.nack': True}, u'platform': u'Erlang/OTP', u'version': u'2.8.4'}, mechanisms: [u'PLAIN', u'AMQPLAIN'], locales: [u'en_US']
    [2013-07-22 18:18:44,585: DEBUG/MainProcess] Open OK!
    [2013-07-22 18:18:44,585: DEBUG/MainProcess] using channel_id: 1
    [2013-07-22 18:18:44,586: DEBUG/MainProcess] Channel open
    [2013-07-22 18:18:44,589: INFO/MainProcess] Task[0e55bfed-1f05-4700-90fe-af3dba34ced5] succeeded in 2.0180089473724s: True

0e55bfed-1f05-4700-90fe-af3dba34ced5 was there but af3846a9-4a31-4a8d-99a4-0d990d51ef22 wasn’t.

I restarted my Celery daemon. Same thing.

I restarted my RabbitMQ server. Same thing.

I created an entire new project and followed the First Steps with Celery docs. Same thing.

Confused, I searched around but I could only find one other person who had encountered something similar, and that issue was over a year old. Note: I tried his solution but it didn’t resolve the issue.

The trick was that somewhere along the line I had another set of Celery workers running in the background that were not part of the daemon I had just started running. These workers were taking tasks from the queue and I wasn’t getting them back. I was able to recreate the same bug by having a second instance of RabbitMQ server running.

Remember when I told you to ensure you only had one RabbitMQ server and the correct number of concurrent Celery workers running by checking your processes? This is why. Don’t do this.
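
Why exactly half? When several consumers are attached to the same queue, RabbitMQ hands messages out in turn. The toy simulation below shows the effect with one legitimate worker and one stray. It is pure Python with made-up names and does not touch RabbitMQ or Celery.

```python
from itertools import cycle


def dispatch_round_robin(task_ids, consumers):
    """Hand each task to the next consumer in turn, as a broker would."""
    received = dict((name, []) for name in consumers)
    for task_id, name in zip(task_ids, cycle(consumers)):
        received[name].append(task_id)
    return received


received = dispatch_round_robin(
    ["task-1", "task-2", "task-3", "task-4"],
    ["my_worker", "stray_worker"],
)
print(received["my_worker"])
print(received["stray_worker"])
```

Every task that lands on the stray consumer is one I never see come through the daemon I am watching, which matches the "every other task" symptom.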

Let’s Improve our Setup

Adding logs

Adding logs was pretty straightforward. First, we need to modify our config file to specify where we want our logs:


    # default RabbitMQ broker
    BROKER_URL = 'amqp://'

    # default RabbitMQ backend

    # specify location of log files

Now, we implement logging within the task itself.

After importing the required function, we grab the logger associated with our Celery app

    logger = get_task_logger(__name__)

Then, at the desired point log a custom message to log level info. Note: If you desired to log to another level e.g. debug you would use logger.debug(…)

    logger.info('Sending email from: %r, to: %r' % (fro, to))

The resulting task module looks like:

    import smtplib
    from email.mime.text import MIMEText
    from framework.celery.celery import celery
    # import the Celery log getter
    from celery.utils.log import get_task_logger

    # grab the logger for the Celery app
    logger = get_task_logger(__name__)

    @celery.task
    def send_email(to=None, subject=None, message=None):
        """sends email from hairycode-noreply to specified destination

        :param to: string destination address
        :param subject: subject of email
        :param message: body of message

        :return: True if successful
        """
        # placeholder sender address; substitute your own
        fro = 'YOUR_FROM_ADDRESS'

        # prep message
        msg = MIMEText(message)
        msg['Subject'] = subject
        msg['From'] = fro
        msg['To'] = to

        # log desired message to info level log
        logger.info('Sending email from: %r, to: %r' % (fro, to))

        # send message (placeholder host and credentials)
        s = smtplib.SMTP('YOUR_SMTP_HOST')
        s.login('YOUR_USERNAME', 'YOUR_PASSWORD')
        s.sendmail(fro, [to], msg.as_string())
        return True

And that’s it! After implementing logging, tasks should be adding your messages to their respective log files e.g.:

    [2013-07-23 15:48:29,145: INFO/MainProcess] Sending email from:, to: ''
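
The quotes in that log line come from the %r placeholder, which formats each value with repr(). Here is a standalone sketch using the standard logging module (written for Python 3, and independent of Celery). The logger name and the addresses are made up, and the arguments are passed separately so that formatting only happens if the message is actually emitted.

```python
import io
import logging

# Capture log output in memory so we can look at it afterwards.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("demo_tasks")   # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.debug("dropped: below the INFO threshold")
logger.info("Sending email from: %r, to: %r",
            "noreply@example.com", "user@example.com")

print(stream.getvalue())
```

The debug call produces nothing because the logger is set to INFO, which mirrors the difference between launching the daemon with -l info and -l debug.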


Learning Celery has been… frustrating. The above examples barely begin to scratch the surface of what it’s capable of. It is an incredibly powerful and configurable tool. I would, however, like to see a more responsive community, but I understand we are all busy people. Queuing tasks is a necessity for any major application and I’m beginning to develop a love-hate relationship with Celery. More to follow?



SciPy 2013: Day Four/Five

Day Four: Talking, myself and so many others talking…

I’ve been to several poster sessions since going back to school two summers ago. However, this was the first session that I presented in. My poster was for Scholarly. Scholarly is a personal research project involving the network of scholarly citations that started off as a flop and has grown into a little monster. Feel free to read more about it on my current projects page, visit the GitHub repository, or download the poster. It was nice to see so many people interested in the project. Almost everyone instantly realized the importance this dataset will have and many asked the question, when can we access the data? Unfortunately, I don’t have a solid answer to that question yet. But we are making progress and our data collection servers should be on the web soon! Now, on to what other people spoke about.

Thursday was, well, more talks. By the end of the day I was quite exhausted. While many of them were incredibly interesting, I guess I’ve discovered my limit for sitting and listening to people while I clack away at a computer: three days.

One talk I found of particular interest was Analyzing IBM Watson experiments with IPython Notebook by Torsten Bittner from IBM. I won’t go into all of the details, but the Watson team was able to take ~8,350 SLOC written in Java used for training and testing Watson down to ~220 SLOC in an IPython notebook. And to make this even more impressive, it ran faster in the notebook. And with that, I’ll move into day five.

Day Five: Let the sprints commence!

Sprints? At a software conference? If you’re picturing 100 sweaty programmers running down hallways for fresh coffee you’re a bit off. There is coffee and occasionally sweat. But, very, very little running.

The sprints are an opportunity for open source projects to get people involved regardless of their skill level. It’s a pretty cool experience. You’re surrounded by programmers from all backgrounds, some legendary, most you’ve never heard of before, but all of whom are very approachable and patient.

After attending the morning pre-sprint session all of the projects distributed themselves in various rooms and went to work. I decided to try something a bit different today.

With all of this ranting and raving I’ve been doing about IPython notebooks I decided I needed to see what I could do with it. Here is the result. I was able to successfully integrate the notebook while benchmarking some search queries in a MongoDB instance populated with fake citation data. It wasn’t cumbersome to do so and documenting as I went along was actually quite enjoyable.

Time for a beer.



SciPy 2013: Day Three

Opening Remarks:

With tutorials concluding yesterday, today began the talks. This is the largest SciPy to date. The registration increase of ~75% over last year was easily noticeable at the opening remarks as I sat in a packed room of fellow coders and scientists. Co-hosts Andy and John announced that the primary themes for this year’s conference are reproducibility and machine learning before introducing the keynote speaker Fernando Perez.

Keynote: Fernando Perez of IPython — IPython: from the shell to a book with a single tool – The method behind the madness

As you would expect from Fernando, the talk was fast paced, informative, and enjoyable. The real icing on the cake, however, was the delivery: an IPython Notebook slide show. I’ve already gone into how excited I am about the IPython notebook; I think it will make an awesome medium for teaching CS students. The slide show only further enhances the usefulness of the IPython tool set. -steps down from soapbox- Anyway, I’ve tried to summarize points from Fernando’s talk I found to be of interest.

After a brief set of opening remarks about the amazing spirit of the community and the phases of the research life cycle, Fernando explained some of IPython’s major milestones:

  • 2001 – First version of IPython (it was only 259 lines of code!). Its primary goal was to provide a better interactive Python shell.
  • 2004 – Interactive plotting with matplotlib.
  • 2005 – Interactive parallel computing.
  • 2007 – IPython embedding in Wx apps.
  • 2010 – An improved shell and a protocol to go along with it.
  • 2010 – After 5 attempts, a sixth leads to what we now know as IPython notebook.
  • 2010 – Sharing notebooks with zero-install via nbviewer
  • 2012 – Reproducible research with IPython.parallel and StarCluster
  • 2012 – IPython notebook-based technical blogging
  • 2013 – The first White House hackathon (IPython and NetworkX go to DC)
  • 2013 – IPython notebook-based books: “Literate Computing” Probabilistic Programming and Bayesian Methods for Hackers.

Continuing, Fernando explained many of the lessons he’s learned since starting the project, highlighted alternative use cases written around IPython and IPython notebooks, thanked the community, and gave us some idea of what lies ahead for IPython (1.0 in a few weeks!). Once the video becomes available, I will be sure to add it here, and I highly recommend you watch it!

“The purpose of computing is insight, not numbers” — Hamming ’62



SciPy 2013: Day Two

Tutorial Three: An Introduction to scikit-learn (I) – Gaël Varoquaux, Jake Vanderplas, Olivier Grisel

For a long time I’ve been very curious about machine learning. Up to this point it’s appeared to me much like a mystical unicorn: really cool, but something you never really know much about. This tutorial provided me with an excellent chance to break that mysticism down.

After a brief introduction to scikit-learn and a refresher on numpy/matplotlib we used IPython notebooks to walk through basic examples of what the suite is capable of. We then moved into a quick overview of what machine learning is and some common tactics for tackling data analysis. Now that we were a bit more familiar with the suite itself and machine learning principles, we moved onto more complex examples.

Again using IPython notebooks, we walked through examples of supervised learning (classification and regression), unsupervised learning (clustering and dimensionality reduction), and using PCA for data visualization. We ended the morning session with a couple of more advanced supervised learning examples (recognizing handwritten digits and predicting Boston house prices based on various factors) and an advanced unsupervised learning example in which we analyzed over 20,000 text articles to determine from which of four categories they likely originated.

One note for further research: How much data should be used for train vs test data? What factors play a role in this and are there any common standards or practices which researchers follow?

Tutorial Four: Statistical Data Analysis in Python – Christopher Fonnesbeck

Statistics is an area of interest for me. Combine that interest with Python Pandas and you’ve got an instant winner, right? Not exactly. While the talk was tagged for beginners, it proved to be otherwise.

The speaker clearly had a very strong background in statistics. However, that background didn’t translate into an easy to follow talk. The statistics language was very far above me and most of the room, if my observations were correct. Additionally, the version of pandas he used wasn’t the same as the version in the required packages noted in the talk’s description. This resulted in the majority of us not being able to follow along in IPython notebooks and being forced to watch him on the projector.

Please, don’t mistake this for a whine session. Chris knew his stuff; he was able to answer everyone’s questions and smashed some ‘stump the chump’ attempts without batting an eye. But the talk should have been refined and rehearsed, and the versions of the required packages should have been vetted earlier.

You can’t win them all, right?



SciPy 2013: Day One


This year I am fortunate enough to be able to attend SciPy. SciPy is a Python conference focused on scientific programming. A big shout out to the Center for Open Science for making this trip a possibility. This will be the first of a (hopefully) daily blog series in which I will briefly cover how my day went and any lasting impressions it left me with.

The conference organizers made checking in quick and painless. We were served a breakfast buffet that was surprisingly good. Of particular interest to me were the scrambled egg mini-bagels and lemon poppyseed bread slices. Yum. Breakfast was followed by a series of tutorials which registrants chose in advance.

Tutorial One: Guide to Symbolic computing with SymPy — Ondřej Certik, Mateusz Paprocki, Aaron Meurer

The SymPy tutorial was … interesting. We simultaneously listened to the lecturer discuss and demonstrate common SymPy functions while completing examples in various IPython notebooks that they provided us with. It was fast paced, too fast for me anyway, and we had to skip over a lot of material.

The tutorial docs recommended experience with IPython and ‘basic mathematics.’ However, I was quite surprised how far my definition of basic mathematics was from theirs. Unfortunately, this left me struggling to keep up with the tutorial even from early on. After a mid-session break, we briefly covered Calculus functionality before being introduced to some real world applications. This is where I became utterly lost.

SymPy’s original lead developer showed us several examples of his use of SymPy while preparing his dissertation in Chemical Physics. These included Poisson equations, Fast Fourier Transforms (FFTs), and variation with Lagrange multipliers. Don’t know what some (any) of those are? It’s okay, neither do I. On a side note, they did show an interesting ‘hack’ using JavaScript injection in an IPython notebook which allowed them to manipulate 3D figures.

While the tutorial itself felt a bit unpolished, the instructors knew their stuff. All in all SymPy seems like a really interesting tool which I plan to use. When combined with IPython notebooks I believe it could create very powerful, long lasting notes for a variety of math intensive classes. I’ll be testing this out next semester in Physics.

Tutorial Two: IPython in depth — Fernando Perez, Brian Granger

Anyone who has listened to either Fernando or Brian could have told you that this tutorial was going to be good. It was. They provided a solid tutorial environment with IPython notebooks that kept me feeling like I was actively working with them throughout the entire tutorial. Whenever anyone had a question, they knew how to answer it quickly and concisely.

A few things I found of particular interest:

  1. IPython Notebook: If you haven’t heard of this, click the link and check it out. I’m not kidding. This is a versatile web tool that is incredibly powerful. For some cool examples of what people have done with notebook (including writing a book!) click here.
  2. Awesome help functionality: With IPython’s built-in help functionality [ ?, ??, %quickref, and %magic ] you can quickly get a syntax highlighted help description, the source for a module, or even access a nifty quick reference guide, mostly eliminating the need to pop out of a notebook or console and visit online docs.
  3. Kickass debugger: IPython’s shell is amazing. But I’ve found myself using PyCharm for more advanced bits of code while debugging. After learning about IPython’s magic %debug and %run -d, that may have changed. They provide you with very powerful and easy to use debugging abilities I wasn’t even aware existed.
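
For readers who want a feel for what item 2 describes without opening IPython, the standard library’s inspect module exposes similar information. This is only a rough analogy and not how IPython implements ? and ??; json.dumps is simply an arbitrary function chosen for illustration.

```python
import inspect
import json

# Roughly what `json.dumps?` surfaces in IPython: the signature and docstring.
print(inspect.signature(json.dumps))
print(inspect.getdoc(json.dumps).splitlines()[0])

# Roughly what `json.dumps??` adds: the source code, when it is available.
source = inspect.getsource(json.dumps)
print(source.splitlines()[0])
```

Inside IPython the same lookups are a single keystroke away, which is the convenience the tutorial was highlighting.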

Day one down. Time for sleep.



Two years, too quickly

Today marks two years from my end of service date from the Army. The time has gone by too quickly and a lot has happened, which I will leave out of this post for the sake of brevity. I would, however, like to thank all of the great leaders and Soldiers who helped shape the individual I am today and apologize to any of my Soldiers I may have let down or whose standards I failed to live up to. I will remember you all for as long as I live.

“It doesn’t matter what you do. If you’re a broom sweeper, be the best damned broom sweeper you can be. Everything else will fall into place from there.” – 1SG Pendleton



Heroku + Flask: Simplicity

After playing around with Amazon’s free tier EC2/EBS instances, I found myself becoming frustrated with the amount of work it took to get something simple up and running. Take this with a grain of salt; I am by no means a command line master or an experienced dev ops guy. Shortly after mentioning this to a co-worker, I was directed to Heroku.

It was a dream. Heroku takes care of almost all of the dev ops stuff that can make life a nightmare. Following their getting started with Python documentation was incredibly straightforward. While their pricing seems like it can be a bit high, they have certainly provided a well-refined product to a market of developers who either do not have the interest in configuring environments in the cloud or don’t have the time to dedicate to its maintenance. Well done, Heroku.

Heroku and Flask Setup in Ubuntu 13.04:

This is a brief walk through of setting up and launching a simple Heroku environment and Flask server. It follows closely with the official Heroku Python docs with a couple of my personal notes. Enjoy!

  1. Sign up for Heroku.
  2. Download the Heroku toolbelt:
    $ sudo su
    $ wget -qO- | sh

  3. Log in to Heroku. If prompted to generate a key, enter yes.
    $ heroku login

  4. Make a directory to hold your app, create a virtual environment, and activate it.
    $ mkdir /path/to/your/app
    $ cd /path/to/your/app
    $ virtualenv venv --distribute
    $ source venv/bin/activate

  5. Install Flask, our framework, and gunicorn, our web server.
    $ pip install Flask gunicorn

  6. Now, create a very simple hello world application.
    $ vi

    import os
    from flask import Flask

    app = Flask(__name__)

    @app.route('/')
    def hello():
        return 'Hello World!'

  7. Next, we need to create a Procfile, which states which command should be called when the web dyno starts; Gunicorn in our case.
    $ vi Procfile

    web: gunicorn hello:app

  8. Finally, start the process using Foreman from the command line and verify it’s up and running.
    $ foreman start
    $ python
    >>> import requests
    >>> response = requests.get('')
    >>> response.text
    u'Hello World!'
    >>> exit()

  9. Heroku requires a requirements.txt file in the root of your repository to identify it as a Python application. Fortunately, pip does this for us. From the root directory:
    $ pip freeze > requirements.txt
    $ cat requirements.txt

  10. Create and set up your .gitignore file to disregard files we don’t want being uploaded to our repository.
    $ vi .gitignore

    venv
    *.pyc

  11. Next, initialize the git repo, add your app to it, and commit it.
    $ git init
    $ git add .
    $ git commit -m "Initial commit of my Heroku app."

  12. Push your repo to Heroku and launch your app.
    $ heroku create
    $ git push heroku master
    *Note: If you see something like: ‘Push rejected, no Cedar-supported app detected’, verify that your requirements.txt exists, is complete, and is in the root of your application directory structure.*
    $ heroku ps:scale web=1

  13. Visit your app!
    $ heroku open

And that’s basically it. Please visit the official Heroku Python docs for more information.
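
The Procfile entry from the walkthrough packs a few pieces of information into one line, so here is a minimal sketch of how it breaks down. It assumes the Flask code was saved in a module named hello (the file name is not shown above, so that is an assumption), and the helper names parse_procfile_line and split_app_target are hypothetical, written only for this illustration. Neither Heroku nor Gunicorn necessarily parses the line this way.

```python
def parse_procfile_line(line):
    """Split a Procfile entry into (process_type, command)."""
    process_type, _, command = line.partition(":")
    return process_type.strip(), command.strip()


def split_app_target(target):
    """Split a 'module:attribute' target into its two parts."""
    module_name, _, attribute = target.partition(":")
    return module_name, attribute


# Assumed entry: the Flask object `app` lives in a module called `hello`.
process_type, command = parse_procfile_line("web: gunicorn hello:app")
program, target = command.split()
module_name, attribute = split_app_target(target)

print(process_type)  # the dyno type Heroku starts
print(program)       # the server that gets launched
print(module_name)   # the Python module Gunicorn imports
print(attribute)     # the application object inside that module
```

In other words, the part before the first colon names the process type, and the part after the second colon names the object in your module that the web server should serve.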

A few useful commands:

  • heroku logout: logs you out
  • heroku ps:scale web=0: stops your web dyno
  • heroku run python: launches an interactive Python shell


Selenium: Initial thoughts + code

I spent the greater part of a day getting familiar with Selenium. Selenium is, as per its homepage, a suite of tools for automating web browsers. So, why would I want to bother automating browsers? Excellent question.

Selenium provides developers with an easy way to automate testing common use cases. For example, if I want to make sure that logging in works when a user hits the enter key after entering their credentials, in Firefox, or Chrome, or IE, I can do that. Furthermore, combined with tools like Sauce Labs, I can take this one step further. I can write one test and now I’m testing that my code runs as expected (or discovering it doesn’t) on up to 128 different browser and version combinations! That is powerful.

Continuing on with the aforementioned login example, let’s dig into some actual code.
This is an example of an early version of a test I wrote in Python using Selenium’s WebDriver.

First, unittest and the relevant Selenium utilities are imported: webdriver and Keys. The former allows Selenium to drive a native browser, and Keys provides constants for special key strokes such as Enter.

import unittest

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class AccountProfileTest(unittest.TestCase):

After subclassing unittest.TestCase, the setUp and tearDown methods allow us to prepare the environment before each test and take care of any necessary cleanup after it has run. The setUp method is where we will be focusing, as it is where the login is performed.

First, test data is declared that will be used during user creation. Next, we launch a Firefox instance and direct it toward the desired url. The implicitly_wait method tells the driver to keep polling for a set amount of time when it tries to ‘find’ elements, which gives the page a chance to load and prevents the driver from failing to locate elements that have not appeared yet. It should be noted that although Selenium is pointed at a local web server, this test was constructed for

    # initialize a Firefox webdriver
    def setUp(self):
        # setup browser and test user data
        self.user_email = ''
        self.user_password = 'testtest'
        self.user_name = 'Testy McTester'
        self.driver = webdriver.Firefox()
        # ensure driver allows page source to load before continuing
        # load OSF homepage

Here is where it gets interesting. Selenium is a browser automation tool, right? Right. So how do we replicate a mouse click on a specific button? Well, first we need to find it. As an aside, I _highly_ recommend using Firebug or a similar tool if you decide to play with Selenium and manually walk through test development. After first locating the element you want to click in the source, we use one of Selenium’s find_element_by_* methods to focus on it. Selenium provides many find methods, such as id, css_selector, link_text, and xpath, to name a few. Furthermore, you have the option of either find_element_by_*, which returns the first matching element, or find_elements_by_*, which returns a list of all elements matching the pattern. In this example I focus on xpath but give examples of grabbing both a single element and a list.

The first find method I call is for a link on the page directed toward the ‘/account’ url. Clicking it with a mouse is emulated by chaining the click method onto the end of the call.

        # load login page

After the page has loaded (as per driver.implicitly_wait() in the setUp method), the driver locates and assigns a list of username and password fields to username_fields and password_fields respectively. You may have noticed that find_elements_by_xpath was used here. find_element_by_* would have worked all the same but I thought it would be nice to include an example.

        # grab the username and password fields
        username_fields = self.driver.find_elements_by_xpath('//form[@name="signin"]//input[@id="username"]')
        password_fields = self.driver.find_elements_by_xpath('//form[@name="signin"]//input[@id="password"]')

Now that we have a list of username and password fields, how do we enter Testy McTester’s login credentials? Selenium makes this easy for us with the send_keys method. As with the simulated mouse click, once you have located the element, call the appropriate method, send_keys, with whatever text you want to type. Here, we point at the first username and password field in each list and send the user_email and user_password values declared in setUp.

        # enter the username / password
        username_fields[0].send_keys(self.user_email)
        password_fields[0].send_keys(self.user_password)

Finding the ‘sign in’ submit button is a bit trickier. First, we must find all buttons that have the ‘submit’ type. Next, we iterate through the returned list and find the button whose text is ‘sign in.’ Once we have it, we can simulate clicking it with a mouse using the click method.

        # locate and click login button
        submit_buttons = self.driver.find_elements_by_xpath('//button[@type="submit"]')
        submit_button = [b for b in submit_buttons if b.text.lower() == 'sign in'][0]
        submit_button.click()
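
The filtering step can be tried on its own without a browser. In this sketch FakeButton is a hypothetical stand-in for a Selenium element that models only the text attribute, and pick_button is a made-up helper rather than anything Selenium provides.

```python
from collections import namedtuple

# FakeButton is a hypothetical stand-in for a Selenium WebElement;
# only the .text attribute used by the filter is modelled.
FakeButton = namedtuple("FakeButton", ["text"])


def pick_button(buttons, label):
    """Return the first button whose text matches label, ignoring case."""
    wanted = label.lower()
    return next((b for b in buttons if b.text.lower() == wanted), None)


buttons = [FakeButton("Register"), FakeButton("Sign In"), FakeButton("Cancel")]

match = pick_button(buttons, "sign in")
missing = pick_button(buttons, "log out")

print(match)
print(missing)
```

Returning None when nothing matches is a design choice for this sketch: indexing an empty list with [0] raises an IndexError, whereas a None result lets a test fail with a more descriptive assertion.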

After setUp has completed, the first unittest below will be executed and any assertions declared will be tested. After it has completed the tearDown function will execute. Here, it simply closes the Firefox instance deleting any cookies and clearing the cache to ensure that if setUp is called again it will have a clean slate.

    # close the Firefox webdriver
    def tearDown(self):
        self.driver.close()

And that’s it. This is a very simple example of what Selenium can do and by no means shows anywhere near all of its potential. Not only is Selenium an incredibly powerful tool but it is really fun to work with as well. Manually walking through each step in the browser with Firebug provided me with a pleasant escape from the standard code, code, code.

The current version of this test, and many, many more are up on GitHub and are free for use. My co-workers and I would also appreciate any feedback you may have!