API Testing with Flask: POST

Have you ever tried to test POST requests to an API written with Flask? Today I was going through an old code base writing unittests that should have been written eons ago when I ran into a snag. The issue stemmed from trying to manually set the Content-Type header.

But before we get to that, let me show you the skeleton of the Flask route I was testing:

import json

from flask import Flask, Response, request

app = Flask(__name__)


@app.route('/someendpoint', methods=['POST'])
def some_endpoint():
    """API endpoint for submitting data to

    :return: status code 405 - invalid JSON or invalid request type
    :return: status code 400 - unsupported Content-Type or invalid publisher
    :return: status code 201 - successful submission
    """
    # Ensure post's Content-Type is supported
    if request.headers['content-type'] == 'application/json':
        # Ensure data is a valid JSON
        try:
            user_submission = json.loads(request.data)
        except ValueError:
            return Response(status=405)
        # ... some magic stuff happens
        if everything_went_well:
            return Response(status=201)
        else:
            return Response(status=405)

    # User submitted an unsupported Content-Type
    else:
        return Response(status=400)

Ignoring the magic, everything seems in order. So, let’s go ahead and write a quick unittest that posts invalid JSON.
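For context, self.app in these tests is Flask’s built-in test client. A minimal sketch of the setUp that creates it (assuming app is the Flask instance from the application under test) might look like:

    def setUp(self):
        # create Flask's built-in test client for the app under test
        app.config['TESTING'] = True
        self.app = app.test_client()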

    def test_invalid_JSON(self):
        """Test status code 405 from improper JSON on post to raw"""
        response = self.app.post('/someendpoint',
                                data="This isn't a json... it's a string!")
        self.assertEqual(response.status_code, 405)

Cool, let’s run it!

    Failure
    self.assertEqual(response.status_code, 405)
    AssertionError: 400 != 405

A quick glance at the code and I realize I’ve forgotten to set the Content-Type header! Wait. How do I set the Content-Type? That’s a good question. After searching around for people who had run into similar problems, this is what I came up with:

    def test_invalid_JSON(self):
        """Test status code 405 from improper JSON on post to raw"""
        response = self.app.post('/someendpoint',
                                data="This isn't a json... it's a string!",
                                headers={'content-type': 'application/json'})
        self.assertEqual(response.status_code, 405)

Let’s try this again.

    Failure
    self.assertEqual(response.status_code, 405)
    AssertionError: 400 != 405

Hmmm, okay. Next, I decided to inspect the request coming into the flask app and found something odd in request.headers:

    Host: localhost
    Content-Length: 10
    Content-Type: 

Why is the Content-Type empty? Another search hinted at another possible solution: why not just build the headers dict inline?

    headers=dict(content-type='application/json') # But that's not right; '-' isn't allowed in a keyword argument.

By this point I had become agitated. Neither the Flask docs nor various forums had been of much use. Then I stumbled across this.

Perhaps I missed something in the docs. Either way, I learned that you can hand the Content-Type to the post method as the content_type parameter. It works and it looks much cleaner. Let’s revise my initial unittest accordingly:

    def test_invalid_JSON(self):
        """Test status code 405 from improper JSON on post to raw"""
        response = self.app.post('/someendpoint',
                                data="not a json",
                                content_type='application/json')
        self.assertEqual(response.status_code, 405)

And, let’s run this one last time and look at the request as it comes through.

    Tests passed
    Host: localhost
    Content-Length: 10
    Content-Type: application/json

Much better. Now, back to testing!

-H.


Resumes…

Currently, I am attempting to trim my CV down to a length appropriate for a resume. To be honest, it’s not as easy as I would have thought. Wouldn’t it be simpler if a resume could just read the following:

Favorite language: Python. Why? It’s awesome, and so is the community that’s built up around it.

Favorite IDE: None. I like Vim (shhhh, Emacs lovers). Again, why? It’s quick, powerful, and I can use it on any of my machines or while I’m remoted into some server. I value consistency.

Favorite OS: Hands down, any of the Linux derivatives. Does this require justification? Nope.

Favorite place to code: Anywhere with a nice view and lots of natural light.

Any other questions? Let’s discuss them over a coffee or tea. And don’t forget to check out my GitHub account!

-H.


Everyone makes mistakes: Even the web’s inventor

Earlier today, while reading over Interactive Data Visualization for the Web, I came across an interesting fact I would like to share. We all know the preamble for web addresses, http://. I can’t fathom how many times I’ve typed it myself. Apparently the creator of the World Wide Web, Tim Berners-Lee, regrets making the double slash part of the standard, as he stated during an interview with the New York Times. To summarize, he attributed it to a programming convention of the time.

I wonder how many double slashes have been typed, printed, and read since the standard’s creation. But hey, there is always room for improvement, right?

As an aside, I highly recommend the aforementioned book, Interactive Data Visualization for the Web. Written by Scott Murray, it provides an excellent introduction to data visualization with D3 and basic web programming. What makes this even cooler? O’Reilly is offering an online version of the book that includes interactive examples, and it’s free! Click the previous link to view the book directly or visit this O’Reilly page for more information about the paper version!

“First make something work. Then, make it work better.” – Anonymous

-H.


The Division Of Generation Y

hairycode:

Many of my private thoughts put down as words.

Originally posted on Thought Catalog:

America’s Generation Y can be divided into two distinct groups: Those who served in Iraq and/or Afghanistan, such as myself, and those who didn’t. Taking an educated guess, I assume a lion’s share of the readership of Thought Catalog are liberal arts degree bearing, student-loan debt ridden types who think those who joined the military were too stupid to go to college and were unaware cogs in the political war machine run by evil multi-national corporations with the goal of maximizing profit and exploiting the lower class. In turn, we think you’re a bunch of overly sensitive, pretentious, hyper-liberal pussies, so its even. Now, let’s begin to gain an understanding of each other’s perspective.

Our memories of our formative years are quite different. You headed out into early adulthood going to community college or university, be it full-time or part-time. You may have gotten a student…


DevTalks: Bridging the gap between student and developer

As my fifth semester cruises along at what feels like mach speed, I find myself neck-deep in finite state machines, language grammars, parse trees, and wave optics. Despite putting in more hours than in any previous semester (I started tracking out-of-class work hours with Klok2), I feel more prepared to handle it than I would have thought possible. But why do I feel so ready to tackle mountains of homework and countless hours spent at the whiteboard? I think the primary reason is that over the past couple of summers I’ve had two amazing internships that allowed me, pushed me even, to use the concepts I learned at school in the real world. This has given me a much deeper understanding of, and appreciation for, school. I can recall the first two semesters and how pointless what I was learning felt at times. Without my practical experience from internships and personal projects, would I still feel that way? Probably.

This brings me to a problem I have noticed among my peers: a severe lack of motivation. More often than not, I watch people with glazed-over eyes in my classes who, at worst, do just the minimum to pass and, at best, do the minimum to earn an A. What are they missing? They don’t have the drive they need. But I’m trying an experiment this semester to see if I, with the help of some very smart people, can change this.

Introducing Developer Talks (DevTalks for short). From the DevTalks frontpage:

Would you like to learn something new, meet like-minded individuals, or give back to the developer community? If so, you’re in the right spot. DevTalks is a developer driven community whose goals are to promote the sharing of knowledge and encourage collaboration among developers regardless of their background or level of expertise.

My plan is to round up some of the more motivated students and local developers in the area and, throughout the semester, put on a couple of engaging 1-2 hour talks, each followed by a 1-2 hour breakout session where attendees work together in small groups to improve upon or re-implement what they’ve learned. I believe that once students see their peers leading talks and working with local developers, they will become more inclined to participate.

There are two primary benefits I see coming from this. First, students have a lot to gain from learning from and working with developers. Many questions that simply can’t be answered in the classroom can be resolved on the spot. And, let’s not forget, there is just no replacement for real experience. Second, local developers and managers can use these meetings to spot potential interns and employees locally. Anyone can submit a resume. However, leading one of these talks can demonstrate a much deeper level of understanding and leadership potential than can ever be conveyed on a sheet of paper.

So far, I have received very positive feedback from my peers as well as local developers. Several people wanting to give talks have already reached out, and a local coworking space, Studioboro, has offered to host our first talk on October 1st.

If you would like more information, please visit the landing page. If you would like to attend a talk, give a talk, donate (if you feed them, they will come), or just share your insights, please feel free to email me!

“Be the change you wish to see.” ~ Mahatma Gandhi

-H.


Working the storage scalability problem

The environment and our problem

As a thought experiment, let us pretend that we have a website that hosts user projects. Users are allowed to upload files to their projects. Individual files are limited to 100 MB, but there are no file type restrictions. Some time later, our users start asking for version control support. We note that the majority of files submitted by our users are either text files or source code and decide that Git is a good option for implementing a version control system.

Fast forward another couple of months. Our user base has continued to grow steadily and their projects are getting larger and larger. We realize that we have a problem: our server is going to run out of local disk space soon. What’s the best workaround for this while sticking to our shoestring budget? That is a very good question.

Is there a simple solution?

When I was asked to think of a solution, my mind immediately jumped to Amazon Web Services. Specifically, moving our server to an Elastic Compute Cloud (EC2) instance connected to an Elastic Block Store (EBS) volume for storage. Combining the power and resource scalability of an EC2 instance with an easy-to-use and fast EBS volume could solve the problem. However, after a bit of reflection, some very big issues started to arise.

EBS volumes can only be mounted by one EC2 instance at a given time.

Mounting to a single EC2 instance makes scaling our EC2 setup horizontally problematic. If we need several EC2 instances communicating with the EBS volume, we have to set up a middleman to act as a go-between for the EC2 instances doing the work and the EBS volume storing our Git repositories. So what, right? Well, now we have the added expense and maintenance (however minimal they may be) of another EC2 instance, and we’ve introduced a single point of failure. But those aren’t our only problems.

There is no way to easily resize EBS volumes.

Let’s assume for a second we went ahead and implemented the EC2 with EBS volume setup. We have created a 500 GB EBS volume and mounted it to our EC2 instance. All is well for another few months, until our estimates show that our Git repositories will surpass 500 GB within two weeks. In order to increase the size of our EBS volume, we have to create an additional EBS volume at a new, larger size, let’s say 1 TB, mount it to our EC2 instance, copy everything over from the original EBS volume, and finally destroy the old volume. Suddenly I’m having flashbacks to my professor discussing how to resize arrays in my data structures class. And don’t forget about that shoestring budget we’re on.

At $0.10 per GB, we’ve just increased our bill from $50/month to $100/month, largely for space that we aren’t using. Furthermore, we don’t even know when it will be used, and as soon as we approach the 1 TB mark, it will be time to start the process all over again. If we follow the same model of doubling our size, we will again double our costs for unused space. Why not increase the size of our EBS volume in smaller amounts more frequently? It doesn’t solve the problem; we are still purchasing unused space while adding complexity to our system architecture.

Is there an alternative solution? Perhaps a solution that will alleviate our bottleneck and resizing issues? There might be. And it could be much, much cheaper.

Introducing Amazon Simple Storage Service (S3)

To quote Amazon, “S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.” Basically, S3 provides us with a way to read, write, and delete files of almost any size (up to 5 TB per file) very cheaply. At a mere $0.076 per GB for the first 49 TB, and even less if we need more storage, we see some solid savings. Before adding our input/output costs, we have just cut storage costs to roughly 3/4 of what an equivalent EBS volume would have run us.
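To put numbers on it: the 500 GB of repositories from earlier would cost roughly 500 GB × $0.076/GB ≈ $38/month on S3, versus 500 GB × $0.10/GB = $50/month on EBS, before any request or transfer charges.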

What about resizing?

S3 is truly scalable. There is no limit to how large our S3 account can become or how many files we can store in it. Furthermore, we never need to worry about logging into our AWS account and increasing our allotted size, or any of that nonsense about copying data from one volume to another; it is all handled automatically, allowing us to focus on what’s important: our data.

Okay, and the bottleneck?

Again, our problems have been resolved. There is no need for a middleman connecting S3 to our servers. We can access S3 from anywhere, on any server, through API calls. My language of choice is Python, so I have to give a shout-out to boto. The good folks who developed boto have provided us with an incredibly simple way to interact with S3. Don’t take my word for it though; Amazon wrote its own tutorial on how to use it.
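To give a feel for the API, here is a minimal sketch using boto’s S3 interface (the bucket name is hypothetical, and credentials are assumed to be configured in your environment or ~/.boto):

    import boto
    from boto.s3.key import Key

    # connect using credentials from the environment or ~/.boto
    conn = boto.connect_s3()
    bucket = conn.create_bucket('hairycode-example-bucket')

    # write a small file to S3...
    key = Key(bucket)
    key.key = 'hello.txt'
    key.set_contents_from_string('Hello, S3!')

    # ...and read it back
    print key.get_contents_as_string()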

What’s the catch?

There’s always a catch, right? Right. S3 isn’t exactly designed to work the way we want it to for our problem. Unlike a storage device such as a hard drive, a USB stick, or even an EBS volume, there is no file system. In the simplest terms possible, we can think of S3 as a file dump like Google Drive or Dropbox. Only, instead of uploading files through your browser or a desktop application, we access it via API calls. These uploaded files are stored in “buckets,” which can be thought of as directories in an abstract sense. How is this a problem?

Well, let’s say we want to store a Git repository on S3 that looks like:

    /.git/
    /.git/branches/...
    /.git/hooks/... 
    ...

In short, we can’t. There are some solutions using file naming conventions that can be interpreted as directory structures, but to be quite frank, they seem very “hackish.” I don’t like “hackish” solutions in production environments.
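For illustration, the “hackish” approach amounts to encoding the directory structure in the key names themselves; continuing the boto sketch from above (the repository name is made up):

    # keys are flat strings; the slashes only look like directories
    key = Key(bucket)
    key.key = 'myrepo/.git/HEAD'
    key.set_contents_from_string('ref: refs/heads/master')

    # "listing a directory" is really just a prefix query over key names
    for entry in bucket.list(prefix='myrepo/.git/'):
        print entry.name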

A possible workaround

I did some digging around and found that, as is always the case, I wasn’t the first person to try to store Git repositories on S3. Some very smart people took a very interesting approach using a Filesystem in Userspace (FUSE). S3FS, or “FUSE over Amazon,” allows us to mount S3 buckets as a local filesystem from which we can read and write transparently. This sounds pretty cool, but just how fast can a fake file system layered on top of S3 and mounted locally be? Let’s find out.

Benchmarking S3FS with Git

Consistency concerns with S3

S3 has an eventual consistency model. This means that after a file is written, updated, or deleted, the change will eventually be visible to all parties; however, there is no guarantee on how long that may take. My personal experience has shown changes are generally visible within a second or two.

In 2011, S3 in the EU and US-West regions implemented read-after-write consistency, meaning newly written objects should be visible to all parties immediately after the write occurs. Our use case for S3 is to build a filesystem on top of it on which Git repositories can be created, letting buckets represent the top-level directories of our repositories.

Git very rarely deletes anything. Instead, it either adds a new hash and snapshot or a pointer to the previous, unmodified version of a file. Since that workflow is dominated by writing new objects, read-after-write consistency covers most of what we need, so I chose us-west, at the cost of slightly more expensive rates. Time to benchmark.

The setup

I conducted all of my benchmarks on Ubuntu 12.04. Identical setups were installed on a VirtualBox VM, running locally, as well as on an EC2 instance. I will spare you the installation details. If you would like specific installation instructions, I would recommend checking out this blog post.

After installing S3FS and setting up a test bucket, I created the following file structure:

    /test1.py
    /test2.py
    /seconddir/
    /seconddir/test3.py

Then, each of the following actions was executed 10 times (a rough timing sketch follows the list):

  1. initializing the git repository
  2. adding all files to the repository
  3. committing the new changes to the repository
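A rough sketch of the kind of timing harness used for this sort of measurement; the /mnt/s3fs-test mount point and the use of subprocess are my assumptions, not the exact setup:

    import subprocess
    import time

    # hypothetical path where the S3FS bucket is mounted
    REPO_DIR = '/mnt/s3fs-test'

    def timed(cmd, runs=10):
        """Run a shell command in REPO_DIR `runs` times and return the total wall time."""
        total = 0.0
        for _ in range(runs):
            start = time.time()
            subprocess.check_call(cmd, shell=True, cwd=REPO_DIR)
            total += time.time() - start
        return total

    print 'git init:   %.1fs' % timed('git init')
    print 'git add:    %.1fs' % timed('git add .')
    print 'git commit: %.1fs' % timed('git commit --allow-empty -m "bench"')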

Surprisingly poor results

WITH caching FROM local machine ON Oregon

  • Git init times: 130s
  • Git add times: 15.5s
  • Git commit times: 38.4s

WITH caching FROM ec2 instance ON Oregon

  • Git init times: 103.2s
  • Git add times: 12.2s
  • Git commit times: 36.7s

WITH caching FROM local machine ON Oregon AND max_stat_cache_size=10000000

  • Git init times: 138s
  • Git add times: 16s
  • Git commit times: 39s

WITH caching FROM ec2 instance ON Oregon AND max_stat_cache_size=10000000

  • Git init times: 105.8s
  • Git add times: 12.4s
  • Git commit times: 37.7s

As you can see, initialization, add, and commit times were terribly slow. Although operations run on the EC2 instance were noticeably faster than those run on my local machine, they are still nowhere near quick enough. On the bright side, pulling files from S3 was rather fast. Every pull with the tiny repositories in my example took less than one second. But the slow write times weren’t the only thing I found troubling.

During this process I also learned that S3FS mounts require root access, that mounts were not always stable, and that they would disconnect unpredictably. Unfortunately, this doesn’t seem to be a workable solution either.

Other Solutions

JGit

JGit is a Java implementation of Git hosted by the Eclipse Foundation. I found this article describing how to use JGit to store repositories on S3. However, I have no experience with JGit or its community, and that makes me hesitant to use it without a lot more research.

Caching?

Remember how I said that pulling files from S3 was pretty fast? Actually, it was so strikingly fast compared to writing that it stuck in my head for a while and got me thinking about an alternative solution. What if we used the server’s local storage as a caching layer that holds our Git repositories while a project still needs immediate access, and then used S3 as external storage for neatly packed repositories until they are needed again?

That sounds like an interesting, fun project; more to follow. Time to do some more brainstorming.

"If at first you don't succeed, 
Try, try, try again." - William Edward Hickson

-H.

GitHub offers free student accounts?!

Last fall I started using GitHub to help me maintain and disseminate projects I work on. Not sure what GitHub is? From its main page, GitHub offers “powerful collaboration, code review, and code management for open source and private projects.”

Built on top of the incredibly powerful version control system Git, GitHub gives users a friendly, intuitive wrapper around Git’s somewhat confusing workflow, on the web (or the desktop!).

I could go on and on about how much I love Git or how GitHub has made working with peers/co-workers much less stressful, code reviews a pleasure, and handling complicated branches a breeze. It is pretty much awesome, right? Right. But it just got even MORE awesome!

One of my co-workers informed me she had signed up for a free student upgrade with GitHub. By default, anyone can sign up and maintain as many public repositories as they like. However, in order to maintain private repositories, one needs to sign up for at least a micro plan. Unaware of a student discount and always in search of a deal (every penny counts as a college student!), I wrote the GitHub support staff a quick email about my recent revelation.

In less than 24 hours, the very friendly (and humorous) folks at GitHub had replied and upgraded my account to a free micro plan for two years! That’s a total savings of $168 (assuming prices don’t go up between now and then). If only more companies were as student friendly. I tip my hat to the folks at GitHub. I will be sure my local ACM chapter, peers, and developer friends know about this.

For more information about GitHub’s student plan, you can visit: https://github.com/edu

Additionally, I highly recommend reading more about Git itself. The good people who created it were kind enough to post a detailed, easy-to-read, and free book online here. If you are a book-in-hand kind of person, you can grab one from Amazon too!

Thanks, GitHub!

-H.


First Steps with Celery: How to Not Trip

Recently, I was tasked with integrating a task queue into a web framework at work. For the purposes of this post, I would like to note that I am operating with Python 2.7.5, Flask 0.9, Celery 3.0.21, and RabbitMQ 3.1.3. This post was written using IPython 0.13.2 in an IPython notebook.

Now, I’ve never implemented a task queue before, and boy did that ever make this difficult. A quick search showed that Celery was the main player in the Python task queue arena.

Before diving into the code base at work, I set up a virtualenv and followed Celery’s First Steps with Celery tutorial. It was easy, as was the Next Steps tutorial. I would go so far as to say they were too simple: when I went to apply my freshly earned skills to my code base, I ran into a series of walls. Unfortunately, I didn’t have any luck pinging either Celery’s IRC channel, #celery, or their Google group.

But, eventually I figured it out. I’m writing this so that you will (hopefully) avoid similar frustrations. Enjoy!

Picking a Broker

Celery requires a message broker. The broker acts as a middleman, passing messages to and from the Celery workers, which in turn process tasks as they receive them.

Celery recommends using RabbitMQ. I opted for this, as my knowledge in this area is limited and I assumed it would likely have the most thorough and robust documentation.

Installing RabbitMQ in Ubuntu is easy:

    $ sudo apt-get install rabbitmq-server

Installing it on a mac was also rather simple:

    $ brew update
    $ brew install rabbitmq

    # update your path in ~/.bash_profile or ~/.profile with
    PATH=$PATH:/usr/local/sbin

Note: A co-worker ran into issues installing RabbitMQ via homebrew. To resolve this he followed the standalone mac installation instructions here.

Once installed, starting the server is as simple as:

    $ rabbitmq-server
    # or you can start in the background with
    $ rabbitmq-server -detached

And you can stop it with:

    $ rabbitmqctl stop

Installing Celery

Installing Celery was very simple. From within your virtualenv (you should be using virtual environments!):

    $ pip install celery

Setting up Celery config, Celery daemon, and adding ‘tasks’

The steps below are a bit more convoluted than the aforementioned tutorial provided by the Celery team. This is meant to be more of a comprehensive ‘real world’ example. If you would like something simpler, please go here.

Project Structure:

    project/
    project/celeryconfig.py
    project/framework/celery/celery.py
    project/framework/email/email_tasks.py

Celery config — celeryconfig.py

    # config file for Celery Daemon

    # default RabbitMQ broker
    BROKER_URL = 'amqp://'

    # default RabbitMQ backend
    CELERY_RESULT_BACKEND = 'amqp://'

There are a couple of things to note here. First, we are using RabbitMQ as both the broker and the backend. Wait, what is the backend? The backend is the resource Celery uses to store and return the results of completed tasks. Second, you may be wondering what amqp is. AMQP (the Advanced Message Queuing Protocol) is the open messaging protocol that RabbitMQ implements. More information on it can be located here.

More information on celery configuration and defaults can be found in the Celery docs.

Celery daemon: Preparing our daemon — celery.py

    from __future__ import absolute_import

    from celery import Celery

    # instantiate Celery object
    celery = Celery(include=[
                             'framework.email.email_tasks'
                            ])

    # import celery config file
    celery.config_from_object('celeryconfig')

    if __name__ == '__main__':
        celery.start()

The two commented portions here can be a bit confusing.

    celery = Celery(include=[
                             'framework.email.email_tasks'
                            ])

Here we are instantiating a Celery object and handing it a list containing the dotted module path (relative to where you start your Celery daemon!) of every module containing Celery tasks.

    celery.config_from_object('celeryconfig')

Next, we are telling that newly instantiated Celery object to import its configuration settings from celeryconfig.

Headache Number One: Celery and relative imports

I’m sad to admit that it took me 15 minutes to figure out why I didn’t need celeryconfig.py in the same directory as my celery.py. So, read this and learn from my stupid mistake.

Again, I want to emphasize everything is relative to where the Celery daemon is launched.

  • Our Celery daemon will be launched from /
  • The config file is located at /celeryconfig.py
  • So the daemon imports the config simply as: celeryconfig
  • The module containing our tasks, however, is located several directories deep: /framework/email/email_tasks.py
  • So the daemon refers to it by its dotted path: framework.email.email_tasks

Creating a task: Let’s queue up some emails! — email_tasks.py

    import smtplib
    from email.mime.text import MIMEText

    def send_email(to=None, subject=None, message=None):
        """sends email from hairycode-noreply to specified destination

        :param to: string destination address
        :param subject: subject of email
        :param message: body of message

        :return: True if successful
        """
        # prep message
        fro = "hairycode-noreply@hairycode.org"
        msg = MIMEText(message)
        msg['Subject'] = subject
        msg['From'] = fro
        msg['To'] = to

        # send message
        s = smtplib.SMTP('mail.hairycode.org')
        s.ehlo()
        s.starttls()
        s.ehlo()
        s.login('YOUR_USERNAME', 'YOUR_PASSWORD')
        s.sendmail(fro, [to], msg.as_string())
        s.quit()
        return True

Making this function into a task is as simple as importing our Celery object and adding a decorator (almost).

Recall that when we instantiated our Celery object we handed it a list of relative module paths. One of those pointed to this file: ‘framework.email.email_tasks’. When Celery starts, it will comb over the modules in that list and look for

    @celery.task

So, let’s go ahead and modify our function to meet the spec.

    from email.mime.text import MIMEText

    # import our Celery object
    from framework.celery.celery import celery

    # add the decorator so it knows send_email is a task
    @celery.task
    def send_email(to=None, subject=None, message=None):

        # code removed for brevity

If everything else is in order, your app will be able to add tasks to the queue by calling either the .delay() or the .apply_async() method; a sketch of what that might look like from a Flask view follows. But before we actually run anything, let’s make sure our RabbitMQ server and Celery daemon are up and running.
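Here is a minimal sketch of firing the task from a Flask view; the /signup endpoint, form field, and message text are all hypothetical:

    from flask import Flask, Response, request

    from framework.email.email_tasks import send_email

    app = Flask(__name__)

    @app.route('/signup', methods=['POST'])
    def signup():
        # queue the welcome email; the request returns without waiting on SMTP
        send_email.delay(request.form['email'],
                         'Welcome!',
                         'Thanks for signing up.')
        return Response(status=201)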

Testing Our New Task

Launch RabbitMQ

Launch your RabbitMQ server in the background from the shell

    $ rabbitmq-server -detached

You can ensure it’s running in the background by inspecting your processes

    $ ps aux | grep rabbit --color

Which should yield three things:

  1. A very, very long output (this is the rabbitmq-server we just launched)
  2. The RabbitMQ daemon that is always running silently: “hairycode 27491 0.0 0.0 599680 156 ?? S 5:24PM 0:00.33 /usr/local/Cellar/rabbitmq/3.1.3/erts-5.10.1/bin/../../erts-5.10.1/bin/epmd -daemon”
  3. And, the grep command you just executed: “hrybacki 35327 1.2 0.0 2432768 596 s000 S+ 2:25PM 0:00.00 grep rabbit --color”

Note: If you see one or more additional “long” processes running, you will run into issues. If this is the case, stop all RabbitMQ servers

    $ rabbitmqctl stop

and start over. I will provide an example of what can go wrong if there are multiple brokers or Celery daemons running at once.

Launch the Celery daemon

From the project/ directory launch the Celery daemon

    $ celery -A framework.celery.celery worker -l debug

which should give you daemon monitor output along the lines of

     -------------- celery@Harrys-MacBook-Air.local v3.0.21 (Chiastic Slide)
    ---- **** ----- 
    --- * ***  * -- Darwin-12.4.1-x86_64-i386-64bit
    -- * - **** --- 
    - ** ---------- [config]
    - ** ---------- .> broker:      amqp://guest@localhost:5672//
    - ** ---------- .> app:         __main__:0x10f5355d0
    - ** ---------- .> concurrency: 4 (processes)
    - *** --- * --- .> events:      OFF (enable -E to monitor this worker)
    -- ******* ---- 
    --- ***** ----- [queues]
     -------------- .> celery:      exchange:celery(direct) binding:celery

    [Tasks]
      . framework.email.email_tasks.send_email

    [2013-07-23 15:46:55,342: DEBUG/MainProcess] consumer: Ready to accept tasks!

-A framework.celery.celery worker

informs Celery which app instance to use and that it will be creating workers. Workers take tasks from the queue, process them, and return the result to the message broker.

-l debug

tells Celery that you want it to display log level debug output for testing purposes. Normally you would use -l info for log level info output.

Now, let’s make sure we have some Celery workers up and running

    $ ps aux | grep celery --color

Note the concurrency number shown when we launched the Celery daemon. This is the number of processor cores detected and, in turn, the number of worker processes that should have been launched. The grep output from the previous command should leave you with that many lines similar to

    hairycode       37992   0.1  0.4  2495644  33448 s001  S+    3:20PM   0:00.74 /Users/hairycode/git/staging-celery/venv/bin/python /Users/hairycode/git/staging-celery/venv/bin/celery -A framework.celery.celery worker -l debug

Detailed information about launching the Celery daemon can be found here or from the shell

    $ celery --help

Testing with IPython

Note: I am using IPython from the root directory in the code segment below. You could just as easily, well, maybe not as easily, use the standard Python interpreter or write a test script in Python. But IPython is awesome. I like awesome things.

Executing our Task

    # import celery
    import celery

    # import our send_email task
    from framework.email.email_tasks import send_email

    # call our email function
    result = send_email.delay('', 'all your smtp are belong to us', 'somebody set up us the bomb')

    type(result)

If you look at your Celery daemon, you can see the task coming in, being processed, returning the result, and even how long it took to execute. For example, the call above gave me the following output

    [2013-07-23 15:48:29,145: DEBUG/MainProcess] Task accepted: framework.email.email_tasks.send_email[09dad9cf-c9fa-4aee-933f-ff54dae39bdf] pid:39336
    [2013-07-23 15:48:30,600: DEBUG/MainProcess] Start from server, version: 0.9, properties: {u'information': u'Licensed under the MPL.  See http://www.rabbitmq.com/', u'product': u'RabbitMQ', u'copyright': u'Copyright (C) 2007-2013 VMware, Inc.', u'capabilities': {u'exchange_exchange_bindings': True, u'consumer_cancel_notify': True, u'publisher_confirms': True, u'basic.nack': True}, u'platform': u'Erlang/OTP', u'version': u'3.1.3'}, mechanisms: [u'AMQPLAIN', u'PLAIN'], locales: [u'en_US']
    [2013-07-23 15:48:30,601: DEBUG/MainProcess] Open OK!
    [2013-07-23 15:48:30,602: DEBUG/MainProcess] using channel_id: 1
    [2013-07-23 15:48:30,604: DEBUG/MainProcess] Channel open
    [2013-07-23 15:48:30,607: INFO/MainProcess] Task framework.email.email_tasks.send_email[09dad9cf-c9fa-4aee-933f-ff54dae39bdf] succeeded in 1.46279215813s: True

some_task.delay() vs some_task.apply_async()

some_task.delay() is a convenient way of calling your task, as it looks like a regular function call. However, it is shorthand for calling some_task.apply_async(); apply_async() is a more powerful and flexible method for calling your tasks. Detailed information on both can be located here.
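For example, apply_async() takes the task arguments explicitly and accepts execution options such as countdown; a quick sketch (the address and text are just placeholders):

    from framework.email.email_tasks import send_email

    # equivalent to send_email.delay('testymctester@test.com', 'hi', 'hello there')
    send_email.apply_async(args=['testymctester@test.com', 'hi', 'hello there'])

    # the same task, but not executed until at least 60 seconds from now
    send_email.apply_async(args=['testymctester@test.com', 'hi', 'hello there'],
                           countdown=60)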

Executing our task — more realistically

The AsyncResult is the object Celery hands back as soon as the task is submitted; the long string attached to it is the task_id. More often than not, you won’t even assign the call to a variable. Waiting on the result (with .get()) would hold up our app until the task had completed, and that wouldn’t make much sense, would it? Rather, you will simply call the delay or apply_async function and let your code continue on, like this

    # import celery
    import celery

    # import our send_email task
    from framework.email.email_tasks import send_email

    # call our email function
    send_email.delay('', 'all your smtp are belong to us', 'somebody set up us the bomb')

Remember, we still have the task id from the earlier call. If you want to check the status or result of what we just submitted, you can do so by asking the result backend

    # grab the AsyncResult 
    result = celery.result.AsyncResult('09dad9cf-c9fa-4aee-933f-ff54dae39bdf')

    # print the task id
    print result.task_id
    09dad9cf-c9fa-4aee-933f-ff54dae39bdf

    # print the AsyncResult's status
    print result.status
    SUCCESS

    # print the result returned 
    print result.result
    True
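If you actually need the task’s return value and are willing to block for it, you can also call get() on the AsyncResult (the 10-second timeout here is arbitrary):

    # block for up to 10 seconds waiting for the result
    print result.get(timeout=10)
    True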

This is a very basic rundown. If you want much more detailed information, I would recommend checking out the Calling Tasks section of Celery’s documentation.

Headache Number Two: My Celery daemon is only receiving every other task? Wat.

This little bug took me entirely too long to solve. At some point, I started noticing that exactly half of the .delay() calls I was making were permanently stuck in a state of PENDING.

For example, running this

    ###IPython output

    from framework.email.email_tasks import send_email

    send_email.delay('', 'all your smtp are belong to us', 'somebody set up us the bomb')

    send_email.delay('', 'all your smtp are belong to us', 'somebody set up us the bomb')

Gave the following output from the Celery daemon

    [2013-07-22 18:18:44,576: DEBUG/MainProcess] Task accepted: tasks.test[0e55bfed-1f05-4700-90fe-af3dba34ced5] pid:7663
    [2013-07-22 18:18:44,583: DEBUG/MainProcess] Start from server, version: 0.9, properties: {u'information': u'Licensed under the MPL.  See http://www.rabbitmq.com/', u'product': u'RabbitMQ', u'copyright': u'Copyright (C) 2007-2012 VMware, Inc.', u'capabilities': {u'exchange_exchange_bindings': True, u'consumer_cancel_notify': True, u'publisher_confirms': True, u'basic.nack': True}, u'platform': u'Erlang/OTP', u'version': u'2.8.4'}, mechanisms: [u'PLAIN', u'AMQPLAIN'], locales: [u'en_US']
    [2013-07-22 18:18:44,585: DEBUG/MainProcess] Open OK!
    [2013-07-22 18:18:44,585: DEBUG/MainProcess] using channel_id: 1
    [2013-07-22 18:18:44,586: DEBUG/MainProcess] Channel open
    [2013-07-22 18:18:44,589: INFO/MainProcess] Task framework.email.email_tasks.send_email[0e55bfed-1f05-4700-90fe-af3dba34ced5] succeeded in 2.0180089473724s: True

0e55bfed-1f05-4700-90fe-af3dba34ced5 was there but af3846a9-4a31-4a8d-99a4-0d990d51ef22 wasn’t.

I restarted my Celery daemon. Same thing.

I restarted my RabbitMQ server. Same thing.

I created an entire new project and followed the First Steps with Celery docs. Same thing.

Confused, I searched around, but I could only find one other person who had encountered something similar, and that issue was over a year old. Note: I tried his solution, but it didn’t resolve the issue.

The trick was that somewhere along the line I had started another set of Celery workers that were not part of the daemon I had just launched. Those workers were taking tasks from the queue, and I was never getting the results back. I was able to recreate the same bug by running a second instance of the RabbitMQ server.

Remember when I told you to ensure that only one RabbitMQ server and the correct number of concurrent Celery workers were running by checking your processes? This is why. Don’t make my mistake.

Let’s Improve our Setup

Adding logs

Adding logs was pretty straightforward. First, we need to modify our celeryconfig.py to specify where we want our logs:

    # celeryconfig.py

    # default RabbitMQ broker
    BROKER_URL = 'amqp://'

    # default RabbitMQ backend
    CELERY_RESULT_BACKEND = 'amqp://'

    # specify location of log files
    CELERYD_LOG_FILE="/path/to/your/logs/celery.log"

Now, we implement logging within the task itself.

After importing the required function, we grab the logger associated with our Celery app

    logger = get_task_logger(__name__)

Then, at the desired point, log a custom message at log level info. Note: If you want to log at another level, e.g. debug, you would use logger.debug(…)

    logger.info('Sending email from: %r, to: %r' % (fro, to))

The resulting email_tasks.py looks like:

    import smtplib
    from email.mime.text import MIMEText
    from framework.celery.celery import celery
    # import the Celery log getter
    from celery.utils.log import get_task_logger

    # grab the logger for the Celery app
    logger = get_task_logger(__name__)

    @celery.task
    def send_email(to=None, subject=None, message=None):
        """sends email from hairycode-noreply to specified destination

        :param to: string destination address
        :param subject: subject of email
        :param message: body of message

        :return: True if successful
        """
        # prep message
        fro = "hairycode-noreply@hairycode.org"
        msg = MIMEText(message)
        msg['Subject'] = subject
        msg['From'] = fro
        msg['To'] = to

        # log desired message to info level log
        logger.info('Sending email from: %r, to: %r' % (fro, to))

        # send message
        s = smtplib.SMTP('mail.hairycode.org')
        s.ehlo()
        s.starttls()
        s.ehlo()
        s.login('', '')
        s.sendmail(fro, [to], msg.as_string())
        s.quit()
        return True

And that’s it! After implementing logging, your tasks should be adding messages to their respective log files, e.g.:

    [2013-07-23 15:48:29,145: INFO/MainProcess] Sending email from: hairycode-noreply@hairycode.org, to: 'testymctester@test.com'

Conclusion

Learning Celery has been… frustrating. The above examples barely begin to scratch the surface of what it’s capable of. It is an incredibly powerful and configurable tool. I would, however, like to see a more responsive community, but I understand we are all busy people. Queuing tasks is a necessity for any major application, and I’m beginning to develop a love-hate relationship with Celery. More to follow?

-H.


SciPy 2013: Day Four/Five

Day Four: Talking, myself and so many others talking…

I’ve been to several poster sessions since going back to school two summers ago. However, this was the first session that I presented in. My poster was for Scholarly. Scholarly is a personal research project involving the network of scholarly citations; it started off as a flop and has grown into a little monster. Feel free to read more about it on my current projects page, visit the GitHub repository, or download the poster. It was nice to see so many people interested in the project. Almost everyone instantly realized the importance this dataset will have, and many asked the question: when can we access the data? Unfortunately, I don’t have a solid answer to that question yet. But we are making progress, and our data collection servers should be on the web soon! Now, on to what other people spoke about.

Thursday was, well, more talks. By the end of the day I was quite exhausted. While many of them were incredibly interesting, I guess I’ve discovered my limit for sitting and listening to people while I clack away at a computer: three days.

One talk I found of particular interest was Analyzing IBM Watson experiments with IPython Notebook by Torsten Bittner from IBM. I won’t go into all of the details, but the Watson team was able to take ~8,350 SLOC of Java used for training and testing Watson down to ~220 SLOC in an IPython notebook. To make this even more impressive, it ran faster in the notebook. And with that, I’ll move on to day five.

Day Five: Let the sprints commence!

Sprints? At a software conference? If you’re picturing 100 sweaty programmers running down hallways for fresh coffee, you’re a bit off. There is coffee and occasionally sweat. But very, very little running.

The sprints are an opportunity for open source projects to get people involved regardless of their skill level. It’s a pretty cool experience. You’re surrounded by programmers from all backgrounds, some legendary, most you’ve never heard of before, but all of whom are very approachable and patient.

After attending the morning pre-sprint session, all of the projects distributed themselves among various rooms and went to work. I decided to try something a bit different today.

With all of this ranting and raving I’ve been doing about IPython notebooks, I decided I needed to see what I could do with one myself. Here is the result. I was able to successfully use the notebook while benchmarking some search queries against a MongoDB instance populated with fake citation data. It wasn’t cumbersome to do, and documenting as I went along was actually quite enjoyable.

Time for a beer.

-H.
