Deploying an OpenStack undercloud/overcloud on a single server from my laptop with Ansible.

During the summer of 2014 I worked on the OpenStack Keystone component while interning at Red Hat. Fast forward to the end of October 2015 and I once again find myself working on OpenStack for Red Hat — this time on the RDO[1] Continuous Integration (CI) team. Since re-joining Red Hat I’ve developed a whole new level of respect not only for the wide breadth of knowledge required to work on this team but for deploying OpenStack in general.

The list of deployment options for OpenStack[2a,b,c,d] is long and has a colorful history. Furthermore, there are probably only a handful of people who have developed and/or used more than a handful of them. I have the fortune of working with, or at least within the proximity of, several of these folks. However, even with that advantage, I find wrapping my head around all of the moving parts involved in deploying OpenStack confusing. This was a prime source of frustration during my first several weeks back at Red Hat. Routinely I found myself having to accept the magic of a given deployment tool in order to move forward with my tasks.

Presently, the RDO CI team uses Khaleesi[3][4], and by proxy Ansible, as one of the deployment tools for automating builds with Jenkins[5]. During this walkthrough I will follow the deployment steps as outlined in Khaleesi’s cookbook[6] to deploy OpenStack using RDO-Manager[7] on a single, baremetal CentOS 7.2 server.

Assumptions:

  1. One machine is set up[8] as the controller from which you will generate the necessary Ansible configuration and execute the appropriate playbooks within Khaleesi. In my case this is a ThinkPad X1 running Fedora 22 — my work laptop.
    1. Note: Make sure you follow all of the steps in the Khaleesi setup guide on the controller or you will run into problems when trying to use ksgen or Ansible.
  2. One machine with a minimum of 1 quad core CPU, 12 GB of memory, and 120 GB of free space, as outlined by the RDO-Manager docs[9], that is running a clean install of CentOS 7.2
    1. Note: Because our installer, RDO-Manager, is based on TripleO[2d], we do in fact need to use a baremetal machine for this. I couldn’t find a detailed explanation specifying the exact reasons anywhere, but the limitation is noted in both the TripleO docs[2d] and the Khaleesi docs[10].

My setup meeting the two requirements above looks like this:

Screenshot from 2016-01-11 17-57-38

Configuration:

Okay, let us move to the actual setup. We will be picking up at the Configuration[11] portion of the Khaleesi cookbook. Remember that all of these commands are to be run from the machine you installed Khaleesi on, the ThinkPad X1 in my case.

    cp ansible.cfg.example ansible.cfg
    touch ssh.config.ansible
    echo "" >> ansible.cfg
    echo "[ssh_connection]" >> ansible.cfg
    echo "ssh_args = -F ssh.config.ansible" >> ansible.cfg

We begin by copying the example Ansible config kept in version control, which carries some defaults needed across Khaleesi use cases, and then telling Ansible to use the SSH config file that will be generated by Khaleesi, ssh.config.ansible.

    ssh-copy-id root@<ip address of baremetal virt host>  # x.x.x.49 in my example

ssh-copy-id allows us to easily transfer our SSH keys to the CentOS box. This tool has quickly become one of my go-tos, as I am constantly provisioning systems and it removes many of the possible human errors involved in key transfers.

    export TEST_MACHINE=<ip address of baremetal virt host>  # x.x.x.49 in my example

The playbook we will end up calling, khaleesi/playbooks/full-job-no-test.yml, expects the TEST_MACHINE environment variable to be set and uses it while generating the hosts file used by Ansible.

    ksgen --config-dir settings generate \
        --provisioner=manual \
        --product=rdo \
        --product-version=liberty \
        --product-version-build=last_known_good \
        --product-version-repo=delorean \
        --distro=centos-7.0 \
        --installer=rdo_manager \
        --installer-env=virthost \
        --installer-images=build \
        --installer-network-isolation=none \
        --installer-network-variant=ml2-vxlan \
        --installer-post_action=none \
        --installer-topology=minimal \
        --installer-tempest=smoke \
        --workarounds=enabled \
        --extra-vars @../khaleesi-settings/hardware_environments/virt/network_configs/none/hw_settings.yml \
        ksgen_settings.yml

Note: If you see warnings similar to the line directly below, don’t worry. There is a set of defaults and it is simply informing you which it will be using if no respective parameter was handed to it when called.

    settings.py:105|       _load_defaults() | WARNING: '--installer-network' hasn't been provided, using 'neutron' as default

Remember that tool we installed from within the Khaleesi repository? That was ksgen, a tool that generates a file, ksgen_settings.yml in our case, which contains most of the variables used by Ansible during the execution of Khaleesi’s playbooks. The parameters above line up with files underneath khaleesi/settings and pull in the corresponding variables while magically handling any conflicts that may arise. For example, `--provisioner=manual` will include all variables located within khaleesi/settings/provisioner/manual.yml as well as khaleesi/settings/provisioner/common/common.yml, as indicated by the include statement at the top of the aforementioned manual.yml.
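
To make the layering idea concrete, here is a minimal Python sketch of how variables from several settings files could be combined, with later files overriding earlier ones. It only illustrates the concept and is not ksgen’s actual implementation; the `deep_merge` helper and the sample dictionaries are hypothetical stand-ins for parsed YAML files.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict in which values from override win over base.

    Nested dicts are merged recursively; any other value is replaced.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents standing in for provisioner/common/common.yml
common = {"provisioner": {"type": "unset", "retries": 3}}
# Hypothetical contents standing in for provisioner/manual.yml
manual = {"provisioner": {"type": "manual"}}

settings = deep_merge(common, manual)
print(settings)
```

The more specific file only has to state what differs from the common defaults; everything else is inherited.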

This is a pretty basic setup. A few of the parameters are of particular note, namely:

    --provisioner=manual

We have provisioned the CentOS box ourselves as opposed to using say Beaker or Foreman (both of which are supported provisioners by Khaleesi)

    --installer=rdo_manager

RDO-Manager is our tool of choice here for the actual installation of OpenStack on our CentOS box.

    --installer-env=virthost

Our undercloud/overcloud deployment will be installed on virtual machines running on the CentOS box, TEST_MACHINE. Accordingly, Khaleesi will need to use respective virthost playbooks — as opposed to baremetal playbooks were we to install our nodes on actual boxes. 

    --extra-vars @../khaleesi-settings/…

Once upon a time all of the settings files that reside underneath khaleesi/settings lived in another repository aptly named khaleesi-settings. It still exists: we use it internally for storing sensitive data needed for our CI infrastructure that we wouldn’t want public, and it retains some things like the virtual networking settings needed for the ml2-vxlan argument passed to ksgen. Why exactly does khaleesi-settings still exist upstream? To be frank, I’m not quite sure, but I’ll update this post when I have a rational answer.

The result of calling ksgen is a concise YAML file, ksgen_settings.yml. You can rename it to whatever you want; just be sure to pass it to your ansible-playbook calls accordingly. This file is infinitely useful and will quickly become your best friend whenever you have to troubleshoot failures with Ansible.

Deployment:

Now we are ready to call Khaleesi’s playbook khaleesi/playbooks/full-job-no-test.yml, which will provision TEST_MACHINE (a minimal step in our case, as we’ve already done so manually) and then use RDO-Manager to deploy an undercloud and overcloud in virtual machines that are hosted on our CentOS box.

    ansible-playbook -vv --extra-vars @ksgen_settings.yml -i local_hosts playbooks/full-job-no-test.yml

If Ansible doesn’t throw an error within the first 10 seconds, which would indicate that something is likely messed up in either your Ansible config file or ksgen_settings.yml, feel free to go and stretch your legs, as these playbooks can take a while to finish. The console output at the end of the playbook’s execution should look similar to:

    PLAY [Global post install] ****************************************************
                        [[ previous play time: 0:00:02.259124 = 2.26s / 3836.33s ]]
    skipping: no hosts matched

    PLAY RECAP ********************************************************************
    host0                      : ok=126  changed=81   unreachable=0    failed=0
    localhost                  : ok=21   changed=7    unreachable=0    failed=0
    overcloud-cephstorage-0    : ok=1    changed=1    unreachable=0    failed=0
    overcloud-controller-0     : ok=2    changed=2    unreachable=0    failed=0
    overcloud-novacompute-0    : ok=1    changed=1    unreachable=0    failed=0
    undercloud                 : ok=123  changed=74   unreachable=0    failed=0

                        [[ previous task time: 0:00:00.029086 = 0.03s / 3836.35s ]]
                        [[ previous play time: 0:00:00.018960 = 0.02s / 3836.35s ]]
                 [[ previous playbook time: 1:03:56.350566 = 3836.35s / 3836.35s ]]
                       [[ previous total time: 1:03:56.350779 = 3836.35s / 0.00s ]]
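
If you capture that console output to a file, a few lines of Python can confirm that nothing failed without scrolling back through an hour of logs. The sketch below assumes the recap format shown above (host, then ok/changed/unreachable/failed counters); `parse_recap` and `failed_hosts` are illustrative helpers of my own, not part of Ansible or Khaleesi.

```python
import re

# Matches recap lines such as:
# "host0    : ok=126  changed=81   unreachable=0    failed=0"
RECAP_LINE = re.compile(
    r"^\s*(?P<host>\S+)\s*:\s*ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)"
    r"\s+unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)"
)


def parse_recap(output):
    """Return {host: {counter: int}} for every recap line found in output."""
    results = {}
    for line in output.splitlines():
        match = RECAP_LINE.match(line)
        if match:
            counts = match.groupdict()
            host = counts.pop("host")
            results[host] = {name: int(value) for name, value in counts.items()}
    return results


def failed_hosts(results):
    """Return the sorted hosts with failed tasks or that were unreachable."""
    return sorted(
        host for host, counts in results.items()
        if counts["failed"] or counts["unreachable"]
    )


sample = """\
host0                      : ok=126  changed=81   unreachable=0    failed=0
undercloud                 : ok=123  changed=74   unreachable=0    failed=0
"""
recap = parse_recap(sample)
print(failed_hosts(recap))
```

An empty list means every host in the recap finished cleanly.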

You should now have a fully functional undercloud and overcloud running on TEST_MACHINE that is similar to the grossly simplified graphic below.

Screenshot from 2016-01-11 17-57-54

Conveniently, you can log directly into the undercloud from the root Khaleesi directory by using the ssh config generated by Khaleesi.

    ssh -F ssh.config.ansible undercloud
    Warning: Permanently added 'x.x.x.49' (ECDSA) to the list of known hosts.
    Warning: Permanently added 'undercloud' (ECDSA) to the list of known hosts.
    Last login: Wed Jan 13 18:33:22 2016 from gateway
    [stack@instack ~]$ ls

Cleanup:

Once you wrap up whatever it is you want to do with your new deployment, wiping out the overcloud and performing cleanup is as simple as calling another of Khaleesi’s playbooks.

    ansible-playbook -vv --extra-vars @ksgen_settings.yml -i hosts playbooks/cleanup.yml

Final Thoughts:

Ansible and Khaleesi make it very easy to deploy OpenStack in a reproducible manner, provided you have everything configured correctly beforehand. The vast majority of the time I spend fixing problems while working with Khaleesi comes down to configuration mistakes.

From the 40 or so lines we’ve entered into our shells, an enormous number of subsequent actions have taken place through Khaleesi’s playbooks. I could spend days diving into each one of them. I’m sure I will eventually, but it’s nice to know that I can do so as time permits, thanks to Khaleesi and Ansible.

Things I’d like to write more about in the future:

  1. A more in depth breakdown of what is happening in each of the playbooks used during this deployment — or a deployment of a similar nature.
  2. Khaleesi’s purpose, history, and potential future.
  3. The product pipeline from OpenStack (upstream) -> RDO -> Red Hat OpenStack, aka RHOS, (downstream) and others.
  4. Anything you as an audience would like to hear more about related to my work.

 

[1] – https://www.rdoproject.org/rdo/faq/

[2a] – Devstack: http://docs.openstack.org/developer/devstack/

[2b] – Staypuft: https://github.com/theforeman/staypuft

[2c] – Packstack: https://wiki.openstack.org/wiki/Packstack

[2d] – TripleO: https://wiki.openstack.org/wiki/TripleO

[3] – https://github.com/redhat-openstack/khaleesi

[4] – http://khaleesi.readthedocs.org/en/master/khaleesi.html

[5] – https://jenkins-ci.org/

[6] – http://khaleesi.readthedocs.org/en/master/cookbook.html

[7] – https://repos.fedorapeople.org/repos/openstack-m/rdo-manager-docs/liberty/

[8] – http://khaleesi.readthedocs.org/en/master/khaleesi.html#prerequisites

[9] – https://repos.fedorapeople.org/repos/openstack-m/rdo-manager-docs/liberty/environments/virtual.html#minimum-system-requirements

[10] – http://khaleesi.readthedocs.org/en/master/cookbook.html#installation

[11] – http://khaleesi.readthedocs.org/en/master/cookbook.html#configuration

 


Graduation (Yester)Day

Last night I received a neat little email from the University of North Carolina at Greensboro. I’ll leave out the verbose output from legal and finance and get to the fun bits that in short read:

Dear Harry

On behalf of the University Registrar’s Office, it is my pleasure to inform you that we have completed the final processing of your academic record.

Congratulations, all degree requirements have been met as of the posted graduation date, December 10th, 2015.

What a thrill it is to know that I have finally completed that damned degree. It was about a year ago that I decided to not return to school in January for the lone remaining class I required, Biology 101 (why is it always a gen. ed. class?). Instead I rode off to the great, mountainous North that is Charlottesville, Virginia.

Living in what will possibly be my favorite apartment ever, I worked as a web application developer for a small non-profit that I had previously interned with — CoS (the Center for Open Science). To them and in particular Dr. Jeffrey Spies and Dr. Joshua Carp I will be forever grateful. The slightly less than a year that I spent in that cozy, gorgeous city was one of great professional and personal development. But all states in life, aside from death I suppose, are transitionary and in October I moved back down to the warm and humid North Carolina to what has been one of my favorite cities of all time — Raleigh.

To me, Raleigh is bikes, beer, coding, music, and friends. More than that, Raleigh is the perfect blending of them in an active city that is neither too large nor too small. It is, at least for now, a good base of operations and one which I want to leave better off than I found it. I don’t know how long I’ll be here. I don’t know how long I’ll be at Red Hat. But I do know that I’m happy to have completed my undergraduate degree, to be actively contributing to the open source community professionally, and to be living in a city that feels like home.

I suppose I should change the blog description now.


API Testing with Flask: Post

Have you ever tried to test POST requests to an API written with Flask? Today I was going through an old code base, writing unit tests that should have been written eons ago, when I ran into a snag. The issue arose when I tried manually setting the Content-Type header.

But before we get to that, let me show you the skeleton of the Flask route I was testing:

    import json

    from flask import Flask, Response, request

    app = Flask(__name__)


    @app.route('/someendpoint', methods=['POST'])
    def some_endpoint():
        """API endpoint for submitting data.

        :return: status code 405 - invalid JSON or invalid request type
        :return: status code 400 - unsupported Content-Type or invalid publisher
        :return: status code 201 - successful submission
        """
        # Ensure the POST's Content-Type is supported
        if request.headers['content-type'] == 'application/json':
            # Ensure the data is valid JSON
            try:
                user_submission = json.loads(request.data)
            except ValueError:
                return Response(status=405)
            # ... some magic stuff happens
            everything_went_well = True  # placeholder for the result of the elided logic
            if everything_went_well:
                return Response(status=201)
            else:
                return Response(status=405)

        # User submitted an unsupported Content-Type
        else:
            return Response(status=400)

Ignoring the magic, everything seems in order. So, let’s go ahead and write a quick unit test that posts invalid JSON.

    def test_invalid_JSON(self):
        """Test status code 405 from improper JSON on post to raw"""
        response = self.app.post('/someendpoint',
                                data="This isn't a json... it's a string!")
        self.assertEqual(response.status_code, 405)

Cool, let’s run it!

    Failure
    self.assertEqual(response.status_code, 405)
    AssertionError: 400 != 405

A quick glance at the code and I realize I’ve forgotten to set the Content-Type header! Wait. How do I set the Content-Type? That’s a good question. After searching around for people who had run into similar problems, this is what I came up with:

    def test_invalid_JSON(self):
        """Test status code 405 from improper JSON on post to raw"""
        response = self.app.post('/someendpoint',
                                data="This isn't a json... it's a string!",
                                headers={'content-type': 'application/json'})
        self.assertEqual(response.status_code, 405)

Let’s try this again.

    Failure
    self.assertEqual(response.status_code, 405)
    AssertionError: 400 != 405

Hmmm, okay. Next, I decided to inspect the request coming into the Flask app and found something odd in request.headers:

    Host: localhost
    Content-Length: 10
    Content-Type: 

Why is the Content-Type empty? Another search hinted at another possible solution. Why not just build the headers dict inline?

    headers=dict(content-type='application/json') # But that's not right. We can't have '-' in a key.

By this point I’ve become agitated. Neither the Flask docs themselves nor various forums have been of much use. Then I stumbled across this.

Perhaps I missed something in the docs. Either way, I learned that you can hand the Content-Type as a parameter to the post method. It works and it looks much cleaner. Let’s revise my initial unit test accordingly:

    def test_invalid_JSON(self):
        """Test status code 405 from improper JSON on post to raw"""
        response = self.app.post('/someendpoint',
                                data="not a json",
                                content_type='application/json')
        self.assertEqual(response.status_code, 405)

And let’s run this one last time and look at the request as it comes through.

    Tests passed
    Host: localhost
    Content-Length: 10
    Content-Type: application/json

Much better. Now, back to testing!
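
For anyone who wants to try this without my code base, here is a small self-contained sketch of the same idea. The endpoint is a simplified stand-in for the skeleton above (the "magic" is left out entirely), the names are hypothetical, and it assumes Flask is installed.

```python
import json
import unittest

from flask import Flask, Response, request

app = Flask(__name__)


@app.route('/someendpoint', methods=['POST'])
def some_endpoint():
    """Stand-in endpoint: 400 for a bad Content-Type, 405 for bad JSON."""
    if request.mimetype != 'application/json':
        return Response(status=400)
    try:
        json.loads(request.data)
    except ValueError:
        return Response(status=405)
    return Response(status=201)


class SomeEndpointTests(unittest.TestCase):
    def setUp(self):
        self.app = app.test_client()

    def test_missing_content_type(self):
        # No content_type given, so the endpoint rejects the request.
        response = self.app.post('/someendpoint', data='not a json')
        self.assertEqual(response.status_code, 400)

    def test_invalid_json(self):
        # content_type is passed directly to post() instead of via headers.
        response = self.app.post('/someendpoint', data='not a json',
                                 content_type='application/json')
        self.assertEqual(response.status_code, 405)

    def test_valid_json(self):
        response = self.app.post('/someendpoint', data=json.dumps({'a': 1}),
                                 content_type='application/json')
        self.assertEqual(response.status_code, 201)
```

Saved as a module, the tests can be run with `python -m unittest <module name>`.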

-H.


Resumes…

Currently, I am attempting to trim my CV down to a length appropriate for a resume. To be honest, it’s not as easy as I would have thought. Wouldn’t it be simpler if a resume could just read the following:

Favorite language: Python. Why? It’s awesome and so is the community built up around it.

Favorite IDE: None. I like Vim (shhhh emacs lovers). Again, why? It’s quick, powerful, and I can use it on any of my machines or while I’m remoted into some server. I value consistency.

Favorite OS: Hands down, any of the Linux derivatives. Does this require justification? Nope.

Favorite place to code: Anywhere with a nice view and lots of natural light.

Any other questions? Let’s discuss this over a coffee or tea. And don’t forget to check out my GitHub account!

-H.


Everyone makes mistakes: Even the web’s inventor

Earlier today, while reading Interactive Data Visualization for the Web, I came across an interesting fact I would like to share. We all know the preamble for web addresses, http://. I can’t fathom how many times I’ve typed it myself. Apparently the creator of the World Wide Web, Tim Berners-Lee, regrets making the double slash part of the standard, as he stated during an interview with the New York Times. To summarize, he described it as a convention of programmers at the time.

I wonder how many double slashes have been typed, printed, and read since the standard’s creation. But hey, there is always room for improvement, right?

As an aside, I highly recommend the aforementioned book, Interactive Data Visualization for the Web. Written by Scott Murray, it provides an excellent introduction to data visualization with D3 and basic web programming. What makes this even cooler? O’Reilly is offering an online version of the book that includes interactive examples, and it’s free! Click the previous link to view the book directly or visit this O’Reilly page for more information about the paper version!

“First make something work. Then, make it work better.” – Anonymous

-H.


The Division Of Generation Y

Many of my private thoughts put down as words.

Thought Catalog

America’s Generation Y can be divided into two distinct groups: Those who served in Iraq and/or Afghanistan, such as myself, and those who didn’t. Taking an educated guess, I assume a lion’s share of the readership of Thought Catalog are liberal arts degree bearing, student-loan debt ridden types who think those who joined the military were too stupid to go to college and were unaware cogs in the political war machine run by evil multi-national corporations with the goal of maximizing profit and exploiting the lower class. In turn, we think you’re a bunch of overly sensitive, pretentious, hyper-liberal pussies, so its even. Now, let’s begin to gain an understanding of each other’s perspective.

Our memories of our formative years are quite different. You headed out into early adulthood going to community college or university, be it full-time or part-time. You may have gotten a student loan, a scholarship, paid…


DevTalks: Bridging the gap between student and developer

As my fifth semester cruises along at what feels like Mach speed, I find myself neck-deep in finite state machines, language grammars, parse trees, and wave optics. Despite putting in more hours than any previous semester (I started tracking out-of-class work hours with Klok2), I feel more prepared to handle it than I would have thought possible. But why do I feel so ready to tackle mountains of homework with countless hours spent at the whiteboard? I think the primary reason is that over the past couple of summers I’ve had two amazing internships that allowed me, pushed me even, to use the concepts I learned at school in the real world. This has given me a much deeper understanding of and appreciation for school. I can recall the first two semesters and how pointless what I was learning felt at times. Without my practical experience from internships and personal projects, would I still feel that way? Probably.

This brings me to a problem I have noticed among my peers: they have a severe lack of motivation. More often than not, I watch people with glazed-over eyes in my classes who at worst do the minimum to pass and at best do the minimum to earn an A. What are they missing? They don’t have the drive they need. But I’m trying an experiment this semester to see if I, with the help of some very smart people, can change this.

Introducing Developer Talks (DevTalks for short). From the DevTalks frontpage:

Would you like to learn something new, meet like-minded individuals, or give back to the developer community? If so, you’re in the right spot. DevTalks is a developer driven community whose goals are to promote the sharing of knowledge and encourage collaboration among developers regardless of their background or level of expertise.

My plan is to round up some of the more motivated students and local developers in the area and throughout the semester, shake out a couple engaging 1-2 hour talks that can be followed by 1-2 hour small breakout sessions where attendees will work together to improve upon or re-implement what they’ve learned together. I believe that once students see their peers leading talks and working with local developers they will become more inclined to participate.

There are two primary benefits I see coming from this. First, students have a lot to gain from learning from and working with developers. Many questions that simply can’t be answered in the classroom can be resolved on the spot. And, let’s not forget, there is just no replacement for real experience. Second, local developers and managers can use these meetings to spot potential interns and employees, locally. Anyone can submit a resume. However, leading one of these talks can demonstrate a much deeper level of understanding and leadership potential than can ever be conveyed on a sheet of paper.

So far, I have received very positive feedback from my peers as well as local developers. Several people wanting to give talks have already reached out, and a local coworking space, Studioboro, has offered to host our first talk on October 1st.

If you would like more information please visit the landing page. If you would like to attend a talk, give a talk, donate (if you feed them they will come), or just share your insights please feel free to email me!

“Be the change you wish to see.” ~ Mahatma Gandhi

-H.


Working the storage scalability problem

The environment and our problem

As a thought experiment, let us pretend that we have a website that hosts user projects. Users are allowed to upload files to their projects. Individual files are limited to 100 MB, but there are no file type restrictions. Some time later, our users start asking for version control support. We note that the majority of files submitted by our users are either text files or source code and decide that Git is a good option for implementing a version control system.

Fast forward another couple of months. Our user base has continued to grow steadily and their projects are getting larger and larger. We realize that we have a problem; our server is going to run out of local disk space soon. What’s the best work around for this while sticking to our shoestring budget? That is a very good question.

Is there a simple solution?

When I was asked to think of a solution, my mind immediately jumped to Amazon Web Services. Specifically, moving our server to an Elastic Compute Cloud (EC2) instance connected to an Elastic Block Store (EBS) volume for storage. Combining the power and resource scalability of an EC2 instance with a fast, easy-to-use EBS volume could solve the problem. However, after a bit of reflection, some very big issues started to arise.

EBS volumes can only be mounted by one EC2 instance at a given time.

Mounting to a single EC2 instance makes scaling our EC2 setup horizontally problematic. If we need to have several EC2 instances communicating with the EBS volume, we have to set up a middleman to act as a go-between for the EC2 instances doing the work and the EBS volume storing our Git repositories. So what, right? Well, now we have the added expense and maintenance (however minimal they may be) of another EC2 instance, and we’ve introduced a single point of failure. But those aren’t our only problems.

There is no way to easily resize EBS volumes.

Let’s assume for a second we went ahead and implemented the EC2 with an EBS volume setup. We have created a 500 GB EBS volume and mounted it to our EC2 instance. All is well for another few months until our estimates show that our Git repositories will surpass 500 GB within two weeks. In order to increase the size of our EBS volume we have to create an additional EBS volume at a new set size, let’s say 1 TB, mount it to our EC2 instance, copy everything over from the original EBS volume, and finally destroy the old EBS volume. Suddenly I’m having flashbacks to my professor discussing how to resize arrays in my data structures class. And don’t forget about that shoestring budget we’re on.

At $0.10 per GB we’ve just increased our price from $50/month to $100/month for space that we aren’t using. Furthermore, we don’t even know when it will be used and as soon as we approach the 1 TB mark, it will be time to start the process all over again. If we follow the same model of doubling our size, we will again double our costs for unused space. Why not increase the size of our EBS volume in smaller amounts more frequently? It doesn’t solve the problem; we are still purchasing unused space while adding complexity to our system architecture.
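
The arithmetic behind that jump is easy to reproduce. The sketch below uses the $0.10 per GB monthly price quoted above and, as an assumption, treats 1 TB as 1000 GB to match the $100/month figure; the function names are mine.

```python
from decimal import Decimal

PRICE_PER_GB = Decimal("0.10")  # monthly EBS price quoted in this post


def monthly_cost(provisioned_gb):
    """Monthly bill for a volume of the given provisioned size."""
    return provisioned_gb * PRICE_PER_GB


def unused_cost(provisioned_gb, used_gb):
    """Portion of the monthly bill paid for space that holds no data."""
    return (provisioned_gb - used_gb) * PRICE_PER_GB


# Assumption: 1 TB is treated as 1000 GB.
for provisioned_gb, used_gb in [(500, 500), (1000, 500)]:
    print(provisioned_gb,
          monthly_cost(provisioned_gb),
          unused_cost(provisioned_gb, used_gb))
```

Right after the resize, half of the new monthly bill pays for empty space, and every further doubling repeats the pattern.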

Is there an alternative solution? Perhaps a solution that will alleviate our bottleneck and resizing issues? There might be. And it could be much, much cheaper.

Introducing Amazon Simple Storage Service (S3)

To quote Amazon, “S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.” Basically, S3 provides us with a way to read, write, and delete files of almost any size (up to 5 TB per file) very cheaply. At a mere $0.076 per GB for the first 49 TB, and even less if we want more storage, we see some solid savings. Before adding our input/output costs, we have just cut costs to roughly 3/4 of what an equivalent EBS volume would have run us.
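
The per-GB price is only part of the saving, because S3 bills for what is stored rather than what is provisioned. A rough comparison, using the two prices quoted in this post and a hypothetical 520 GB of repositories sitting on a freshly resized 1 TB (taken as 1000 GB) EBS volume, could look like this. Request and transfer costs are ignored here.

```python
from decimal import Decimal

EBS_PRICE_PER_GB = Decimal("0.10")   # per provisioned GB per month, from this post
S3_PRICE_PER_GB = Decimal("0.076")   # per stored GB per month, first tier, from this post


def ebs_monthly(provisioned_gb):
    """EBS charges for the full provisioned size, used or not."""
    return provisioned_gb * EBS_PRICE_PER_GB


def s3_monthly(stored_gb):
    """S3 charges only for the data actually stored."""
    return stored_gb * S3_PRICE_PER_GB


stored_gb = 520        # hypothetical usage just past the old 500 GB volume
provisioned_gb = 1000  # the resized EBS volume, 1 TB taken as 1000 GB

print("EBS:", ebs_monthly(provisioned_gb))
print("S3: ", s3_monthly(stored_gb))
print("price ratio:", S3_PRICE_PER_GB / EBS_PRICE_PER_GB)
```

The price ratio alone gives the "3/4" figure; paying only for stored data widens the gap further whenever a volume has headroom.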

What about resizing?

S3 is truly scalable. There is no limit to how large our S3 account can become or how many files we can store in it. Furthermore, we never need to worry about logging into our AWS account and increasing our allotted size, or all of this nonsense regarding copying data from one volume to another; this is all done automatically allowing us to focus on what’s important, our data.

Okay, and the bottleneck?

Again, our problems have been resolved. There is no need to have a middleman connecting S3 to our servers. We can access S3 from anywhere, on any server, through API calls. My language of choice is Python, so I have to give a shout out to boto. The good folks that developed boto have provided us with an incredibly simple way to interact with S3. Don’t take my word for it though; Amazon wrote its own tutorial on how to use it.

What’s the catch?

There’s always a catch, right? Right. S3 isn’t exactly designed to work in the way we want it to for our problem. Unlike a storage device such as a hard drive, USB stick, or even an EBS volume, there is no file system. In the simplest terms possible, we can think of S3 as a file dump like Google Drive or Dropbox. Only, instead of uploading files through a browser or desktop application, we access it via API calls. These uploaded files are stored in “buckets,” which can be thought of as directories in an abstract sense. How is this a problem?

Well, let’s say we want to store a Git repository on S3 that looks like:

    /.git/
    /.git/branches/...
    /.git/hooks/... 
    ...

In short, we can’t. There are some solutions using file naming conventions that can be interpreted as directory structures but to be quite frank, they seem very “hackish.” I don’t like “hackish” solutions in production environments.
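
For the curious, the naming-convention trick boils down to treating the "/" characters inside flat object names as if they were directory separators. The sketch below shows the idea on a plain Python list; it is not tied to any S3 client, and the `list_directory` helper and the sample keys are hypothetical.

```python
def list_directory(keys, prefix=""):
    """List the immediate children of prefix in a flat collection of keys.

    Names ending in '/' represent sub-"directories"; others are files.
    """
    children = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        if not remainder:
            continue
        # Keep only the first path component after the prefix.
        head, sep, _ = remainder.partition("/")
        children.add(head + sep)
    return sorted(children)


# A flat list of object names standing in for a bucket's contents.
keys = [
    ".git/HEAD",
    ".git/config",
    ".git/hooks/pre-commit.sample",
    ".git/branches/placeholder",
    "README",
]

print(list_directory(keys))
print(list_directory(keys, ".git/"))
```

Nothing here is a real directory: an empty "directory" cannot exist, and renaming one means rewriting every key beneath it, which is a large part of why the approach feels fragile.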

A possible workaround

I did some digging around and found that, as is always the case, I wasn’t the first person to try to store Git repositories on S3. Some very smart people took a very interesting approach using a Filesystem in Userspace (FUSE). S3FS, or “Fuse over Amazon,” allows us to mount S3 buckets as a local filesystem that we can read from and write to transparently. This sounds pretty cool, but just how fast can a fake file system that has been placed on S3 and mounted locally be? Let’s find out.

Benchmarking S3FS with Git

Consistency concerns with S3

S3 has an eventual consistency model. This term means that after a file is written, updated, or deleted, the change will eventually be visible to all parties. However, there is no time limit regarding how long this may take. My personal experience has shown changes are generally visible within a second or two.

In 2011, EU and US-west region S3 instances implemented Read-After-Write consistency. This means that new writes should be visible to all parties immediately after the write has occurred. Our use case for S3 is to create a filesystem on top of S3 in which Git repositories can be created, letting buckets represent the top-level directories of our repositories.

Git very rarely deletes something. Instead it either adds a new hash and snapshot or a pointer to the previous, unmodified version of a file. Because almost everything Git does is a new write, Read-After-Write consistency covers nearly all of our operations. As a result, I chose to use us-west, at the cost of slightly more expensive rates. Time to benchmark.

The setup

I conducted all of my benchmarks with Ubuntu 12.04. Identical setups were installed on a VirtualBox VM, running locally, as well as on an EC2 instance. I will spare you the installation details. If you would like to have specific installation instructions, I would recommend checking out this blog post.

After installing S3FS and setting up a test bucket, I created the following file structure:

    /test1.py
    /test2.py
    /seconddir/
    /seconddir/test3.py

Then, each of the following actions was executed 10 times:

  1. initializing the git repository
  2. adding all files to the repository
  3. committing the new changes to the repository
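The steps above can be sketched as a small Python script. This is not the exact harness I used, just a minimal reconstruction under a few assumptions: `git` is on the PATH, `base_dir` points somewhere inside the mounted bucket, and the helper names (`time_command`, `make_test_tree`, `benchmark_git`, `run_benchmark`) and the `bench` commit identity are made up for illustration.

```python
import subprocess
import time
from pathlib import Path


def time_command(cmd, cwd):
    """Run cmd in cwd and return the wall-clock seconds it took."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True, capture_output=True)
    return time.perf_counter() - start


def make_test_tree(root):
    """Create the small file structure used for the benchmark."""
    root = Path(root)
    (root / "seconddir").mkdir(parents=True, exist_ok=True)
    for name in ("test1.py", "test2.py", "seconddir/test3.py"):
        (root / name).write_text("print('hello')\n")


def benchmark_git(repo_dir):
    """Time git init, add, and commit once each inside repo_dir."""
    make_test_tree(repo_dir)
    steps = {
        "init": ["git", "init"],
        "add": ["git", "add", "."],
        # Placeholder identity so the commit works on a fresh machine.
        "commit": ["git", "-c", "user.name=bench",
                   "-c", "user.email=bench@example.com",
                   "commit", "-m", "benchmark commit"],
    }
    return {name: time_command(cmd, repo_dir) for name, cmd in steps.items()}


def run_benchmark(base_dir, runs=10):
    """Repeat the benchmark in a fresh subdirectory for each run."""
    results = []
    for i in range(runs):
        repo_dir = Path(base_dir) / f"run_{i}"
        repo_dir.mkdir(parents=True, exist_ok=True)
        results.append(benchmark_git(repo_dir))
    return results
```

Calling `run_benchmark` with a directory on the S3FS mount returns one dictionary of timings per run, which can then be summed or averaged per operation.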

Surprisingly poor results

WITH caching FROM local machine ON Oregon

  • Git init times: 130s
  • Git add times: 15.5s
  • Git commit times: 38.4s

WITH caching FROM ec2 instance ON Oregon

  • Git init times: 103.2s
  • Git add times: 12.2s
  • Git commit times: 36.7s

WITH caching FROM local machine ON Oregon AND max_stat_cache_size=10000000

  • Git init times: 138s
  • Git add times: 16s
  • Git commit times: 39s

WITH caching FROM ec2 instance ON Oregon AND max_stat_cache_size=10000000

  • Git init times: 105.8s
  • Git add times: 12.4s
  • Git commit times: 37.7s

As you can see, initialization, add, and commit times were terribly slow. Although operations run on the EC2 instance were noticeably faster than those run on my local machine, they are still nowhere near quick enough. On the bright side, pulling files from S3 was rather fast. Every pull with the tiny repositories in my example took less than one second. But the slow write times weren’t the only thing I found troubling.
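To put a number on “noticeably faster,” here is a quick bit of arithmetic on the first two result sets above (with caching, default cache size). The `percent_faster` helper is just a name I made up for this sketch.

```python
# Timings in seconds as reported above (with caching, Oregon region).
local = {"init": 130.0, "add": 15.5, "commit": 38.4}
ec2 = {"init": 103.2, "add": 12.2, "commit": 36.7}


def percent_faster(slow, fast):
    """Return how much shorter `fast` is than `slow`, as a percentage."""
    return (slow - fast) / slow * 100


speedups = {op: percent_faster(local[op], ec2[op]) for op in local}
for op, pct in speedups.items():
    print(f"git {op}: EC2 was {pct:.1f}% faster than the local machine")
```

That works out to roughly 21% for init and add, but only about 4% for commit.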

During this process I also learned that S3FS mounts required root access, were not always stable, and would disconnect intermittently. Unfortunately, this idea doesn’t seem to be a workable solution either.

Other Solutions

JGit

JGit is a Java implementation of Git hosted by Eclipse, the open source Java IDE project. I found this article describing how to use JGit to store repositories on S3. However, I have no experience with JGit or its community, which makes me hesitant to use it without a lot more research.

Caching?

Remember how I said that pulling files from S3 was pretty fast? Actually, it was so strikingly fast compared to writing that it stuck in my head for a while and got me thinking about an alternative solution. What if we used the server’s local storage as a caching layer that works with our Git repositories until the project no longer needs immediate access? We could then use S3 as an external storage device for neatly packed repositories until they are needed.
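As a very rough sketch of what the “neatly packed” part might look like, the snippet below packs a repository directory into a single compressed archive and restores it into a local cache directory. Everything here is an assumption for illustration: the function names are made up, a tar archive is only one possible packing format, and the actual upload to and download from S3 is left out entirely.

```python
import tarfile
from pathlib import Path


def pack_repo(repo_dir, archive_path):
    """Pack a repository directory into a gzip-compressed tar archive."""
    repo_dir = Path(repo_dir)
    with tarfile.open(archive_path, "w:gz") as archive:
        # Store everything under the repository's own directory name.
        archive.add(repo_dir, arcname=repo_dir.name)
    return Path(archive_path)


def unpack_repo(archive_path, cache_dir):
    """Restore a packed repository into the local cache directory.

    Only use this on archives you created yourself; extracting
    untrusted tar files is not safe.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(cache_dir)
    return Path(cache_dir)
```

Writing one archive per repository would mean one large write to S3 instead of many small ones, which is exactly where the S3FS approach struggled.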

That sounds like an interesting, fun project; more to follow. Time to do some more brainstorming.

"If at first you don't succeed, 
Try, try, try again." - William Edward Hickson

-H.

GitHub offers free student accounts?!

Last fall I started using GitHub to help me maintain and disseminate projects I work on. Not sure what GitHub is? According to its main page, GitHub is “Powerful collaboration, code review, and code management for open source and private projects.”

Built on top of the incredibly powerful version control system Git, GitHub provides users a friendly, intuitive wrapper for Git’s somewhat confusing system on the web (or the desktop!). 

I could go on and on about how much I love Git or how GitHub has made working with peers/co-workers much less stressful, code reviews a pleasure, and handling complicated branches a breeze. It is pretty much awesome, right? Right. But it just got even MORE awesome!

One of my co-workers informed me she had signed up for a free student upgrade with GitHub. By default anyone can sign up and maintain as many public repositories as they like. However, in order to maintain private repositories one would need to sign up for at least a micro plan. Unaware of a student discount and always in search of a deal (every penny counts as a college student!) I wrote the GitHub support staff a quick email about my recent revelation.

In less than 24 hours the very friendly (and humorous) folks at GitHub had replied and upgraded my account to a free micro plan for two years! That’s a total savings of $168 (assuming prices don’t go up between now and then). If only more companies were as student friendly. I give a tip of my hat to the folks at GitHub. I will be sure my local ACM chapter, peers, and developer friends know about this.

For more information about GitHub’s student plan you can visit: https://github.com/edu

Additionally, I highly recommend reading more about Git itself. The good people who created it were kind enough to post a detailed, easy-to-read, and free book online here. If you are a book-in-hand kind of person, you can grab one from Amazon too!

Thanks, GitHub!

-H.
