The environment and our problem
As a thought experiment, let us pretend that we have a website that hosts user projects. Users are allowed to upload files to their projects. Individual files are limited to 100 MB, but there are no file type restrictions. Some time later, our users start asking for version control support. We note that the majority of files submitted by our users are either text files or source code and decide that Git is a good option for implementing a version control system.
Fast forward another couple of months. Our user base has continued to grow steadily and their projects are getting larger and larger. We realize that we have a problem: our server is going to run out of local disk space soon. What’s the best workaround for this while sticking to our shoestring budget? That is a very good question.
Is there a simple solution?
When I was asked to think of a solution, my mind immediately jumped to Amazon Web Services. Specifically, I thought of moving our server to an Elastic Compute Cloud (EC2) instance connected to an Elastic Block Store (EBS) volume for storage. Combining the power and resource scalability of an EC2 instance with a fast, easy-to-use EBS volume could solve the problem. However, after a bit of reflection, some very big issues started to arise.
EBS volumes can only be mounted by one EC2 instance at a given time.
Mounting to a single EC2 instance makes scaling our EC2 setup horizontally problematic. If we need to have several EC2 instances communicating with the EBS volume, we have to set up a middleman to act as a go-between for the EC2 instances doing the work and the EBS volume storing our Git repositories. So what, right? Well, now we have the added expense and maintenance (however minimal they may be) of another EC2 instance, and we’ve introduced a single point of failure. But those aren’t our only problems.
There is no way to easily resize EBS volumes.
Let’s assume for a second we went ahead and implemented the EC2 with an EBS volume setup. We have created a 500 GB EBS volume and mounted it to our EC2 instance. All is well for another few months until our estimates show that our Git repositories will surpass 500 GB within two weeks. In order to increase the size of our EBS volume, we have to create an additional EBS volume at a new set size (let’s say 1 TB), mount it to our EC2 instance, and copy everything over from the original EBS volume. Only then can we destroy the old EBS volume. Suddenly I’m having flashbacks to my professor discussing how to resize arrays in my data structures class. And don’t forget about that shoestring budget we’re on.
At $0.10 per GB per month, we’ve just increased our price from $50/month to $100/month for space that we aren’t using. Furthermore, we don’t even know when it will be used, and as soon as we approach the 1 TB mark, it will be time to start the process all over again. If we follow the same model of doubling our size, we will again double our costs for unused space. Why not increase the size of our EBS volume in smaller increments more frequently? It doesn’t solve the problem; we are still purchasing unused space while adding complexity to our system architecture.
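The arithmetic behind that doubling strategy can be sketched in a few lines of Python. This is only an illustration: the price is the $0.10 figure quoted above, 1 TB is treated as 1000 GB to match the $100 figure, and the function names are my own.

```python
EBS_PRICE_PER_GB = 0.10  # USD per GB per month, the rate quoted above


def monthly_cost(provisioned_gb, price_per_gb=EBS_PRICE_PER_GB):
    """Monthly bill for a volume: we pay for provisioned space, used or not."""
    return provisioned_gb * price_per_gb


def provisioned_size(start_gb, used_gb):
    """Double the volume size until it can hold used_gb."""
    size = start_gb
    while size < used_gb:
        size *= 2
    return size


if __name__ == "__main__":
    # Usage just below, just above, and well above the original 500 GB volume.
    for used in (400, 501, 1001):
        size = provisioned_size(500, used)
        print("used=%d GB provisioned=%d GB cost=$%.2f idle=%d GB"
              % (used, size, monthly_cost(size), size - used))
```

Crossing a size boundary by a single gigabyte doubles the bill, and nearly all of the new space sits idle.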
Is there an alternative solution? Perhaps a solution that will alleviate our bottleneck and resizing issues? There might be. And it could be much, much cheaper.
Introducing Amazon Simple Storage Service (S3)
To quote Amazon, “S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web.” Basically, S3 provides us with a way to read, write, and delete files of almost any size (up to 5 TB per file) very cheaply. At a mere $0.076 per GB for the first 49 TB, and even less if we want more storage, we see some solid savings. Before adding our input/output costs, we have just cut costs to roughly three quarters of what an equivalent EBS volume would have run us.
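As a quick sanity check on that comparison, here is a small sketch using only the two prices quoted in this post. It ignores S3 request and transfer charges, and it assumes the first S3 pricing tier applies to everything we store.

```python
EBS_PRICE_PER_GB = 0.10   # USD per GB per month, billed on provisioned space
S3_PRICE_PER_GB = 0.076   # USD per GB per month, billed on stored data (first tier)


def ebs_cost(provisioned_gb):
    """Monthly EBS cost for a volume of the given provisioned size."""
    return provisioned_gb * EBS_PRICE_PER_GB


def s3_cost(stored_gb):
    """Monthly S3 storage cost, excluding request and transfer charges."""
    return stored_gb * S3_PRICE_PER_GB


if __name__ == "__main__":
    stored = 500  # GB of repositories we actually hold
    print("S3:          $%.2f" % s3_cost(stored))
    print("EBS (500GB): $%.2f" % ebs_cost(500))
    print("EBS (1TB):   $%.2f" % ebs_cost(1000))
    print("S3 / EBS at equal size: %.2f" % (s3_cost(stored) / ebs_cost(stored)))
```

The gap widens once the EBS volume has been doubled, because S3 only bills for what is actually stored.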
What about resizing?
S3 is truly scalable. There is no limit to how large our S3 account can become or how many files we can store in it. Furthermore, we never need to worry about logging into our AWS account and increasing our allotted size, or about all of this nonsense of copying data from one volume to another. This is all done automatically, allowing us to focus on what’s important: our data.
Okay, and the bottleneck?
Again, our problems have been resolved. There is no need to have a middleman connecting S3 to our servers. We can access S3 from anywhere, on any server, through API calls. My language of choice is Python, so I have to give a shout out to boto. The good folks who developed boto have provided us with an incredibly simple way to interact with S3. Don’t take my word for it, though; Amazon wrote its own tutorial on how to use it.
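To give a feel for what those API calls look like, here is a minimal sketch assuming the original boto library (version 2, not boto3) with AWS credentials already configured in the environment or a boto config file. The helper names, the bucket name, and the key names are hypothetical.

```python
def open_bucket(bucket_name):
    """Connect to S3 with boto and return an existing bucket."""
    import boto  # imported lazily so the helpers below do not require boto
    conn = boto.connect_s3()  # reads credentials from the environment/config
    return conn.get_bucket(bucket_name)


def upload(bucket, key_name, local_path):
    """Store the file at local_path under key_name in the bucket."""
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(local_path)


def download(bucket, key_name, local_path):
    """Fetch key_name from the bucket into local_path."""
    key = bucket.get_key(key_name)  # returns None when the key does not exist
    if key is None:
        raise KeyError(key_name)
    key.get_contents_to_filename(local_path)


# Example usage (hypothetical bucket and key names; needs network access):
#   bucket = open_bucket("my-project-files")
#   upload(bucket, "projects/demo/notes.txt", "/tmp/notes.txt")
#   download(bucket, "projects/demo/notes.txt", "/tmp/notes-copy.txt")
```

Any server holding valid credentials can run the same calls, which is why no go-between instance is needed.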
What’s the catch?
There’s always a catch, right? Right. S3 isn’t exactly designed to work in the way we want it to for our problem. Unlike a storage device such as a hard drive, USB stick, or even an EBS volume, there is no file system. In the simplest terms possible, we can think of S3 as a file dump like Google Drive or Dropbox. Only, instead of uploading files through a browser or desktop application, we access it via API calls. These uploaded files are stored in “buckets,” which can be thought of as directories in an abstract sense. How is this a problem?
Well, let’s say we want to store a Git repository on S3 that looks like:
/.git/
/.git/branches/...
/.git/hooks/...
...
In short, we can’t. There are some solutions using file naming conventions that can be interpreted as directory structures but, to be quite frank, they seem very “hackish.” I don’t like “hackish” solutions in production environments.
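For the curious, the naming-convention trick looks roughly like the sketch below. It is pure Python and makes no S3 calls: the list of keys stands in for a bucket's flat namespace, and the function names and repository name are made up. S3's own listing API offers a similar prefix-and-delimiter filter, which is what these conventions lean on.

```python
def to_key(repo_name, relative_path):
    """Flatten a path inside a repository into one slash-separated key."""
    return repo_name.strip("/") + "/" + relative_path.strip("/")


def list_dir(keys, prefix):
    """Emulate a directory listing over flat keys.

    Returns the immediate children of prefix; names ending in "/" are
    pseudo-directories inferred from longer keys, not real objects.
    """
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    children = set()
    for key in keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            if rest:
                head, sep, _ = rest.partition("/")
                children.add(head + sep)
    return sorted(children)


if __name__ == "__main__":
    keys = [
        to_key("myrepo", "/test1.py"),
        to_key("myrepo", "/.git/HEAD"),
        to_key("myrepo", "/.git/config"),
        to_key("myrepo", "/.git/hooks/pre-commit.sample"),
        to_key("myrepo", "/.git/refs/heads/master"),
    ]
    print(list_dir(keys, "myrepo"))
    print(list_dir(keys, "myrepo/.git"))
```

Every "directory" here is an illusion rebuilt by scanning key names, which is a big part of why the approach feels fragile for something as filesystem-heavy as Git.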
A possible workaround
I did some digging around and found that, as is always the case, I wasn’t the first person to try to store Git repositories on S3. Some very smart people took a very interesting approach using a Filesystem in Userspace (FUSE). S3FS, or “FUSE over Amazon,” allows us to mount S3 buckets as a local filesystem that we can read from and write to transparently. This sounds pretty cool, but just how fast can a fake file system that has been placed on S3 and mounted locally be? Let’s find out.
Benchmarking S3FS with Git
Consistency concerns with S3
S3 has an eventual consistency model. This term means that after writing, updating, or deleting a file it will eventually be viewable from all parties. However, there is no time limit regarding how long this may take. My personal experience has shown changes are generally visible within a second or two.
In 2011, the EU and US-west S3 regions implemented Read-After-Write consistency. This means that newly written objects should be visible to all parties immediately after the write has occurred. Our use case for S3 is to create a filesystem on top of S3 in which Git repositories can live, with buckets representing the top-level directories of our repositories.
Git very rarely deletes something. Instead, it either adds a new hash and snapshot or a pointer to the previous, unmodified version of a file. Since nearly all of our writes would therefore be new objects, I chose to use us-west, at the cost of slightly more expensive rates. Time to benchmark.
I conducted all of my benchmarks with Ubuntu 12.04. Identical setups were installed on a VirtualBox virtual machine running locally and on an EC2 instance. I will spare you the installation details. If you would like specific installation instructions, I would recommend checking out this blog post.
Each test repository contained the following files:
/test1.py
/test2.py
/seconddir/
/seconddir/test3.py
Then, each of the following actions was executed 10 times:
- initializing the git repository
- adding all files to the repository
- committing the new changes to the repository
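The original benchmark script isn't reproduced here, so the following is only a sketch of how such timings could be collected. It assumes git is installed, that user.name and user.email are configured, and that each run starts from a fresh copy of the test files inside the S3FS mount. The mount path in the usage comment is hypothetical.

```python
import subprocess
import time

# The three actions listed above, in the order they were run.
GIT_STEPS = [
    ("init", ["git", "init"]),
    ("add", ["git", "add", "."]),
    ("commit", ["git", "commit", "-m", "benchmark commit"]),
]


def time_command(cmd, cwd):
    """Run cmd inside cwd and return the elapsed wall-clock seconds."""
    start = time.time()
    subprocess.check_call(cmd, cwd=cwd)
    return time.time() - start


def benchmark_once(repo_dir):
    """Time one init/add/commit pass in repo_dir; returns {step: seconds}."""
    return dict((name, time_command(cmd, repo_dir)) for name, cmd in GIT_STEPS)


def average(runs):
    """Average each step over a list of per-run timing dictionaries."""
    return dict(
        (name, sum(run[name] for run in runs) / float(len(runs)))
        for name in runs[0]
    )


# Example usage (hypothetical paths; one fresh directory per run):
#   runs = [benchmark_once("/mnt/s3/run%d" % i) for i in range(10)]
#   print(average(runs))
```

Timing wall-clock seconds rather than CPU time matters here, since almost all of the cost is waiting on network round trips to S3.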
Surprisingly poor results
- Git init times: 130s
- Git add times: 15.5s
- Git commit times: 38.4s
WITH caching FROM ec2 instance ON Oregon
- Git init times: 103.2s
- Git add times: 12.2s
- Git commit times: 36.7s
WITH caching FROM local machine ON Oregon AND max_stat_cache_size=10000000
- Git init times: 138s
- Git add times: 16s
- Git commit times: 39s
WITH caching FROM ec2 instance ON Oregon AND max_stat_cache_size=10000000
- Git init times: 105.8s
- Git add times: 12.4s
- Git commit times: 37.7s
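To put the gap between the local machine and the EC2 instance into numbers, here is a small sketch that compares the two max_stat_cache_size=10000000 runs listed above. The timings are copied from those lists; the function name is my own.

```python
# Seconds, copied from the two max_stat_cache_size=10000000 result sets above.
LOCAL = {"init": 138.0, "add": 16.0, "commit": 39.0}
EC2 = {"init": 105.8, "add": 12.4, "commit": 37.7}


def percent_faster(slow, fast):
    """How much shorter `fast` is than `slow`, as a percentage of `slow`."""
    return (slow - fast) / slow * 100.0


if __name__ == "__main__":
    for step in ("init", "add", "commit"):
        gain = percent_faster(LOCAL[step], EC2[step])
        print("git %s: EC2 is %.1f%% faster than local" % (step, gain))
```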
As you can see, initialization, add, and commit times were terribly slow. Although operations run on the EC2 instance were noticeably faster than those run on my local machine, they were still nowhere near quick enough. On the bright side, pulling files from S3 was rather fast. Every pull with the tiny repositories in my example took less than one second. But the slow write times weren’t the only things I found to be a bit troubling.
During this process I also learned that S3FS mounts required root access, that mounts were not always stable, and that they would disconnect intermittently. Unfortunately, this idea doesn’t seem to be a workable solution either.
JGit is a Java implementation of Git hosted by Eclipse, the project behind the open source Java IDE. I found this article describing how to use JGit to store repositories on S3. However, I have no experience with JGit or its community, and this makes me hesitant to use it without a lot more research.
Remember how I said that pulling files from S3 was pretty fast? Actually, it was so strikingly fast compared to writing that it stuck in my head for a while and got me thinking about an alternative solution. What if we used the server’s local storage as a caching layer that works with our Git repositories until a project no longer needs immediate access? We could then use S3 as an external storage device for storing neatly packed repositories until they are needed.
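One possible reading of "neatly packed" is a compressed tarball per repository. The sketch below is just that, a sketch of the idea and not a finished design: the function names are my own, and `upload` is a hypothetical callable that would push the archive to S3.

```python
import os
import shutil
import tarfile


def pack_repository(repo_dir, archive_path):
    """Write repo_dir (including its .git directory) to a .tar.gz archive."""
    name = os.path.basename(os.path.normpath(repo_dir))
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(repo_dir, arcname=name)
    return archive_path


def evict_repository(repo_dir, archive_path, upload):
    """Pack an idle repository, hand it to upload(), then free local disk.

    `upload` is a hypothetical callable taking the archive path; the local
    copy is only removed after it returns without raising.
    """
    pack_repository(repo_dir, archive_path)
    upload(archive_path)
    shutil.rmtree(repo_dir)


def restore_repository(archive_path, parent_dir):
    """Unpack a previously packed repository underneath parent_dir."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(parent_dir)
```

Each repository then costs S3 one large write when it goes cold and one fast read when it is needed again, which plays to the strengths seen in the benchmarks.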
That sounds like an interesting, fun project; more to follow. Time to do some more brainstorming.
"If at first you don't succeed, Try, try, try again." - William Edward Hickson