Amazon Glacier and glaciercmd

I’m pretty sure that everyone who reads this blog is familiar with Amazon Web Services. If not, here’s the rundown. Amazon Web Services (AWS) is a HUGE web platform created and managed by Amazon. The biggest point of this platform is that anyone can make an account (provided you have a credit card) and start using it immediately. If you use only a few resources, you pay a small fee; if you use a lot, you pay a lot. AWS has become big with this pay-as-you-go method of billing, and it has its charms. However, it also means you can rack up thousands of dollars in bills by accidentally DDoSing yourself.

As said, the AWS platform is massive, with datacenters in the USA, the EU and Asia. Even though it offers a plethora of services, only three are interesting to me: EC2, S3 and the new Glacier. In this article, I’ll be looking at the last two specifically.

Amazon S3

Amazon Simple Storage Service (S3) is a cloud storage service. What does that mean? Think of Dropbox, and it’s basically just that. The main differences are the pay-as-you-go model, and that S3 is designed to store lots of files and serve them fast. S3 can be used to store the assets of websites, like images and files, and serve them to end users. Just pay Amazon and they’ll make sure your files get served.

Now, I find that Amazon S3 works great, especially with the s3cmd tool, which can be found in the Debian and Ubuntu repositories. s3cmd enables you to upload files to your S3 buckets from the command line. I use it for backups myself, and upload a copy of every backup I make to S3. At the moment, I have about 20G of backups in S3, and it’s expanding rapidly. This brings me to the main drawback of S3.

It can get expensive, fast. In my example above, with 20G of backups, I pay 20G * 0.125 = 2.5 USD a month. Not a whole lot. But backups are meant to be stored permanently, off-site. I currently still have about 400G of on-site backups as well. If I stored them all in S3, I would pay 420G * 0.125 = 52.5 USD a month.

To be fair, storing backups isn’t really the way S3 is intended to be used. Backups are a store-once, read-never-until-something-goes-wrong type of data. Typically, you don’t want to give anyone else access to your backups either, because they might contain sensitive data. S3 is meant to store a lot of files and serve them to a large audience, like photos on a website. Enter Amazon Glacier.

Amazon Glacier


Amazon Glacier (what, no fancy acronym this time?) is meant for exactly that: storing lots of data that does not have to be accessed frequently or quickly. The way it’s set up is entirely different from S3. Sure, you still have an API you can use, and you’re still storing files in the cloud. But apart from that, Amazon Glacier is:

  1. Slow
  2. Has no fancy AWS console
  3. Slow
  4. Cheap
  5. Did I mention slow?

Like the natural phenomenon it is named after, Glacier is REALLY slow. Whereas S3 has buckets to store files in, Glacier has vaults to store archives in. You can put stuff in your vaults, but getting it back out is another thing entirely. Vault file lists are only generated about once every 24 hours, and requesting one for download takes about four hours. Archives you put in vaults lose all their metadata, and are assigned an ID instead of a file name. Luckily, you can still attach a description to an archive, which is included in the vault file lists. Archives can be downloaded from vaults, but again, this takes about four hours.

The characteristics of Glacier may seem really poor, but think about it. Glacier is meant for archiving stuff: stuff you usually don’t access, but have to store safely and securely anyway. Stuff like backups, photos (as in backups or unprocessed RAW images), financial backlog (to comply with government regulations), and so on. For data like this, it doesn’t really matter if you have to wait a few hours before your download is ready, because the data access does not have to be real time. As long as the data is there when you need it.

Finally, unlike S3, Glacier is really cheap. If I were to store the entire 420G of backup data in Glacier, this would cost me 420G * 0.011 = 4.62 USD per month. A lot less than 52.5 USD.
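The two calculations can be put side by side in a few lines of Python. This is only a sketch: the per-GB prices are the ones quoted in this article, the function name is made up for illustration, and it assumes a flat storage fee, ignoring any request, retrieval or transfer charges.

```python
# Prices per GB per month as quoted in this article (assumption: flat rate,
# no request, retrieval or transfer fees are included).
S3_USD_PER_GB_MONTH = 0.125
GLACIER_USD_PER_GB_MONTH = 0.011


def monthly_cost(size_gb: float, usd_per_gb_month: float) -> float:
    """Return the monthly storage cost in USD, rounded to cents."""
    return round(size_gb * usd_per_gb_month, 2)


if __name__ == "__main__":
    # 20G is what currently sits in S3, 420G is everything including on-site backups.
    for size_gb in (20, 420):
        s3 = monthly_cost(size_gb, S3_USD_PER_GB_MONTH)
        glacier = monthly_cost(size_gb, GLACIER_USD_PER_GB_MONTH)
        print(f"{size_gb}G: S3 {s3:.2f} USD, Glacier {glacier:.2f} USD per month")
```

Swapping in other sizes or prices is just a matter of changing the constants.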

glaciercmd

The one thing that made S3 great to use was the availability of the s3cmd tool. Being able to use S3 from any script is great. To my dismay, no such tool existed for Glacier. So I decided to write one, the result of which can be found on the glaciercmd GitHub page.

glaciercmd supports all basic actions that can be done with Glacier.

  • It can list all your vaults in a given region
  • It can request a vault inventory, and poll the job until it’s finished, presenting you with the archives in the vault
  • It can upload files to a vault
  • It can request a download of archives from a vault, by creating the job and polling it until the data is ready
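To give an idea of what "create the job and poll it" looks like, here is a minimal Python sketch of the inventory case. It is not glaciercmd's code. The client object is passed in, and the method names and response keys (initiate_job, describe_job, get_job_output, "jobId", "Completed", "StatusCode", "body") are assumed to mirror the boto3 Glacier client. The function names, the polling interval and the retry limit are made up for illustration.

```python
import json
import time
from typing import Any, Callable


def wait_for_job(
    client: Any,
    vault_name: str,
    job_id: str,
    poll_seconds: float = 900.0,
    max_polls: int = 48,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll describe_job until the job reports Completed, then return its description."""
    for _ in range(max_polls):
        job = client.describe_job(accountId="-", vaultName=vault_name, jobId=job_id)
        if job.get("Completed"):
            return job
        # Glacier jobs take hours, so there is no point in polling aggressively.
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not complete after {max_polls} polls")


def fetch_inventory(client: Any, vault_name: str, **wait_kwargs: Any) -> dict:
    """Request a vault inventory, wait for the job, and return the parsed JSON."""
    started = client.initiate_job(
        accountId="-",
        vaultName=vault_name,
        jobParameters={"Type": "inventory-retrieval"},
    )
    job_id = started["jobId"]
    job = wait_for_job(client, vault_name, job_id, **wait_kwargs)
    if job.get("StatusCode") != "Succeeded":
        raise RuntimeError(f"job {job_id} ended with status {job.get('StatusCode')}")
    output = client.get_job_output(accountId="-", vaultName=vault_name, jobId=job_id)
    # The job output body is a file-like object containing the inventory as JSON.
    return json.loads(output["body"].read())
```

Assuming boto3 is installed and credentials are configured, something like fetch_inventory(boto3.client("glacier"), "my-vault") would be the way to call it, with "my-vault" as a placeholder name. Downloading an archive follows the same request, poll and fetch pattern, with a different job type.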

The only feature it’s really missing in my eyes is a fancy progress bar, or indeed any feedback on how fast you’re uploading or downloading. That’s for another day.

Let me know in the comments if you’ve tried glaciercmd, and what you think about it. If you want to contribute, just fork it on GitHub and send me a pull request.

  • devilzk

    Thanks for sharing great tool. I was looking for some options and was getting a feeling that in order to upload backup to glacier, I might need to code something in PHP for RestAPI.. but then found this tool. Thanks for your hardwork.

    • Lord_Gaav

      Nice, someone is using my tools :P. Make sure you use the development branch, as it contains some critical fixes.

  • Dave Roberts

    Thank you!

  • lkalif

    How big a file can you upload using this tool? I have some very big ones (>100GB).

    • Lord_Gaav

      The file size is stored internally as a Java long, which has a maximum value of 2^63 – 1, or something like 8388608 TB. However, whether it will actually work depends on a number of factors, including the quality of your network.

      So try it out and let me know :P.

  • andrew

    Hi, does the tool support https transfers to Glacier, and if so how should that be specified? (I.e., how does one make it use SSL? ). Thank you!

    • Lord_Gaav

      I’m not setting any specific HTTP/HTTPS options, so I think it’s using the default in the AWS SDK, which is HTTPS:

      * Communication over HTTPS is the default, and is more secure than HTTP, which
      * is why AWS recommends using HTTPS. HTTPS connections can use more system
      * resources because of the extra work to encrypt network traffic, so the option
      * to use HTTP is available in case users need it.

      • andrew

        Thanks! I will investigate further to make sure that the default in fact does use SSL, I don’t quite know how to check that. I appreciate your help.

        • Lord_Gaav

          The above comment is from the AWS SDK itself. You could also try to upload an unimportant largeish file using glaciercmd, and check what ports are being used using netstat. If using 443, it’s most likely HTTPS.

  • PaulNotBunyan

    I’ll give this a try. I was looking at some of the low cost online backup services but I’m not convinced it’s safe to install their closed source client binaries on any OS. I already have an AWS account with a few micro instances running Ubu 12.04 LTS. The pay for what you use feature really is great. I spend more on a cup of coffee than I’ll spend experimenting and testing with glaciercmd.

  • I’ve been looking at using Glacier rather than Crashplan for some off-site backups of my own personal stuff. Thanks for making some CLI tools!