Cold Storage Data Retrieval For AWS Glacier And Glacier Deep Archive
Storing data on so-called cold storage is really cheap. But restoring data takes time (up to several hours) and can be a bit tricky if you want to restore multiple files or directories.
Different Storage Classes
Simple Storage Service (S3) is Amazon’s cloud storage service. Depending on the chosen storage class, the monthly price and the time it takes to retrieve your data vary. Storing 500 GB with the S3 Standard storage class costs USD 11.50 per month, whereas Glacier only costs USD 2.00 per month (January 2021). It gets even cheaper if you choose Glacier Deep Archive: with this storage class you only pay USD 0.49 per month for storing 500 GB! So from a price point of view it is quite easy: just use Glacier or Glacier Deep Archive for backups and save a lot of money. So where is the big drawback?
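For reference, these figures follow directly from the per-GB prices (assuming the January 2021 list prices of roughly USD 0.023 per GB for S3 Standard, 0.004 for Glacier, and 0.00099 for Glacier Deep Archive; check the current prices in [2]):
# Rough monthly cost for storing 500 GB (per-GB prices assumed, see above)
echo "500 * 0.023"   | bc    # S3 Standard          -> 11.50 USD
echo "500 * 0.004"   | bc    # Glacier              ->  2.00 USD
echo "500 * 0.00099" | bc    # Glacier Deep Archive -> ~0.50 USD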
Cold Storage Vs Hot Storage — Choosing The Right Retrieval Options
If you need frequent access and want to download your files immediately, you have to use a so-called hot storage type. S3 Standard, for example, is such a hot storage type. Its so-called first-byte latency is specified in milliseconds. Cold storage, on the other hand, does not require fast access, so cloud providers can use less expensive storage media and probably do not power these devices all the time. Glacier and Glacier Deep Archive are cold storage types. Their first-byte latency ranges from minutes up to several hours.
So putting data on a cold storage type like Glacier or Glacier Deep Archive only makes sense if you access your files infrequently and do not mind that the retrieval process can take several hours. If you are willing to pay for quicker access, you can decrease the retrieval time by choosing the Expedited retrieval tier.
Restoring A Single File Using The Management Console
Let’s assume you just want to restore one particular file. In that case, the Management Console is probably the easiest way. If you want to restore multiple files or complete directories, using the Management Console would be time-consuming madness. The chapter Restoring Multiple Files Using The AWS Command Line Interface (CLI) And S3cmd below describes a better way to do this.
Let’s start: Go to the Management Console, search for S3, and select the bucket which holds the object you want to restore. Select the archived file, open the Actions menu, and choose Initiate restore. In the dialog, specify the number of days the restored copy should stay available and the retrieval tier, then start the restore.
Once the retrieval process has finished you can finally download your file. To do so, select the particular file within the bucket; under Actions you will find the Download option. Just initiate the download from there to your local machine. Note that the file will only be downloadable for the number of days you specified within the Initiate restore dialog under the option Number of days that the restored data is available.
Restoring Multiple Files Using The AWS Command Line Interface (CLI) And S3cmd
If you have to restore multiple files or complete directories the AWS Management Console would be a pain to use. It simply does not offer such functionality as a one-click operation. For this use case, the AWS Command Line Interface (CLI) [7] or the command-line tool S3cmd [6] are better choices. Both tools are available for Windows, Linux, and macOS and are easy to install. Please follow the given links [6, 7] to get them up and running.
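As a quick sanity check (a sketch, assuming both tools are already on your PATH), you can verify the installations and set up your credentials like this:
aws --version
aws configure        # enter your access key, secret key, default region and output format
s3cmd --version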
Restore Data Via The CLI
Assume you want to restore every single object within a given bucket. The following list shows all the steps to get your data back by running a few CLI commands within a shell.
- Get a list of all objects within a bucket
- Initiate the restoration process of these objects
- Finally, download your data
Get A List Of All Objects Within A Bucket
aws s3api list-objects-v2 \
--bucket my-cold-storage-bucket \
--query "Contents[?StorageClass==’GLACIER’]" \
--output text \
| awk -F'\t' '{print $2}' > glacier-object-list.txt
The above snippet takes the bucket my-cold-storage-bucket and queries all objects whose StorageClass is GLACIER. The option --output text creates a tab-separated list of all objects within the bucket. A single entry contains:
- ETag — a hash of the object
- Key — the object’s name
- Last modified date
- File size
- Storage type
"84b933df90e6a97bb1f41fec82e101ad" dir1/example.obj 2021–01–21T16:30:36.000Z 10244 GLACIER
The final pipe into the pattern scanning and processing tool awk [8] takes the second column (the key name of the object) of the text output and writes it into the file glacier-object-list.txt.
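Alternatively (a variant of the same command, no awk needed), you can let --query extract the key names directly via a JMESPath projection; with --output text the keys come back tab-separated, hence the tr call:
aws s3api list-objects-v2 \
--bucket my-cold-storage-bucket \
--query "Contents[?StorageClass=='GLACIER'].Key" \
--output text \
| tr '\t' '\n' > glacier-object-list.txt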
Initiate The Restore Process Of These Objects
To restore a particular file use the following snippet to initiate the restore process:
aws s3api restore-object \
--bucket my-cold-storage-bucket \
--key dir1/example.obj \
--restore-request '{
"Days" : 1,
"GlacierJobParameters" : {"Tier":"Bulk"}
}'
Please note that the chosen tier (Expedited, Standard, or Bulk) influences both the retrieval time and the cost of the restore process. The option Days specifies for how many days you want to have access to the restored object/file.
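For example, a quicker (and more expensive) restore request for the same example object simply swaps the tier; note that the Expedited tier is not available for objects stored in Glacier Deep Archive:
aws s3api restore-object \
--bucket my-cold-storage-bucket \
--key dir1/example.obj \
--restore-request '{
"Days" : 1,
"GlacierJobParameters" : {"Tier":"Expedited"}
}'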
To monitor the status of your request the following command can be used:
# Monitor the status of your restore request
aws s3api head-object \
--bucket my-cold-storage-bucket \
--key dir1/example.obj

# Still in progress
{
"Restore": "ongoing-request=\"true\"",
"StorageClass": "GLACIER",
...
}

# Restore has been completed
{
"Restore": "ongoing-request=\"false\", expiry-date=\"Mon, 15 Feb 2021 00:00:00 GMT\"",
"StorageClass": "GLACIER",
...
}
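If you prefer not to check this by hand, a minimal polling sketch could look like this (same example bucket and key, arbitrary polling interval, assuming the default JSON output shown above):
# Poll until the restore of one object has finished
while aws s3api head-object \
--bucket my-cold-storage-bucket \
--key dir1/example.obj \
| grep -qF 'ongoing-request=\"true\"'
do
echo "Restore still in progress, checking again in 5 minutes..."
sleep 300
done
echo "Restore finished"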
Next step: Loop over the complete list of files in glacier-object-list.txt to initiate the restore request:
while read KEY
do
echo "Restoring object: $KEY"
aws s3api restore-object \
--bucket my-cold-storage-bucket \
--key "$KEY" \
--restore-request '{
"Days" : 1,
"GlacierJobParameters" : {"Tier":"Bulk"}
}'
echo "Restore request for $KEY sent"
done < glacier-object-list.txt
The above script requests the restore process sequentially for all files in glacier-object-list.txt. For a huge number of files this script should be improved from a single-threaded solution to a multi-threaded one, i.e. spawn several workers which send the restore requests in parallel.
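One possible sketch for this (assuming the same example bucket and file list) uses xargs with the -P flag to run several restore requests in parallel:
# Send restore requests with 8 parallel workers (adjust -P to taste)
xargs -P 8 -I {} aws s3api restore-object \
--bucket my-cold-storage-bucket \
--key "{}" \
--restore-request '{"Days":1,"GlacierJobParameters":{"Tier":"Bulk"}}' \
< glacier-object-list.txt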
Once the restore process has finished successfully, you can finally download the restored files to your local machine with the following CLI command. All files will be synced into the local folder ./backup.
aws s3 sync s3://my-cold-storage-bucket ./backup --force-glacier-transfer
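If you only restored a part of the bucket, you can limit the download to a prefix; dir1 here is just an example directory:
aws s3 sync s3://my-cold-storage-bucket/dir1 ./backup/dir1 --force-glacier-transfer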
Restore Data With The Open Source Tool S3cmd
Another way of retrieving your data from S3 is the open-source Python tool S3cmd [6]. The GPL-2-licensed tool supports the complete chain for dealing with objects and buckets on S3 (upload, query, retrieve, delete, etc.).
Under macOS, probably the easiest way to install S3cmd is via the package manager brew [9]:
brew install s3cmd
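On Linux or Windows, one common alternative (assuming Python and pip are installed) is to install it from PyPI:
pip install s3cmd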
After installation you have to configure the tool
s3cmd --configure
and provide your access key and secret key, which you have to generate within your Amazon AWS account. S3cmd offers the option -r, --recursive which allows recursive operations. Retrieving the data of a bucket with all its objects and potential sub-directories can easily be done with the following command:
s3cmd restore \
--recursive s3://my-cold-storage-bucket \
--restore-priority=bulk \
--restore-days=7
If the restore process has been executed successfully (it can take minutes up to several hours), you can finally download the restored files to your local machine.
s3cmd sync s3://my-cold-storage-bucket/folder /destination/folder
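If you first want to see what would be transferred, s3cmd also offers a dry-run mode (-n, --dry-run) that only lists the planned operations:
s3cmd sync --dry-run s3://my-cold-storage-bucket/folder /destination/folder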
That’s it. You have now learned three different methods for retrieving your data from S3.
Summary
- The cost of storing data on S3 depends on your chosen storage class. The colder the storage, the cheaper it gets. For Glacier Deep Archive you pay less than for Glacier.
- Retrieving data takes minutes up to several hours. You can choose between the retrieval tiers Expedited, Standard, and Bulk to speed up the process. So only use Glacier or Glacier Deep Archive if you very rarely need to access these files; otherwise S3 Standard is the better choice.
- Retrieving data costs money too. The cost depends on the retrieval tier and the amount of data you want to restore. Calculate your retrieval costs beforehand with the AWS Price Calculator [3].
- Deleting data on Glacier that is younger than 90 days also incurs an extra charge.
- For retrieving single objects the AWS Management Console is handy to use. For complete directories or multiple files use the AWS CLI or the S3cmd tool.
References
- [1] Restoring Archived Objects: https://docs.aws.amazon.com/AmazonS3/latest/dev/restoring-objects.html
- [2] AWS S3 Pricing: https://aws.amazon.com/s3/pricing
- [3] AWS Price Calculator: https://calculator.aws
- [4] AWS Glacier API: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/glacier/index.html
- [5] AWS Glacier FAQ: https://aws.amazon.com/glacier/faqs/?nc1=h_ls
- [6] S3cmd: https://s3tools.org/s3cmd
- [7] Install AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html
- [8] AWK Linux Man Page: https://linux.die.net/man/1/awk
- [9] Brew Package Manager: https://brew.sh