Amazon S3 is one of the most popular cloud storage services. It can be helpful to use the traditional file system as a mental model when you start working with S3, but there are architectural differences you must understand if you are doing anything serious in the cloud. In this post I review some foundational and advanced S3 features that have come up in AWS consulting sessions with clients.
- Object storage
Amazon S3 is not a file system; it is an object store – a flat structure of objects and their containers (buckets). There are no folders, subfolders or hierarchy. Buckets and objects are addressed by keys, not by file names and paths. The S3 management console does create the impression of a hierarchical folder structure in your browser, so that you can browse object storage as if it were a file system – but don’t let this impression shape your understanding of the backend.
AWS client tools present objects in a similar way. You get the impression of files and folders, but the tools are really listing objects that share a common key prefix: some entries appear as files, others as sub-folders marked with a PRE (prefix) label. A folder you create explicitly in the console is itself a zero-byte object whose key ends with the / delimiter, while a folder that is displayed only because other keys share its prefix disappears once the last of those objects is deleted.
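The folder illusion comes from listing with a prefix and delimiter. The sketch below is a pure-Python imitation of how S3's ListObjectsV2 rolls keys up into "common prefixes" (the bucket contents and key names are hypothetical examples, and this is an illustration of the grouping logic, not a call to the real API):

```python
# A flat keyspace: there are no real folders, just keys.
keys = [
    "photos/2021/cat.jpg",
    "photos/2021/dog.jpg",
    "photos/readme.txt",
    "index.html",
]

def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Mimic S3 ListObjectsV2 Prefix/Delimiter grouping: keys that share
    a prefix up to the next delimiter are rolled up into a single
    'common prefix', which clients display as a folder."""
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Roll the key up under its "folder" prefix.
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)

objects, prefixes = list_with_delimiter(keys)
print(objects)    # ['index.html']
print(prefixes)   # ['photos/'] — shown as a folder, but it is only a shared prefix
```

Drilling "into" a folder is just another listing with a longer prefix, e.g. `list_with_delimiter(keys, prefix="photos/")`.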
- Limits and Storage classes
Buckets are private by default, and bucket names must be globally unique – not merely unique within a region. Objects are stored as binary large objects (blobs) in buckets and can range from 0 bytes to 5 terabytes. There is no limit on the number of objects in a bucket – no matter how many objects you store, you get the same performance. There is, however, a default limit of 100 buckets per AWS account, which should be considered in your storage architecture. The largest object you can upload in a single PUT request is 5 GB; multipart upload is recommended for anything over 100 MB and required beyond 5 GB.
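The size thresholds above can be made concrete with a small sketch. The helper below is hypothetical (it is not part of any AWS SDK) and simply decides between a single PUT and multipart upload, counting the parts a given part size would produce:

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

SINGLE_PUT_LIMIT = 5 * GB        # largest object accepted in one PUT request
MULTIPART_THRESHOLD = 100 * MB   # size above which multipart upload is recommended

def plan_upload(size_bytes, part_size=100 * MB):
    """Return ('single', 1) or ('multipart', n_parts) for an object
    of the given size. Illustrative helper, not an AWS API."""
    if size_bytes <= MULTIPART_THRESHOLD:
        return ("single", 1)
    # Between 100 MB and 5 GB multipart is the recommended path;
    # above SINGLE_PUT_LIMIT a single PUT is rejected, so it is mandatory.
    return ("multipart", math.ceil(size_bytes / part_size))

print(plan_upload(50 * MB))   # ('single', 1)
print(plan_upload(1 * GB))    # ('multipart', 11)
```

Real multipart uploads also cap the number of parts, so very large objects need a proportionally larger part size.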
Objects can be stored under several storage classes for different use cases. Note that the storage class is a property of each individual object, not of the bucket – a single bucket can hold objects of different classes. At the time of writing, S3 offers the following storage classes:
- S3 Standard: for frequently accessed data; designed for 99.99% availability, with a 99.9% availability SLA
- S3 Standard-IA: for infrequently accessed data; designed for 99.9% availability, with a 99% availability SLA
- S3 One Zone-IA: data is stored in a single availability zone; designed for 99.5% availability, with a 99% availability SLA
- Amazon Glacier: for archival data, with high retrieval latency (minutes to hours)
All S3 storage classes provide 11 9’s (99.999999999%) durability and support lifecycle policies, and all except S3 One Zone-IA store data redundantly across multiple availability zones.
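The figures above can be collected into a small lookup table. The sketch below uses the StorageClass identifiers from the S3 API (STANDARD, STANDARD_IA, ONEZONE_IA, GLACIER); the table itself is just a restatement of the list for programmatic use:

```python
# Availability figures from the list above; every class is designed for
# 11 9's durability. Glacier's availability is omitted here because it
# is an archival class with minutes-to-hours retrieval latency.
STORAGE_CLASSES = {
    "STANDARD":    {"availability": 99.99, "multi_az": True},
    "STANDARD_IA": {"availability": 99.9,  "multi_az": True},
    "ONEZONE_IA":  {"availability": 99.5,  "multi_az": False},
    "GLACIER":     {"availability": None,  "multi_az": True},
}

def multi_az_classes():
    """Classes that store data redundantly across availability zones."""
    return sorted(name for name, props in STORAGE_CLASSES.items()
                  if props["multi_az"])

print(multi_az_classes())   # ['GLACIER', 'STANDARD', 'STANDARD_IA']
```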
- Consistency model
First, all successful S3 changes are atomic. Operations on objects take time and there are propagation delays, but you will never see an intermediate, incomplete or corrupted version of an object.
S3 guarantees that update/overwrite and delete requests on objects are eventually consistent: you get either the old state or the new state of the object, never a mixture, regardless of where the object is stored or how the change propagates.
S3 also guarantees read-after-write consistency for new objects: you can retrieve an object immediately after creating it, with no waiting period to be sure it has been created and propagated. This does not mean that listing the bucket will immediately show the new object – bucket listings are eventually consistent – and reads after a delete or update are likewise eventually consistent.
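A toy model can make the distinction concrete. The sketch below is purely illustrative – it is not how S3 is implemented – but it shows the difference between read-after-write consistency for new keys and eventual consistency for overwrites, using a primary store and a lagging replica:

```python
class EventualStore:
    """Toy store with asynchronous replication, for illustration only."""

    def __init__(self):
        self.primary = {}
        self.replica = {}   # lags behind until propagate() runs
        self.pending = []

    def put(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def get(self, key):
        # Brand-new keys: read-after-write consistency (served from primary).
        if key not in self.replica:
            return self.primary.get(key)
        # Overwritten keys: reads may hit the stale replica for a while,
        # but always return a complete old or new value, never a mixture.
        return self.replica[key]

    def propagate(self):
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

s = EventualStore()
s.put("report.csv", "v1")
print(s.get("report.csv"))   # 'v1' — a new object is readable immediately
s.propagate()
s.put("report.csv", "v2")    # overwrite
print(s.get("report.csv"))   # 'v1' — stale until propagation completes
s.propagate()
print(s.get("report.csv"))   # 'v2'
```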
- Undelete and Recycle bin
Amazon S3 has a strong track record of reliability, and losing data stored in S3 is extremely unlikely. You may still want to protect yourself against accidental deletion, or need to recover a lost file; for that you need to activate some of the more advanced bucket features. With object versioning enabled, S3 keeps a copy of every version of an object, including deleted and overwritten ones.
To keep this cost effective, you usually want to retain only recent versions, or move older versions to a cheaper storage class. Lifecycle rules address this need: you can add rules that remove noncurrent versions after a specific time, or automatically transition them to a different storage class or archive them in Amazon Glacier.
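As an illustration, a lifecycle configuration like the following transitions noncurrent object versions to Glacier after 30 days and deletes them after a year (the rule ID and the day counts are arbitrary examples); it can be applied with `aws s3api put-bucket-lifecycle-configuration`:

```json
{
  "Rules": [
    {
      "ID": "manage-old-versions",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "NoncurrentVersionTransitions": [
        {"NoncurrentDays": 30, "StorageClass": "GLACIER"}
      ],
      "NoncurrentVersionExpiration": {"NoncurrentDays": 365}
    }
  ]
}
```

The empty `Prefix` filter applies the rule to every object in the bucket; a non-empty prefix would scope it to a subset of keys.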
- Security and Access control
S3 security has made plenty of headlines, so it is worth understanding in some depth. First, every S3 resource is private by default and secured for its owner – the AWS account that created it. Even the bucket owner cannot read (download) an object created in the bucket by another account without delegated permissions. That said, the publicized S3 security breaches have been the result of human error or malicious actors, not flaws in the service itself.
S3 is also among the few AWS services that support resource-based access policies: you can attach bucket policies and access control lists (ACLs) directly to your Amazon S3 resources. Access is denied by default and granted as the union of all applicable permissions, with an explicit deny overriding any allow.
Let’s take a short look into different access controls in S3:
- IAM (identity-based controls): This is the standard access control method in AWS. Permission policies (IAM policies) define which actions can or cannot be performed on AWS resources (in this case, S3 resources). These policies are attached to AWS identities (users, groups or roles) to create identity-based access controls.
- Bucket policies (resource-based controls): Permission policies, very similar to IAM policies, that define which actions can or cannot be performed on a bucket and all of its objects. There are two key differences from IAM policies: you attach the policy to an S3 bucket rather than to an AWS identity, and, because the policy is not attached to an identity, it must use the Principal element to define who is granted access (an AWS account, AWS service, IAM user, federated user, assumed-role user or other entity).
- ACLs (resource-based controls): Object and bucket access control lists allow fine-grained access control on a bucket and on each individual object within it. ACLs are configured via the S3 APIs or via the Permissions tab in the S3 section of the AWS console.
Note: S3 ACLs are the legacy access control method, and they identify an AWS account by its 64-character (256-bit) canonical user ID. You will need this ID to define cross-account access to S3 buckets. It is the long owner ID string returned by the ListAllMyBuckets API call or the equivalent AWS CLI command (aws s3api list-buckets).
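Returning to bucket policies: a minimal example shows the Principal element that distinguishes them from IAM policies. The account ID and bucket name below are placeholders; this policy grants the other account read access to every object in the bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadFromOtherAccount",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::example-bucket/*"
    }
  ]
}
```

An equivalent IAM policy would contain the same Action and Resource but no Principal, since the identity it is attached to is implicitly the principal.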
AWS S3 provides extremely scalable, durable and highly available storage with strong security provisions and cost efficiency. To get the best out of it, you need to build a solid understanding of its technical foundations and advanced features.
Amazon S3 FAQ page
Amazon S3 documentation