
Understanding AWS S3 Limitations and Performance


I’m in the process of planning out a large-scale system, and one of the things I find myself worrying about is obvious but still complex: storage. This is a data processing system where literally tens of millions of “objects” will be flowing through it, and one of the design criteria is the ability to take any “object” and re-process it from start to finish. Because this data is transitory in nature, we need to be able to store the original inputs to the system on an ongoing basis.

Note: By “object” I mean something as simple as a comment or as complex as a social media post or a web page.

Given that massive quantity of objects, the question becomes: how do you store tens of millions of variable-length objects? The solution we have come to is the AWS S3 storage architecture, but is S3 actually a valid choice?

Here’s the result of digging deeply into S3 as a storage solution:

S3 supports 3,500 write requests (PUT/POST/DELETE, i.e. new object creations) per second per prefix within a bucket. At that rate, a single prefix can take in 3,500 * 3,600 = 12,600,000 (12.6 million) new objects per hour. That’s a theoretical maximum, though, and there are always reasons why maximums don’t get achieved.
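As a sanity check, here’s the back-of-the-envelope math in Python; the 50 million figure is just an illustrative stand-in for “tens of millions” of objects, not a number from the actual design.

```python
# Back-of-the-envelope S3 write throughput math.
WRITES_PER_SECOND = 3_500           # documented S3 write rate per prefix
SECONDS_PER_HOUR = 3_600

writes_per_hour = WRITES_PER_SECOND * SECONDS_PER_HOUR
print(f"{writes_per_hour:,} new objects per hour")      # 12,600,000

# Illustrative target: how long to (re)ingest 50 million objects at that rate?
TARGET_OBJECTS = 50_000_000
print(f"{TARGET_OBJECTS / writes_per_hour:.1f} hours")  # ~4.0 hours at the theoretical max
```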

The implication is that if you have different types of content, you could use multiple buckets (or at least separate key prefixes), one per content type, to get better aggregate throughput.
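Here’s a minimal sketch of that routing with boto3; the bucket names and content types are hypothetical placeholders, not anything from the actual design.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical content-type-to-bucket mapping; each bucket gets its own
# request-rate headroom, so heavy ingest of one type can't starve another.
BUCKET_BY_TYPE = {
    "comment": "example-objects-comments",
    "post": "example-objects-posts",
    "web_page": "example-objects-web-pages",
}

def store_object(content_type: str, object_id: str, body: bytes) -> None:
    """Write each incoming object to the bucket for its content type."""
    s3.put_object(
        Bucket=BUCKET_BY_TYPE[content_type],
        Key=object_id,
        Body=body,
    )
```

A side benefit of splitting by content type is that each bucket can carry its own lifecycle rules and access policies, which keeps the configuration simple.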

Also, although Amazon has reduced the need for key name randomization (think: using hashes in object names) to get the best performance out of S3 operations, it likely still makes a difference.
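A minimal sketch of hash-based key naming, assuming a short hex prefix carved off an MD5 of the object ID; the prefix length and layout are my own illustrative choices, not something S3 mandates.

```python
import hashlib

def hashed_key(object_id: str) -> str:
    """Prefix the key with a short hash so writes spread across many key
    prefixes instead of piling up under a sequential or date-based one."""
    digest = hashlib.md5(object_id.encode("utf-8")).hexdigest()
    return f"{digest[:4]}/{object_id}"

# A 4-hex-character prefix gives up to 65,536 distinct prefixes for S3
# to partition across; the result looks something like "3f2a/comment-12345678".
print(hashed_key("comment-12345678"))
```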
