If you read the FAQ section for
Amazon's Simple Storage Service (S3) you'll notice it says you can store an
infinite amount of data in the cloud. I'm here to tell you that isn't entirely
the truth. Well, I'm sure the experienced among my readers will recognise the
hyperbole in the marketing, but the truth is that you can store an almost
infinite amount of data: you will never come close to the storage
limit of S3. But what is the ceiling for how much data you can store in a single
bucket? That is the question we're going to answer today.
This article turned into something I wasn't expecting in regard to its
length. I thought this was going to be a fun little problem, but it turns out
there's a lot more to it than I was expecting.
Me: 1 month ago.
Let's start with the basics: S3 is a cloud storage service from Amazon. S3 lets
you upload files as objects inside "buckets" with a high level of reliability
and availability. When S3 launched in 2006 the
original spec was very succinct,
"make malloc for the web", and that's exactly how it works. You can create a
bucket and upload objects up to a maximum size of 5 terabytes each. This is big
enough for the vast majority of use cases; in fact, I can't think of many file
formats that support sizes on this scale.
S3 has a hard limit on object keys: they can be at most 1024 bytes long, using
UTF-8 encoding. So any object we store in S3 must have a key that fits in this
size, and that key must be valid UTF-8. UTF-8 is an encoding standard that
can store any character ever defined, using one to four bytes per character.
For example, the character a is encoded as 0b0110_0001 and the
character ス is 0b11100011_10000010_10111001. The complete encoding scheme
can be found online, but the
basic gist is that English and other commonly used scripts get smaller
encodings while rarer characters get larger ones.
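If you want to see these encodings for yourself, a couple of lines of Python (a quick illustration, not from the original post) will print the raw UTF-8 bytes of any character:

```python
# Print each character's UTF-8 byte length and bit pattern.
for ch in ("a", "ス"):
    encoded = ch.encode("utf-8")
    bits = "_".join(f"{byte:08b}" for byte in encoded)
    print(f"{ch!r}: {len(encoded)} byte(s), 0b{bits}")
# 'a': 1 byte(s), 0b01100001
# 'ス': 3 byte(s), 0b11100011_10000010_10111001
```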
So it should be simple to calculate the number of different object keys that can
be made with UTF-8. Wikipedia says there are 1,112,064 valid characters in
Unicode, so there should be 1,112,064^1024 different key combinations for
S3 objects?¹ Well, not quite: because Unicode characters are encoded with one to
four bytes each, the actual number of valid Unicode
strings that fit in a 1024-byte key is smaller. How much smaller? Well, that
depends on the distribution of character widths.
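As a sanity check on that 1,112,064 figure, we can count the valid Unicode scalar values by their UTF-8 encoded length directly from the code-point ranges (a quick sketch; note these are all encodable values, so the 1920 two-byte figure is an upper bound on the smaller count of currently assigned two-byte characters we'll use later):

```python
# Valid Unicode scalar values grouped by UTF-8 encoded length.
ranges = {
    1: (0x0000, 0x007F),    # ASCII range
    2: (0x0080, 0x07FF),
    3: (0x0800, 0xFFFF),    # includes the surrogate range, removed below
    4: (0x10000, 0x10FFFF),
}
counts = {n: hi - lo + 1 for n, (lo, hi) in ranges.items()}
counts[3] -= 0xDFFF - 0xD800 + 1  # 2048 surrogates are not valid scalars
print(counts)                 # {1: 128, 2: 1920, 3: 61440, 4: 1048576}
print(sum(counts.values()))   # 1112064
```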
Unicode vs UTF-8?
This post is going to spiral into the complexity that is the Unicode spec for a
moment, so buckle up, because it's important to understand how we got here.
Unicode is a standard for encoding symbols with the goal of encoding all symbols
found in all writing systems,
even ones from 3000 years ago.
As I mentioned before, there are 1,112,064 characters currently defined in
Unicode. The focused readers among you might notice I've been talking about both
UTF-8 and Unicode; one could be mistaken for thinking these are the same thing,
but that's not the case: Unicode is not UTF-8. UTF-8 is an encoding that maps
Unicode characters to bytes. There have been other encodings for Unicode, such
as UTF-16, which is still widely used internally by systems like Windows and
Java, and UTF-1, which has been abandoned.
The important thing to remember is that Unicode groups characters into
blocks and larger regions called Planes;
these are used to categorise characters, and all of them can be encoded in UTF-8.
As usual Tom Scott has already done a video on the subject which you can watch
below that goes into Unicode's history far better than I could ever do.
But going back to the problem at hand: at first I assumed finding all the
Unicode values was going to be the hardest part, because I ran into a few
problems finding a complete character set in a usable format.
For the longest time I struggled to find a list or dataset of all the Unicode
characters (there are lists of characters in
pdf format on the Unicode website) in an
easy-to-parse format with the information I needed. I've since found a few
online.
To pull the data into a usable format I created a simple Python script that
pulls all the characters, table by table, from
fileformat.info.
Using BeautifulSoup the script scrapes one page of Unicode characters at a time
and appends them to a CSV file that can be queried.
From this list we can see the breakdown of character sizes with a
bit of data crunching in whatever language you like (I converted the UTF-8
encoding into a string and used its length to find the size of each
character).
I've saved the results of my web scrape to a file,
unicode.csv, so you don't have to run it yourself.
Getting into the Maths
So now that the encodings for all Unicode characters are known these values can
be used to calculate the number of string combinations.
A byte is made of 8 bits, so there are 2^8 = 256 different values a
byte can take. Since the key is 1024 bytes long, the total number of different
byte combinations is:

(2^8)^1024 = 256^1024

This is the upper limit for how many key combinations there can be; if we ever
calculate a value above this number, we know we've done something incorrectly.
This is a good sanity check for later verification.
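For a sense of scale, Python's arbitrary-precision integers can evaluate this upper limit directly (just an illustration):

```python
# The upper limit: every possible 1024-byte value.
upper = 256 ** 1024  # equivalently 2 ** 8192
print(len(str(upper)))  # 2467 decimal digits
```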
But we can do better than finding the upper limit in calculating the answer. We
should be able to get the exact number. But we'll need to build up a solution
from first principles.
Let's start with a simpler problem. How many combinations of 1-byte UTF-8
encoded values with a length of 1024 bytes are there?
From the table above we can see there are 128 characters using 1-byte encoding
(this includes zero-width characters such as null and carriage return, but I'll
cover this later). Some may think this is weird, because 1 byte can encode 256
symbols. The reason is that UTF-8 is backwards compatible with ASCII, a
7-bit encoding with 128 symbols. To keep this compatibility while supporting
variable-width encoding for larger symbols, 1-byte UTF-8 encodings start with a
leading zero bit.
So now that we know how many symbols are available, we can calculate how many
combinations there are for a given length. This is pretty straightforward if you
remember any high school statistics/probability: there are 128 choices for each
of 1024 positions, which can be written like so:

128 × 128 × … × 128 (1024 times) = 128^1024
When you do the math you get a big number...
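"Big" is an understatement; a couple of lines of Python shows just how big, and confirms it sits below the upper limit (illustration only):

```python
# All 1024-byte keys built only from 1-byte (ASCII-range) characters.
ascii_only = 128 ** 1024
print(len(str(ascii_only)))      # 2158 decimal digits
print(ascii_only < 256 ** 1024)  # True: below the upper limit
```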
But it's smaller than our upper limit, so we're on track. This is a good start,
but it only counts strings made entirely of 1-byte characters. We could repeat
this for the 2-4 byte characters, but you'll soon see there's a better way;
we're going to need to do some more work to find the full answer.
Let's look at combinations of 1-byte and 2-byte characters. There are 128 1-byte
characters and 1863 2-byte characters, and we need a formula for the number of
strings built from both. We should be able to use the same
general form as our previous answer, but what should the exponents
be? Let's start with a simple case where we have one 2-byte character
and the rest are 1 byte. Since our string is 1024 bytes long, the number of
1-byte characters will be 1024 − 2 = 1022:

128^1022 × 1863^1

Okay, that makes sense. What about two 2-byte characters? Well, 1024 − 4 = 1020,
so...

128^1020 × 1863^2
This seems to make intuitive sense. The rule is that the byte widths of all the
characters must always sum to 1024, so in this instance we can write a general
form:

128^i × 1863^j for all i, j such that i + 2j = 1024

Cool, so now we have a formula, but it doesn't give us a great way of iterating
over all the values. A summation operator will, though:

∑_{j=0}^{512} 128^{1024−2j} × 1863^j
This is starting to look good! The summation will represent all the different
combinations of 1-2 byte strings - or does it? There is one more thing we need
to add for this to give us all combinations.
If we think back to what this equation represents, each factor actually
corresponds to a position holding a particular character. As an example, think
about a string of two 1-byte characters and one 2-byte character:
128 × 128 × 1863 and 128 × 1863 × 128 equal
the same number, but they represent different arrangements of the same
characters. We need to include all the different orderings in
which each string can be built from the same characters.
Luckily this is a well-known problem, and there is a standard way of
solving it. The binomial coefficient C(n, k), read "n choose k", is a
compact way of counting how many ways we can choose k items from n;
this is exactly the problem we are dealing with! The binomial
coefficient can be computed using factorials:

C(n, k) = n! / (k! (n − k)!)
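Python ships this as math.comb, and we can check it against the factorial formula (a quick check, not from the original post):

```python
import math

# Verify n-choose-k via factorials matches the built-in math.comb.
n, k = 1024, 512
via_factorials = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
assert math.comb(n, k) == via_factorials
print(math.comb(5, 2))  # 10 ways to choose 2 items from 5
```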
This is great; it gives us a way to count arrangements, so now we
just have to figure out what n and k should be. n should be the
total number of characters in the string, which will not always be 1024 because
2-byte characters take up twice the space of 1-byte ones: with j 2-byte
characters the string holds 1024 − j characters in total. k is the number of
2-byte characters, j. Putting the binomial coefficient into our equation yields:

∑_{j=0}^{512} C(1024 − j, j) × 128^{1024−2j} × 1863^j
This sum represents the number of combinations of 1-2 byte unique strings that
can be made. We've finally arrived at an equation that will provide all the
combinations for a string using 1-2 byte UTF-8 characters!
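We can evaluate this sum with Python's big integers and cross-check it against a simple recurrence: the number of valid strings of n bytes, ways(n), satisfies ways(n) = 128·ways(n−1) + 1863·ways(n−2), since a string must end in either a 1-byte or a 2-byte character (a sketch of the idea, not the original post's code):

```python
import math

ONE_BYTE, TWO_BYTE = 128, 1863  # character counts from the scraped table
LENGTH = 1024

# Closed form: sum over j = the number of 2-byte characters in the string.
closed_form = sum(
    math.comb(LENGTH - j, j) * ONE_BYTE ** (LENGTH - 2 * j) * TWO_BYTE ** j
    for j in range(LENGTH // 2 + 1)
)

# Recurrence: an n-byte string ends in a 1-byte or a 2-byte character.
ways = [1, ONE_BYTE]
for n in range(2, LENGTH + 1):
    ways.append(ONE_BYTE * ways[n - 1] + TWO_BYTE * ways[n - 2])

assert closed_form == ways[LENGTH]
print(len(str(closed_form)))  # number of decimal digits
```

Both routes agree, which is a reassuring check that the binomial bookkeeping above is right.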
We can now use this general form and expand to include higher 3-4 byte symbols.
We can create a chain of binomial solutions and subtract the values of higher
order terms to find the number of combinations for 1-3 byte strings...
Now this is all looking a bit confusing, and the equation is long. This might be
a maths post, but nobody wants to deal with chains of binomials. While
researching this topic I found out about multinomial coefficients, which are
similar to binomials but generalised to give the coefficients for expressions
with more than two terms.
This is perfect for what we need; it's a natural extension of the chained
binomials seen above. Replacing the binomial chain, the simplified equation
looks like this:

∑_{i+2j+3k+4l=1024} (i + j + k + l)! / (i! j! k! l!) × 128^i × 1863^j × c₃^k × c₄^l

where i, j, k, l count the characters of each byte width and c₃ and c₄ are the
counts of 3-byte and 4-byte characters from the table.
A much cleaner equation! This is our final form; it counts every string that can
be made from Unicode characters encoded with UTF-8. Now, all we need to do is
convert this to a program that can give us an answer. I'm going to use Julia,
which is a bit of a cult favourite of mine. It's a language that excels at
expressing mathematical notation "as is" and gets used a lot in computational
mathematics for that reason.
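The same computation can be sketched in Python using the recurrence form of the multinomial sum: ways(n) = c₁·ways(n−1) + c₂·ways(n−2) + c₃·ways(n−3) + c₄·ways(n−4). The 1- and 2-byte counts below come from the article; the 3- and 4-byte counts are hypothetical placeholders you should replace with the values from your own unicode.csv scrape:

```python
# Characters per UTF-8 byte width. 128 and 1863 come from the article's
# table; COUNT_3B and COUNT_4B are PLACEHOLDERS, not real counts.
COUNT_1B, COUNT_2B = 128, 1863
COUNT_3B, COUNT_4B = 50_000, 40_000  # hypothetical placeholders
LENGTH = 1024

def key_combinations(counts, length):
    """Strings of exactly `length` bytes, where counts[w-1] characters
    are available at each byte width w (a dynamic-programming count)."""
    ways = [1] + [0] * length
    for n in range(1, length + 1):
        ways[n] = sum(
            c * ways[n - w]
            for w, c in enumerate(counts, start=1)
            if n - w >= 0
        )
    return ways[length]

total = key_combinations([COUNT_1B, COUNT_2B, COUNT_3B, COUNT_4B], LENGTH)
print(len(str(total)))  # digit count of the answer for these counts
```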
When you run the code, after a few minutes you'll arrive at the final
answer.
Damn, that's a big number. Confirming it's smaller than our upper limit, we can
be confident in the answer. At 2209 digits, this number represents the number of
unique strings of valid UTF-8 encoded Unicode characters that fit within 1024
bytes!
Fun Fact! - If we were to try and create an object for all the possible key
combinations in S3 at the current
request limit of 3500 requests/second
it would take longer than the expected life of the universe to do... even if
all the computers on Earth tried to write to a single bucket it wouldn't be
possible! So I don't think AWS has to worry about this limit anytime soon.
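Even using only the 1-byte-character lower bound of 128^1024 keys, the arithmetic is hopeless (a back-of-the-envelope check, assuming 3,500 PUT requests per second and a universe age of roughly 13.8 billion years):

```python
keys = 128 ** 1024          # lower bound on the number of possible keys
rate = 3_500                # assumed S3 PUT requests per second
universe_age_s = int(13.8e9 * 365.25 * 24 * 3600)  # ~4.35e17 seconds

seconds_needed = keys // rate
print(seconds_needed > universe_age_s)  # True, by thousands of orders of magnitude
```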
Null Character and String Length
I'm going to make a small admission here about how I've built the strings.
String termination is conventionally done with the null character,
0b0000_0000: when a program sees it, it knows it has reached the end of a
string. The 128 characters that use 1-byte encoding include this null
character. This means it's technically possible to have a key containing null
characters in arbitrary positions, such as 0b00000000_01111111, which is
technically valid Unicode but would most likely not be read properly by many
programs.
Null characters can also be used to represent all the string combinations of
less than 1024 bytes, as the string 0b01111111_00000000 reads the same as
0b0111_1111 to a null-terminated reader; so this is a simple way of covering
the shorter strings.
If this feels like a cheat, it's because it technically is: I don't know of any
system that will continue to read past a null character.
If you want to see a better way watch this space.
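To make the trick concrete, here's what that padding looks like in Python; a C-style, NUL-terminated reader stops at the first 0x00 byte (an illustration under that assumption):

```python
key = b"\x7f"                                # a 1-byte string
padded = key + b"\x00" * (1024 - len(key))   # pad to the full 1024 bytes
assert len(padded) == 1024
# A NUL-terminated reader stops at the first 0x00, recovering the original.
print(padded.split(b"\x00", 1)[0])  # b'\x7f'
```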
Closing Thoughts
What I thought was going to be a casual adventure into the S3 documentation and
Unicode spec turned into a month-long study of combinatorics and probability.
It's been an interesting ride, I've learnt some new things and I hope you have
too.
A big thank you to all the folks who have proofread this article!
Footnotes
¹ This number is stupidly big: it has 6192 digits. For context, there are about
10^82 atoms in the observable universe.
👋 I'm Robert, a Software Engineer from Melbourne, Australia. I write about a bunch of different topics including technology, science, business, and maths.