One common use of an operation known as map-reduce (which I mentioned in chapters
1 and 6) is calculating unique visitors to a website. Rather than waiting until the
end of the day to perform that calculation, we could instead keep a live updated count
of unique visitors as the day goes on. One method to calculate unique visitors in Redis
would use a SET, but a single SET storing many unique visitors would be very large.
In this section, we’ll shard SETs as a way of building a method to count unique visitors
to a website.
To start, we’ll assume that every visitor already has a unique identifier similar to
the UUIDs that we generated in chapter 2 for our login session cookies. Though we
could use these UUIDs directly in our SET as members and as keys to shard using our
sharding function from section 9.2.1, we’d lose the benefit of the intset encoding.
Assuming that we generated our UUIDs randomly (as we’ve done in previous chapters),
we could instead use just the first 14 hexadecimal digits (56 bits) of each UUID,
converted to a number, as our SET members.
This should bring up two questions: First, why would we want to do this? And second,
why is this enough?
For the first question (why we’d want to do this), UUIDs are basically 128-bit numbers
that have been formatted in an easy-to-read way. If we were to store them, we’d be
storing roughly 16 bytes (or 36 if we stored them as-is) per unique visitor. But by only
storing the first 14 hexadecimal digits3 turned into a number, we'd only be storing 8
bytes per unique visitor. So we save space up front, which we may be able to use later
for other problems. This also lets us use the intset optimization to keep memory use down.
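The truncation itself is a one-liner. Here's a minimal sketch (the helper name short_id is ours, not from the book's listings):

```python
import uuid

def short_id(session_uuid):
    # strip the hyphens, keep the first 14 hex digits (56 bits),
    # and turn them into an integer suitable for an intset
    return int(session_uuid.replace('-', '')[:14], 16)

# any randomly generated UUID works as input
sid = short_id(str(uuid.uuid4()))
```

Because the result always fits in 56 bits, it's well within the signed 64-bit range that intsets can store.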
For the second question (why this is enough), it boils down to what are called birthday
collisions. Put simply: What are the chances of two 128-bit random identifiers matching
in the first 56 bits? Mathematically, we can calculate the chances exactly, and as long
as we have fewer than 250 million unique visitors in a given time period (a day in our
case), we'll have at most a 1% chance of a single match (so if we have 250 million
visitors every day, roughly once every 100 days a single visitor will go uncounted). If we
have fewer than 25 million unique visitors, then the chance of not counting a user falls
to the point where we’d need to run the site for roughly 2,739 years before we’d miss
counting a single user.
Now that we’ve decided to use the first 56 bits from the UUID, we’ll build a sharded
SADD function, which we’ll use as part of a larger bit of code to actually count unique
visitors. This sharded SADD function in listing 9.10 will use the same shard key calculation
that we used in section 9.2.1, modified to prefix our numeric ID with a nonnumeric
character for shard ID calculation, since our 56-bit IDs aren’t densely packed
(as is the assumption for numeric IDs).
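Listing 9.10 isn't reproduced here, but a sketch of the idea looks like the following. The shard_key() function approximates the one described in section 9.2.1 — numeric keys shard by value range, everything else is hashed into a fixed number of shards — so treat this as an illustration rather than the book's exact code:

```python
import binascii

def shard_key(base, key, total_elements, shard_size):
    # numeric keys are assumed densely packed, so shard by value range;
    # everything else is hashed into a fixed number of shards
    if isinstance(key, int) or (isinstance(key, str) and key.isdigit()):
        shard_id = int(key) // shard_size
    else:
        shards = 2 * total_elements // shard_size
        shard_id = binascii.crc32(str(key).encode()) % shards
    return '%s:%s' % (base, shard_id)

def shard_sadd(conn, base, member, total_elements, shard_size):
    # prefix the ID with a nonnumeric character so shard_key() hashes it,
    # instead of treating our sparse 56-bit IDs as densely packed integers
    shard = shard_key(base, 'x' + str(member), total_elements, shard_size)
    return conn.sadd(shard, str(member))
```

Here conn is a redis-py client; SADD returns 1 when the member is new, which is what lets a caller detect a first visit.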
With a sharded SADD function, we can now keep unique visitor counts. When we want
to count a visitor, we’ll first calculate their shorter ID based on the first 56 bits of their
session UUID. We’ll then determine today’s date and add the ID to the sharded unique
visitor SET for today. If the ID wasn’t already in the SET, we’ll increment today’s unique
visitor count. Our code for keeping track of the unique visitor count can be seen in
the following listing.
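Since that listing isn't reproduced here, a self-contained sketch of the flow might look like this. SHARD_SIZE, the trivial get_expected(), and the compact shard_sadd() are placeholders repeated so the sketch runs on its own — the real get_expected() derives its value from yesterday's count, as the text describes next:

```python
import binascii
from datetime import date

SHARD_SIZE = 512  # members per intset shard; a tuning assumption

def shard_sadd(conn, base, member, total_elements, shard_size):
    # hash the ID (with a nonnumeric prefix) into one of a fixed set of shards
    shards = 2 * total_elements // shard_size
    shard = '%s:%s' % (base, binascii.crc32(('x%d' % member).encode()) % shards)
    return conn.sadd(shard, str(member))

def get_expected(conn, key, today):
    return 2 ** 21  # placeholder; the real version adapts to daily traffic

def count_visit(conn, session_uuid):
    today = date.today()
    key = 'unique:%s' % today.isoformat()
    expected = get_expected(conn, key, today)
    member = int(session_uuid.replace('-', '')[:14], 16)  # first 56 bits
    if shard_sadd(conn, key, member, expected, SHARD_SIZE):
        conn.incr(key)  # a new member means a new unique visitor
```

The per-day counter lives at unique:YYYY-MM-DD, while the sharded SETs live under keys derived from it by shard_key().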
That function works exactly as described, though you’ll notice that we make a call to
get_expected() to determine the number of expected daily visitors. We do this
because web page visits tend to change over time, so a fixed shard count wouldn't
grow with our traffic (or shrink if we have significantly fewer than a million
unique visitors daily).
To address the daily change in expected viewers, we’ll write a function that calculates
a new expected number of unique visitors for each day, based on yesterday’s count.
We’ll calculate this once for any given day, estimating that today will see at least 50%
more visitors than yesterday, rounded up to the next power of 2. Our code for calculating
this can be seen next.
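A sketch of that calculation follows, assuming a redis-py-style conn with get() and setnx(); the EXPECTED cache dict and the DAILY_EXPECTED default are our placeholder names, not necessarily the book's:

```python
import math
from datetime import timedelta

DAILY_EXPECTED = 1000000   # fallback when we have no count for yesterday
EXPECTED = {}              # per-process cache: compute the value once per day

def get_expected(conn, key, today):
    if key in EXPECTED:
        return EXPECTED[key]
    exkey = key + ':expected'
    expected = conn.get(exkey)
    if not expected:
        yesterday = (today - timedelta(days=1)).isoformat()
        expected = int(conn.get('unique:%s' % yesterday) or DAILY_EXPECTED)
        # assume 50% growth over yesterday, rounded up to the next power of 2
        expected = 2 ** int(math.ceil(math.log(expected * 1.5, 2)))
        if not conn.setnx(exkey, expected):
            # another client set it first; use their value instead
            expected = conn.get(exkey)
    EXPECTED[key] = int(expected)
    return EXPECTED[key]
```

SETNX makes the daily calculation race-free: whichever client writes the expected value first wins, and everyone else reads that value back.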
Most of that function is reading and massaging data in one way or another, but the overall
result is that we calculate an expected number of unique views for today by taking
yesterday’s view count, increasing it by 50%, and rounding up to the next power of 2.
If the expected number of views for today has already been calculated, we’ll use that.
Using this exact code to add 1 million unique visitors, Redis will use approximately
9.5 megabytes to store the unique visitor data. Without sharding, Redis
would use 56 megabytes to store the same data (56-bit integer IDs in a single SET).
That's an 83% reduction in storage, which would let us store almost 6 times
as much data on the same hardware.
Exercise: Filling out the sharded SET API
For this example, we only needed a single SET command to determine the unique visitor
count for a given day. Can you add sharded SREM and SISMEMBER calls? Bonus
points: Assuming that you have two sharded SETs with the same expected total number
of items, as well as the same shard size, you'll have the same number of shards,
and identical IDs will map to the same shards. Can you add sharded versions of SINTERSTORE,
SUNIONSTORE, and SDIFFSTORE?
OTHER METHODS TO CALCULATE UNIQUE VISITOR COUNTS
If you have numeric visitor IDs (instead of UUIDs), and those IDs have a relatively
low maximum value, then rather than storing your visitor information as sharded
SETs, you can
store them as bitmaps using techniques similar to what we describe in the
next section. A Python library for calculating unique visitor counts and other
interesting analytics based on bitmaps can be found at https://github.com/Doist/bitmapist.
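In Redis this amounts to SETBIT to mark a visitor and BITCOUNT to total them; a pure-Python sketch of the underlying idea (the helper names are ours) looks like this:

```python
def record_visit(bitmap, visitor_id):
    # set the bit at offset visitor_id, growing the buffer as needed;
    # Redis's SETBIT numbers bits from the most significant bit of each byte
    byte, bit = divmod(visitor_id, 8)
    if byte >= len(bitmap):
        bitmap.extend(b'\x00' * (byte + 1 - len(bitmap)))
    bitmap[byte] |= 0x80 >> bit

def unique_count(bitmap):
    # the equivalent of BITCOUNT: total number of set bits
    return sum(bin(byte).count('1') for byte in bitmap)

visitors = bytearray()
for visitor_id in (0, 9, 9, 1000):
    record_visit(visitors, visitor_id)
```

Each visitor costs a single bit, so this only pays off when IDs are small and dense; a lone visitor with ID 1,000,000 still forces a 125KB bitmap.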
After sharding large SETs of integers to reduce storage, it’s now time to learn how to
pack bits and bytes into STRINGs.
3 Another good question is why we use 56 and not 64 bits. That's because Redis will only use intsets for values that fit in a 64-bit signed integer, and the extra work of turning our 64-bit unsigned integer into a signed integer isn't worth it in most situations. If you need the extra precision, check out the Python struct module and look at the Q and q format codes.