# 9.2.2 SETs

One common use of an operation known as map-reduce (which I mentioned in chapters

1 and 6) is calculating unique visitors to a website. Rather than waiting until the

end of the day to perform that calculation, we could instead keep a live updated count

of unique visitors as the day goes on. One method to calculate unique visitors in Redis

would use a SET, but a single SET storing many unique visitors would be very large.

In this section, we’ll shard SETs as a way of building a method to count unique visitors

to a website.

To start, we’ll assume that every visitor already has a unique identifier similar to

the UUIDs that we generated in chapter 2 for our login session cookies. Though we

could use these UUIDs directly in our SET as members and as keys to shard using our

sharding function from section 9.2.1, we’d lose the benefit of the intset encoding.

Assuming that we generated our UUIDs randomly (as we’ve done in previous chapters),

we could instead use the first 15 hexadecimal digits from the UUID as a full key.

This should bring up two questions: First, why would we want to do this? And second,

why is this enough?

For the first question (why we’d want to do this), UUIDs are basically 128-bit numbers

that have been formatted in an easy-to-read way. If we were to store them, we’d be

storing roughly 16 bytes (or 36 if we stored them as-is) per unique visitor. But by only

storing the first 15 hexadecimal digits3 turned into a number, we’d only be storing 8

bytes per unique visitor. So we save space up front, which we may be able to use later

for other problems. This also lets us use the intset optimization for keeping memory

use down.

For the second question (why this is enough), it boils down to what are called *birthday
collisions*. Put simply: What are the chances of two 128-bit random identifiers matching

in the first 56 bits? Mathematically, we can calculate the chances exactly, and as long

as we have fewer than 250 million unique visitors in a given time period (a day in our

case), we’ll have at most a 1% chance of a single match (so if every day we have 250 million

visitors, about once every 100 days we’ll have about 1 person not counted). If we

have fewer than 25 million unique visitors, then the chance of not counting a user falls

to the point where we’d need to run the site for roughly 2,739 years before we’d miss

counting a single user.

Now that we’ve decided to use the first 56 bits from the UUID, we’ll build a sharded

SADD function, which we’ll use as part of a larger bit of code to actually count unique

visitors. This sharded SADD function in listing 9.10 will use the same shard key calculation

that we used in section 9.2.1, modified to prefix our numeric ID with a nonnumeric

character for shard ID calculation, since our 56-bit IDs aren’t densely packed

(as is the assumption for numeric IDs).

With a sharded SADD function, we can now keep unique visitor counts. When we want

to count a visitor, we’ll first calculate their shorter ID based on the first 56 bits of their

session UUID. We’ll then determine today’s date and add the ID to the sharded unique

visitor SET for today. If the ID wasn’t already in the SET, we’ll increment today’s unique

visitor count. Our code for keeping track of the unique visitor count can be seen in

the following listing.

That function works exactly as described, though you’ll notice that we make a call to

get_expected() to determine the number of expected daily visitors. We do this

because web page visits will tend to change over time, and keeping the same number

of shards every day wouldn’t grow as we grow (or shrink if we have significantly fewer

than a million unique visitors daily).

To address the daily change in expected viewers, we’ll write a function that calculates

a new expected number of unique visitors for each day, based on yesterday’s count.

We’ll calculate this once for any given day, estimating that today will see at least 50%

more visitors than yesterday, rounded up to the next power of 2. Our code for calculating

this can be seen next.

Most of that function is reading and massaging data in one way or another, but the overall

result is that we calculate an expected number of unique views for today by taking

yesterday’s view count, increasing it by 50%, and rounding up to the next power of 2.

If the expected number of views for today has already been calculated, we’ll use that.

Taking this exact code and adding 1 million unique visitors, Redis will use approximately

9.5 megabytes to store the unique visitor count. Without sharding, Redis

would use 56 megabytes to store the same data (56-bit integer IDs in a single SET).

That’s an 83% reduction in storage with sharding, which would let us store 5.75 times

as much data with the same hardware.

# Exercise: Filling out the sharded SET API

For this example, we only needed a single SET command to determine the unique visitor

count for a given day. Can you add sharded SREM and SISMEMBER calls? Bonus

points: Assuming that you have two sharded SETs with the same expected total number

of items, as well as the same shard size, you’ll have the same number of shards,

and identical IDs will be in the same shard IDs. Can you add sharded versions of SINTERSTORE,

SUNIONSTORE, and SDIFFSTORE?

OTHER METHODS TO CALCULATE UNIQUE VISITOR COUNTSIf you have numeric

visitor IDs (instead of UUIDs), and the visitor IDs have relatively low maximum

value, rather than storing your visitor information as sharded SETs, you can

store them as bitmaps using techniques similar to what we describe in the

next section. A Python library for calculating unique visitor counts and other

interesting analytics based on bitmaps can be found at https://github.com/Doist/bitmapist.

After sharding large SETs of integers to reduce storage, it’s now time to learn how to

pack bits and bytes into STRINGs.

^{3} Another good question is why 56 and not 64 bits? That’s because Redis will only use intsets for up to 64-bit signed integers, and the extra work of turning our 64-bit unsigned integer into a signed integer isn’t worth it in most situations. If you need the extra precision, check out the Python struct module and look at the Q and q format codes.