EBOOK – REDIS IN ACTION

This book covers the use of Redis, an in-memory database/data structure server.

open all | close all

9.3.2 Storing packed data

After we have our packed location codes, we only need to store them in STRINGs with
SETRANGE. But before we do so, we have to think for a moment about how many users
we’re going to be storing information about. For example, suppose that Twitter has 750
million users today (based on the observation that recently created users have IDs
greater than 750 million); we’d need over 1.5 gigabytes of space to store location information
for all Twitter users. Though most operating systems are able to reasonably allocate
large regions of memory, Redis limits us to 512 megabyte STRINGs, and due to
Redis’s clearing out of data when setting a value beyond the end of an existing STRING,
setting the first value at the end of a long STRING will take more time than would be
expected for a simple SETBIT call. Instead, we can use a technique similar to what we
used in section 9.2.1, and shard our data across a collection of STRINGs.

Unlike when we were sharding HASHes and SETs, we don’t have to worry about
being efficient by keeping our shards smaller than a few thousand elements, since we
can access an element directly without decoding any others. Similarly, we can write to
a given offset efficiently. Instead, our concern is more along the lines of being efficient
at a larger scale—specifically what will balance potential memory fragmentation,
as well as minimize the number of keys that are necessary. For this example, we’ll store
location information for 220 users (just over 1 million entries) per STRING, which will
use about 2 megabytes per STRING. In the next listing, we see the code for updating
location information for a user.

Listing 9.15A function for storing location data in sharded STRINGs
USERS_PER_SHARD = 2**20

Set the size of each shard.

def set_location(conn, user_id, country, state):
   code = get_code(country, state)

Get the location code to store for the user.

   shard_id, position = divmod(user_id, USERS_PER_SHARD)

Find the shard ID and position of the user in the specific shard.

   offset = position * 2

Calculate the offset of the user’s data.

   pipe = conn.pipeline(False)
   pipe.setrange('location:%s'%shard_id, offset, code)

Set the value in the proper sharded location table.

   tkey = str(uuid.uuid4())
   pipe.zadd(tkey, 'max', user_id)
   pipe.zunionstore('location:max',
      [tkey, 'location:max'], aggregate='max')
   pipe.delete(tkey)

Update a ZSET that stores the maximum user ID seen so far.

   pipe.execute()

For the most part, there shouldn’t be anything surprising there. We calculate the location
code to store for a user, calculate the shard and the individual shard offset for the
user, and then store the location code in the proper location for the user. The only thing that’s strange and may not seem necessary is that we also update a ZSET that stores the highest-numbered user ID that has been seen. This is primarily important when calculating
aggregates over everyone we have information about (so we know when to stop).