
9.3.2 Storing packed data

After we have our packed location codes, we need to store them in STRINGs with SETRANGE. But before we do so, we have to think for a moment about how many users we’re going to be storing information about. For example, suppose that Twitter has 750 million users today (based on the observation that recently created users have IDs greater than 750 million); at 2 bytes per user, we’d need about 1.5 gigabytes of space to store location information for all Twitter users. Though most operating systems can reasonably allocate large regions of memory, Redis limits STRINGs to 512 megabytes, and because Redis zero-fills the gap when we write beyond the end of an existing STRING, setting the first value near the end of a long STRING will take more time than would be expected for a simple SETBIT call. Instead, we can use a technique similar to what we used in section 9.2.1, and shard our data across a collection of STRINGs.
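As a quick back-of-the-envelope check of those numbers (a sketch, not part of the book’s listings; the 2 bytes per user come from the packed country/state codes built earlier in this section):

users = 750 * 10**6                       # assumed user count from the estimate above
bytes_per_user = 2                        # one byte of country code plus one byte of state code
total_bytes = users * bytes_per_user
print(total_bytes)                        # 1,500,000,000 bytes, about 1.5 gigabytes
print(total_bytes / float(512 * 2**20))   # roughly 2.8 times the 512 megabyte STRING limit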

Unlike when we were sharding HASHes and SETs, we don’t have to worry about keeping our shards down to a few thousand elements for efficiency, since we can read or write a single element directly at a known offset without decoding any of the others. Instead, our concern is efficiency at a larger scale: specifically, choosing a shard size that balances potential memory fragmentation against minimizing the number of keys that are necessary. For this example, we’ll store location information for 2**20 users (just over 1 million entries) per STRING, which at 2 bytes per user works out to about 2 megabytes per STRING. In the next listing, we see the code for updating location information for a user.

Listing 9.15 A function for storing location data in sharded STRINGs

import uuid

# Set the size of each shard.
USERS_PER_SHARD = 2**20

def set_location(conn, user_id, country, state):
    # Get the location code to store for the user.
    code = get_code(country, state)

    # Find the shard ID and the user's position within that shard.
    shard_id, position = divmod(user_id, USERS_PER_SHARD)
    # Calculate the offset of the user's data.
    offset = position * 2

    pipe = conn.pipeline(False)
    # Set the value in the proper sharded location table.
    pipe.setrange('location:%s'%shard_id, offset, code)

    # Update a ZSET that stores the maximum user ID seen so far
    # (member 'max' with score user_id, in the older redis-py argument order).
    tkey = str(uuid.uuid4())
    pipe.zadd(tkey, 'max', user_id)
    pipe.zunionstore('location:max',
        [tkey, 'location:max'], aggregate='max')
    pipe.delete(tkey)

    pipe.execute()

For the most part, there shouldn’t be anything surprising there. We calculate the location code to store for the user, calculate the shard and the offset within that shard, and then store the location code in the proper place for that user. The only piece that may seem unnecessary is that we also update a ZSET that stores the highest-numbered user ID seen so far. That maximum is primarily important when calculating aggregates over everyone we have information about, so we know when to stop.
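To round out the example, here is a small companion sketch (not from the book’s listings; get_location_code and max_user_id are hypothetical helpers) showing how the sharded data and the location:max ZSET written by set_location could be read back:

def get_location_code(conn, user_id):
    # Mirror set_location's layout: find the shard and the 2-byte slice for this user.
    shard_id, position = divmod(user_id, USERS_PER_SHARD)
    offset = position * 2
    # GETRANGE uses an inclusive end index, so this returns exactly 2 bytes
    # (an empty or all-zero result means no location has been stored for the user).
    return conn.getrange('location:%s' % shard_id, offset, offset + 1)

def max_user_id(conn):
    # The highest user ID seen so far is kept as the score of the 'max' member.
    score = conn.zscore('location:max', 'max')
    return int(score) if score is not None else 0

With the maximum user ID in hand, an aggregate pass only needs to walk shards 0 through max_user_id(conn) // USERS_PER_SHARD, rather than guessing how many shard STRINGs exist.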