This book covers the use of Redis, an in-memory database/data structure server.

open all | close all

9.2.1 HASHes

One of the primary uses for HASHes is storing simple key/value pairs in a grouped fashion.
Back in section 5.3, we developed a method of mapping IP addresses to locations
around the world. In addition to a ZSET that mapped from IP addresses to city IDs, we
used a single HASH that mapped from city IDs to information about the city itself. That
HASH had more than 370,000 entries using the August 2012 version of the database,
which we’ll now shard.

To shard a HASH table, we need to choose a method of partitioning our data.
Because HASHes themselves have keys, we can use those keys as a source of information
to partition the keys. For partitioning our keys, we’ll generally calculate a hash function
on the key itself that will produce a number. Depending on how many keys we
want to fit in a single shard and how many total keys we need to store, we’ll calculate
the number of shards we need, and use that along with our hash value to determine
the shard ID that the data will be stored in.

For numeric keys, we’ll assume that the keys will be more or less sequential and tightly
packed, and will assign them to a shard ID based on their numeric key value (keeping
numerically similar keys in the same shard). The next listing shows our function for calculating
a new key for a sharded HASH, given the base key and the HASH key HASH.

Listing 9.7A function to calculate a shard key from a base key and a secondary entry key
def shard_key(base, key, total_elements, shard_size):

We’ll call the shard_key() function with a base HASH name,
along with the key to be stored in the sharded HASH, the total
number of expected elements, and the desired shard size.

   if isinstance(key, (int, long)) or key.isdigit():

If the value is an integer or a string that looks like an integer, we’ll use it directly to calculate the shard ID.

      shard_id = int(str(key), 10) // shard_size

For integers, we assume they are sequentially assigned IDs, so we can choose a shard ID based on the upper “bits” of the numeric ID itself. We also use an explicit base here (necessitating the str() call) so that a key of 010 turns into 10, and not 8.

      shards = 2 * total_elements // shard_size

For non-integer keys, we first calculate the total number of shards desired, based on an expected total number of elements and desired shard size.

      shard_id = binascii.crc32(key) % shards

When we know the number of shards, we hash the key modulo the number of shards.

   return "%s:%s"%(base, shard_id)

Finally, we combine the base key with the shard ID we calculated to determine the shard key.

In our function, you’ll notice that for non-numeric keys we calculate a CRC32 checksum.
We’re using CRC32 in this case because it returns a simple integer without additional
work, is fast to calculate (much faster than MD5 or SHA1 hashes), and because
it’ll work well enough for most situations.

BEING CONSISTENT ABOUT total_elements AND shard_sizeWhen using nonnumeric
keys to shard on, you’ll notice that we use the total_elements value
to calculate the total number of shards necessary, in addition to the
shard_size that’s used for both numeric and non-numeric keys. These two
pieces of information are necessary to keep the total number of shards down.
If you were to change either of these numbers, then the number of shards
(and thus the shard that any piece of data goes to) will change. Whenever
possible, you shouldn’t change either of these values, or when you do change
them, you should have a process for moving your data from the old data
shards to the new data shards (this is generally known as resharding).

We’ll now use our shard_key() to pick shards as part of two functions that will work
like HSET and HGET on sharded hashes in the following listing.

Listing 9.8Sharded HSET and HGET functions
def shard_hset(conn, base, key, value, total_elements, shard_size):
   shard = shard_key(base, key, total_elements, shard_size)

Calculate the shard to store our value in.

   return conn.hset(shard, key, value)

Set the value in the shard.

def shard_hget(conn, base, key, total_elements, shard_size):
   shard = shard_key(base, key, total_elements, shard_size)

Calculate the shard to fetch our value from.

   return conn.hget(shard, key)

Get the value in the shard.

Nothing too complicated there; we’re finding the proper location for the data to be
stored or fetched from the HASH and either setting or getting the values. To update
our earlier IP-address-to-city lookup calls, we only need to replace one call in each of
two functions. The next listing shows just those parts of our earlier functions that
need to be updated.

Listing 9.9Sharded IP lookup functions
TOTAL_SIZE = 320000

We set the arguments for the sharded calls as global constants to ensure that we always pass the same information.

def import_cities_to_redis(conn, filename):
   for row in csv.reader(open(filename)):
      shard_hset(conn, 'cityid2city:', city_id,
         json.dumps([city, region, country]),

To set the data, we need to pass the TOTAL_SIZE and SHARD_SIZE information, though in this case TOTAL_SIZE is unused because our IDs are numeric.

def find_city_by_ip(conn, ip_address):
   data = shard_hget(conn, 'cityid2city:', city_id,

To fetch the data, we need to use the same information for TOTAL_SIZE and SHARD_SIZE for general sharded keys.

   return json.loads(data)

On a 64-bit machine, storing the single HASH of all of our cities takes up roughly 44
megabytes. But with these few small changes to shard our data, setting hash-maxziplist-
to 1024 and hash-max-ziplist-value to 256 (the longest city/
country name in the list is a bit over 150 characters), the sharded HASHes together take
up roughly 12 megabytes. That’s a 70% reduction in data size, which would allow us to
store 3.5 times as much data as before. For shorter keys and values, you can potentially
see even greater percentage savings (overhead is larger relative to actual data stored).

STORING STRINGS IN HASHESIf you find yourself storing a lot of relatively
short strings or numbers as plain STRING values with consistently named keys
like namespace:id, you can store those values in sharded HASHes for significant
memory reduction in some cases.

Exercise: Adding other operations

As you saw, getting and setting values in a sharded HASH is easy. Can you add support
for sharded HDEL, HINCRBY, and HINCRBYFLOAT operations?

We’ve just finished sharding large hashes to reduce their memory use. Next, you’ll
learn how to shard SETs.