One of the primary uses for HASHes is storing simple key/value pairs in a grouped fashion.
Back in section 5.3, we developed a method of mapping IP addresses to locations
around the world. In addition to a ZSET that mapped from IP addresses to city IDs, we
used a single HASH that mapped from city IDs to information about the city itself. That
HASH had more than 370,000 entries using the August 2012 version of the database,
which we’ll now shard.
To shard a HASH table, we need to choose a method of partitioning our data.
Because HASHes themselves have keys, we can use those keys as a source of information
to partition the keys. For partitioning our keys, we’ll generally calculate a hash function
on the key itself that will produce a number. Depending on how many keys we
want to fit in a single shard and how many total keys we need to store, we’ll calculate
the number of shards we need, and use that along with our hash value to determine
the shard ID that the data will be stored in.
For numeric keys, we’ll assume that the keys will be more or less sequential and tightly
packed, and will assign them to a shard ID based on their numeric key value (keeping
numerically similar keys in the same shard). The next listing shows our function for calculating
a new key for a sharded HASH, given the base key and the HASH key HASH.
In our function, you’ll notice that for non-numeric keys we calculate a CRC32 checksum.
We’re using CRC32 in this case because it returns a simple integer without additional
work, is fast to calculate (much faster than MD5 or SHA1 hashes), and because
it’ll work well enough for most situations.
BEING CONSISTENT ABOUT total_elements AND shard_sizeWhen using nonnumeric
keys to shard on, you’ll notice that we use the total_elements value
to calculate the total number of shards necessary, in addition to the
shard_size that’s used for both numeric and non-numeric keys. These two
pieces of information are necessary to keep the total number of shards down.
If you were to change either of these numbers, then the number of shards
(and thus the shard that any piece of data goes to) will change. Whenever
possible, you shouldn’t change either of these values, or when you do change
them, you should have a process for moving your data from the old data
shards to the new data shards (this is generally known as resharding).
We’ll now use our shard_key() to pick shards as part of two functions that will work
like HSET and HGET on sharded hashes in the following listing.
Nothing too complicated there; we’re finding the proper location for the data to be
stored or fetched from the HASH and either setting or getting the values. To update
our earlier IP-address-to-city lookup calls, we only need to replace one call in each of
two functions. The next listing shows just those parts of our earlier functions that
need to be updated.
On a 64-bit machine, storing the single HASH of all of our cities takes up roughly 44
megabytes. But with these few small changes to shard our data, setting hash-maxziplist-
entries to 1024 and hash-max-ziplist-value to 256 (the longest city/
country name in the list is a bit over 150 characters), the sharded HASHes together take
up roughly 12 megabytes. That’s a 70% reduction in data size, which would allow us to
store 3.5 times as much data as before. For shorter keys and values, you can potentially
see even greater percentage savings (overhead is larger relative to actual data stored).
STORING STRINGS IN HASHESIf you find yourself storing a lot of relatively
short strings or numbers as plain STRING values with consistently named keys
like namespace:id, you can store those values in sharded HASHes for significant
memory reduction in some cases.
As you saw, getting and setting values in a sharded HASH is easy. Can you add support
for sharded HDEL, HINCRBY, and HINCRBYFLOAT operations?
We’ve just finished sharding large hashes to reduce their memory use. Next, you’ll
learn how to shard SETs.