EBOOK – REDIS IN ACTION

This book covers the use of Redis, an in-memory database/data structure server.

  • Foreword
  • Preface
  • Acknowledgments
  • About this Book
  • About the Cover Illustration
  • Part 1: Getting Started
  • Part 2: Core concepts
  • Part 3: Next steps
  • Appendix A
  • Appendix B
  • Buy the paperback

    9.2.1 HASHes

    One of the primary uses for HASHes is storing simple key/value pairs in a grouped fashion.
    Back in section 5.3, we developed a method of mapping IP addresses to locations
    around the world. In addition to a ZSET that mapped from IP addresses to city IDs, we
    used a single HASH that mapped from city IDs to information about the city itself. That
    HASH had more than 370,000 entries using the August 2012 version of the database,
    which we’ll now shard.

    To shard a HASH table, we need to choose a method of partitioning our data.
    Because HASHes themselves have keys, we can use those keys as a source of information
    to partition the keys. For partitioning our keys, we’ll generally calculate a hash function
    on the key itself that will produce a number. Depending on how many keys we
    want to fit in a single shard and how many total keys we need to store, we’ll calculate
    the number of shards we need, and use that along with our hash value to determine
    the shard ID that the data will be stored in.

    For numeric keys, we’ll assume that the keys will be more or less sequential and tightly
    packed, and will assign them to a shard ID based on their numeric key value (keeping
    numerically similar keys in the same shard). The next listing shows our function for calculating
    a new key for a sharded HASH, given the base key and the HASH key HASH.

    Listing 9.7A function to calculate a shard key from a base key and a secondary entry key
    def shard_key(base, key, total_elements, shard_size):
    

    We’ll call the shard_key() function with a base HASH name,
    along with the key to be stored in the sharded HASH, the total
    number of expected elements, and the desired shard size.

       if isinstance(key, (int, long)) or key.isdigit():
    

    If the value is an integer or a string that looks like an integer, we’ll use it directly to calculate the shard ID.

          shard_id = int(str(key), 10) // shard_size
    

    For integers, we assume they are sequentially assigned IDs, so we can choose a shard ID based on the upper “bits” of the numeric ID itself. We also use an explicit base here (necessitating the str() call) so that a key of 010 turns into 10, and not 8.

       else:
    
          shards = 2 * total_elements // shard_size
    

    For non-integer keys, we first calculate the total number of shards desired, based on an expected total number of elements and desired shard size.

          shard_id = binascii.crc32(key) % shards
    

    When we know the number of shards, we hash the key modulo the number of shards.

       return "%s:%s"%(base, shard_id)
    

    Finally, we combine the base key with the shard ID we calculated to determine the shard key.

    In our function, you’ll notice that for non-numeric keys we calculate a CRC32 checksum.
    We’re using CRC32 in this case because it returns a simple integer without additional
    work, is fast to calculate (much faster than MD5 or SHA1 hashes), and because
    it’ll work well enough for most situations.

    BEING CONSISTENT ABOUT total_elements AND shard_sizeWhen using nonnumeric
    keys to shard on, you’ll notice that we use the total_elements value
    to calculate the total number of shards necessary, in addition to the
    shard_size that’s used for both numeric and non-numeric keys. These two
    pieces of information are necessary to keep the total number of shards down.
    If you were to change either of these numbers, then the number of shards
    (and thus the shard that any piece of data goes to) will change. Whenever
    possible, you shouldn’t change either of these values, or when you do change
    them, you should have a process for moving your data from the old data
    shards to the new data shards (this is generally known as resharding).

    We’ll now use our shard_key() to pick shards as part of two functions that will work
    like HSET and HGET on sharded hashes in the following listing.

    Listing 9.8Sharded HSET and HGET functions
    def shard_hset(conn, base, key, value, total_elements, shard_size):
    
       shard = shard_key(base, key, total_elements, shard_size)
    

    Calculate the shard to store our value in.

       return conn.hset(shard, key, value)
    

    Set the value in the shard.

    def shard_hget(conn, base, key, total_elements, shard_size):
    
       shard = shard_key(base, key, total_elements, shard_size)
    

    Calculate the shard to fetch our value from.

       return conn.hget(shard, key)
    

    Get the value in the shard.

    Nothing too complicated there; we’re finding the proper location for the data to be
    stored or fetched from the HASH and either setting or getting the values. To update
    our earlier IP-address-to-city lookup calls, we only need to replace one call in each of
    two functions. The next listing shows just those parts of our earlier functions that
    need to be updated.

    Listing 9.9Sharded IP lookup functions
    TOTAL_SIZE = 320000
    SHARD_SIZE = 1024
    

    We set the arguments for the sharded calls as global constants to ensure that we always pass the same information.

    def import_cities_to_redis(conn, filename):
       for row in csv.reader(open(filename)):
          ...
    
          shard_hset(conn, 'cityid2city:', city_id,
             json.dumps([city, region, country]),
             TOTAL_SIZE, SHARD_SIZE)
    

    To set the data, we need to pass the TOTAL_SIZE and SHARD_SIZE information, though in this case TOTAL_SIZE is unused because our IDs are numeric.

    def find_city_by_ip(conn, ip_address):
       ...
    
       data = shard_hget(conn, 'cityid2city:', city_id,
          TOTAL_SIZE, SHARD_SIZE)
    

    To fetch the data, we need to use the same information for TOTAL_SIZE and SHARD_SIZE for general sharded keys.

       return json.loads(data)
    

    On a 64-bit machine, storing the single HASH of all of our cities takes up roughly 44
    megabytes. But with these few small changes to shard our data, setting hash-maxziplist-
    entries
    to 1024 and hash-max-ziplist-value to 256 (the longest city/
    country name in the list is a bit over 150 characters), the sharded HASHes together take
    up roughly 12 megabytes. That’s a 70% reduction in data size, which would allow us to
    store 3.5 times as much data as before. For shorter keys and values, you can potentially
    see even greater percentage savings (overhead is larger relative to actual data stored).

    STORING STRINGS IN HASHESIf you find yourself storing a lot of relatively
    short strings or numbers as plain STRING values with consistently named keys
    like namespace:id, you can store those values in sharded HASHes for significant
    memory reduction in some cases.

    Exercise: Adding other operations

    As you saw, getting and setting values in a sharded HASH is easy. Can you add support
    for sharded HDEL, HINCRBY, and HINCRBYFLOAT operations?

    We’ve just finished sharding large hashes to reduce their memory use. Next, you’ll
    learn how to shard SETs.