e-Book - Redis in Action

This book covers the use of Redis, an in-memory database/data structure server.
  • Foreword
  • Preface
  • Acknowledgments
  • About this Book
  • About the Cover Illustration
  • Part 1: Getting Started
  • Part 2: Core concepts
  • Part 3: Next steps
  • Appendix A
  • Appendix B
  • Buy the paperback

    9.2.2 SETs

    One common use of an operation known as map-reduce (which I mentioned in chapters
    1 and 6) is calculating unique visitors to a website. Rather than waiting until the
    end of the day to perform that calculation, we could instead keep a live updated count
    of unique visitors as the day goes on. One method to calculate unique visitors in Redis
    would use a SET, but a single SET storing many unique visitors would be very large.
    In this section, we’ll shard SETs as a way of building a method to count unique visitors
    to a website.

    To start, we’ll assume that every visitor already has a unique identifier similar to
    the UUIDs that we generated in chapter 2 for our login session cookies. Though we
    could use these UUIDs directly in our SET as members and as keys to shard using our
    sharding function from section 9.2.1, we’d lose the benefit of the intset encoding.
    Assuming that we generated our UUIDs randomly (as we’ve done in previous chapters),
    we could instead use the first 15 hexadecimal digits from the UUID as a full key.
    This should bring up two questions: First, why would we want to do this? And second,
    why is this enough?

    For the first question (why we’d want to do this), UUIDs are basically 128-bit numbers
    that have been formatted in an easy-to-read way. If we were to store them, we’d be
    storing roughly 16 bytes (or 36 if we stored them as-is) per unique visitor. But by only
    storing the first 15 hexadecimal digits3 turned into a number, we’d only be storing 8
    bytes per unique visitor. So we save space up front, which we may be able to use later
    for other problems. This also lets us use the intset optimization for keeping memory
    use down.

    For the second question (why this is enough), it boils down to what are called birthday
    collisions
    . Put simply: What are the chances of two 128-bit random identifiers matching
    in the first 56 bits? Mathematically, we can calculate the chances exactly, and as long
    as we have fewer than 250 million unique visitors in a given time period (a day in our
    case), we’ll have at most a 1% chance of a single match (so if every day we have 250 million
    visitors, about once every 100 days we’ll have about 1 person not counted). If we
    have fewer than 25 million unique visitors, then the chance of not counting a user falls
    to the point where we’d need to run the site for roughly 2,739 years before we’d miss
    counting a single user.

    Now that we’ve decided to use the first 56 bits from the UUID, we’ll build a sharded
    SADD function, which we’ll use as part of a larger bit of code to actually count unique
    visitors. This sharded SADD function in listing 9.10 will use the same shard key calculation
    that we used in section 9.2.1, modified to prefix our numeric ID with a nonnumeric
    character for shard ID calculation, since our 56-bit IDs aren’t densely packed
    (as is the assumption for numeric IDs).

    Listing 9.10A sharded SADD function we’ll use as part of a unique visitor counter
    def shard_sadd(conn, base, member, total_elements, shard_size):
       shard = shard_key(base,
    
          'x'+str(member), total_elements, shard_size)
    

    Shard the member into one of the sharded SETs; remember to turn it into a string because it isn’t a sequential ID.

       return conn.sadd(shard, member)
    

    Actually add the member to the shard.

    With a sharded SADD function, we can now keep unique visitor counts. When we want
    to count a visitor, we’ll first calculate their shorter ID based on the first 56 bits of their
    session UUID. We’ll then determine today’s date and add the ID to the sharded unique
    visitor SET for today. If the ID wasn’t already in the SET, we’ll increment today’s unique
    visitor count. Our code for keeping track of the unique visitor count can be seen in
    the following listing.

    Listing 9.11A function to keep track of the unique visitor count on a daily basis
    SHARD_SIZE = 512
    

    We stick with a typical shard size for the intset encoding for SETs.

    def count_visit(conn, session_id):
    
       today = date.today()
       key = 'unique:%s'%today.isoformat()
    

    Get today’s date and generate the key for the unique count.

       expected = get_expected(conn, key, today)
    

    Fetch or calculate the expected number of unique views today.

       id = int(session_id.replace('-', '')[:15], 16)
    

    Calculate the s56-bit ID for this s128-bit UUID.

       if shard_sadd(conn, key, id, expected, SHARD_SIZE):
    

    Add the ID to the sharded SET.

          conn.incr(key)
    

    If the ID wasn’t in the sharded SET, then we increment our unique view count.

    That function works exactly as described, though you’ll notice that we make a call to
    get_expected() to determine the number of expected daily visitors. We do this
    because web page visits will tend to change over time, and keeping the same number
    of shards every day wouldn’t grow as we grow (or shrink if we have significantly fewer
    than a million unique visitors daily).

    To address the daily change in expected viewers, we’ll write a function that calculates
    a new expected number of unique visitors for each day, based on yesterday’s count.
    We’ll calculate this once for any given day, estimating that today will see at least 50%
    more visitors than yesterday, rounded up to the next power of 2. Our code for calculating
    this can be seen next.

    Listing 9.12Calculate today’s expected unique visitor count based on yesterday’s count
    DAILY_EXPECTED = 1000000
    

    We start with an initial expected number of daily visits that may be a little high.

    EXPECTED = {}
    

    Keep a local copy of any calculated expected counts.

    def get_expected(conn, key, today):
    
       if key in EXPECTED:
          return EXPECTED[key]
    

    If we’ve already calculated or seen the expected number of views for today, use that number.

       exkey = key + ':expected'
    
       expected = conn.get(exkey)
    

    If someone else has already calculated the expected number of views for today, use that number.

       if not expected:
    
          yesterday = (today - timedelta(days=1)).isoformat()
          expected = conn.get('unique:%s'%yesterday)
          expected = int(expected or DAILY_EXPECTED)
    

    Fetch the unique count for yesterday, or if not available, use our default 1 million.

          expected = 2**int(math.ceil(math.log(expected*1.5, 2)))
    

    Add 50% to yesterday’s count, and round up to the next even power of 2, under the assumption that view count today should be at least 50% better than yesterday.

          if not conn.setnx(exkey, expected):
    

    Save our calculated expected number of views back to Redis for other calls if possible.

             expected = conn.get(exkey)
    

    If someone else stored the expected count for today before us, use their count instead.

       EXPECTED[key] = int(expected)
       return EXPECTED[key]
    

    Keep a local copy of today’s expected number of hits, and return it back to the caller.

    Most of that function is reading and massaging data in one way or another, but the overall
    result is that we calculate an expected number of unique views for today by taking
    yesterday’s view count, increasing it by 50%, and rounding up to the next power of 2.
    If the expected number of views for today has already been calculated, we’ll use that.

    Taking this exact code and adding 1 million unique visitors, Redis will use approximately
    9.5 megabytes to store the unique visitor count. Without sharding, Redis
    would use 56 megabytes to store the same data (56-bit integer IDs in a single SET).
    That’s an 83% reduction in storage with sharding, which would let us store 5.75 times
    as much data with the same hardware.

    Exercise: Filling out the sharded SET API

    For this example, we only needed a single SET command to determine the unique visitor
    count for a given day. Can you add sharded SREM and SISMEMBER calls? Bonus
    points: Assuming that you have two sharded SETs with the same expected total number
    of items, as well as the same shard size, you’ll have the same number of shards,
    and identical IDs will be in the same shard IDs. Can you add sharded versions of SINTERSTORE,
    SUNIONSTORE, and SDIFFSTORE?

    OTHER METHODS TO CALCULATE UNIQUE VISITOR COUNTSIf you have numeric
    visitor IDs (instead of UUIDs), and the visitor IDs have relatively low maximum
    value, rather than storing your visitor information as sharded SETs, you can
    store them as bitmaps using techniques similar to what we describe in the
    next section. A Python library for calculating unique visitor counts and other
    interesting analytics based on bitmaps can be found at https://github.com/Doist/bitmapist.

    After sharding large SETs of integers to reduce storage, it’s now time to learn how to
    pack bits and bytes into STRINGs.

    3 Another good question is why 56 and not 64 bits? That’s because Redis will only use intsets for up to 64-bit signed integers, and the extra work of turning our 64-bit unsigned integer into a signed integer isn’t worth it in most situations. If you need the extra precision, check out the Python struct module and look at the Q and q format codes.