e-Book - Redis in Action

This book covers the use of Redis, an in-memory database/data structure server.
  • Foreword
  • Preface
  • Acknowledgments
  • About this Book
  • About the Cover Illustration
  • Part 1: Getting Started
  • Part 2: Core concepts
  • Part 3: Next steps
  • Appendix A
  • Appendix B
  • Buy the paperback

    6.1.2 Address book autocomplete

    In the previous example, Redis was used primarily to keep track of the contact list, not
    to actually perform the autocomplete. This is okay for short lists, but for longer lists,
    fetching thousands or millions of items to find just a handful would be a waste.
    Instead, for autocomplete lists with many items, we must find matches inside Redis.

    Going back to Fake Game Company, the recent contacts chat autocomplete is one
    of the most-used social features of our game. Our number-two feature, in-game mailing,
    has been gaining momentum. To keep the momentum going, we’ll add an autocomplete
    for mailing. But in our game, we only allow users to send mail to other users
    that are in the same in-game social group as they are, which we call a guild. This helps
    to prevent abusive and unsolicited messages between users.

    Guilds can grow to thousands of members, so we can’t use our old LIST-based autocomplete
    method. But because we only need one autocomplete list per guild, we can
    use more space per member. To minimize the amount of data to be transferred to clients
    who are autocompleting, we’ll perform the autocomplete prefix calculation
    inside Redis using ZSETs.

    To store each autocomplete list will be different than other ZSET uses that you’ve
    seen before. Mostly, we’ll use ZSETs for their ability to quickly tell us whether an item
    is in the ZSET, what position (or index) a member occupies, and to quickly pull ranges
    of items from anywhere inside the ZSET. What makes this use different is that all of our
    scores will be zero. By setting our scores to zero, we use a secondary feature of ZSETs: ZSETs sort by member names when scores are equal. When all scores are zero, all
    members are sorted based on the binary ordering of the strings. In order to actually
    perform the autocomplete, we’ll insert lowercased contact names. Conveniently
    enough, we’ve only ever allowed users to have letters in their names, so we don’t need
    to worry about numbers or symbols.

    What do we do? Let’s start by thinking of names as a sorted sequence of strings like
    abc, abca, abcb, … abd. If we’re looking for words with a prefix of abc, we’re really
    looking for strings that are after abbz… and before abd. If we knew the rank of the first
    item that is before abbz… and the last item after abd, we could perform a ZRANGE call
    to fetch items between them. But, because we don’t know whether either of those
    items are there, we’re stuck. To become unstuck, all we really need to do is to insert
    items that we know are after abbz… and before abd, find their ranks, make our ZRANGE
    call, and then remove our start and end members.

    The good news is that finding an item that’s before abd but still after all valid
    names with a prefix of abc is easy: we concatenate a { (left curly brace) character onto
    the end of our prefix, giving us abc{. Why {? Because it’s the next character in ASCII
    after z. To find the start of our range for abc, we could also concatenate { to abb, getting
    abb{, but what if our prefix was aba instead of abc? How do we find a character
    before a? We take a hint from our use of the curly brace, and note that the character
    that precedes a in ASCII is ` (back quote). So if our prefix is aba, our start member will
    be ab`, and our end member will be aba{.

    Putting it all together, we’ll find the predecessor of our prefix by replacing the last
    character of our prefix with the character that came right before it. We’ll find the successor
    of our prefix by concatenating a curly brace. To prevent any issues with two prefix
    searches happening at the same time, we’ll concatenate a curly brace onto our
    prefix (for post-filtering out endpoint items if necessary). A function that will generate
    these types of ranges can be seen next.

    Listing 6.3The find_prefix_range() function
    valid_characters = '`abcdefghijklmnopqrstuvwxyz{'
    
    

    Set up our list of characters that we know about.

    def find_prefix_range(prefix):
    
       posn = bisect.bisect_left(valid_characters, prefix[-1:])
    

    Find the position of prefix character in our list of characters.

       suffix = valid_characters[(posn or 1) - 1]
    

    Find the predecessor character.

       return prefix[:-1] + suffix + '{', prefix + '{'
    

    Return the range.

    I know, it can be surprising to have spent so many paragraphs describing what we’re
    going to do, only to end up with just a few lines that actually implement it. But if we
    look at what we’re doing, we’re just finding the last character in the prefix in our presorted
    sequence of characters (using the bisect module), and then looking up the
    character that came just before it.

    CHARACTER SETS AND INTERNATIONALIZATIONThis method of finding the preceding
    and following characters in ASCII works really well for languages with
    characters that only use characters a-z. But when confronted with characters
    that aren’t in this range, you’ll find a few new challenges.

    First, you’ll have to find a method that turns all of your characters into bytes;
    three common encodings include UTF-8, UTF-16, and UTF-32 (big-endian
    and little-endian variants of UTF-16 and UTF-32 are used, but only big-endian
    versions work in this situation). Second, you’ll need to find the range of characters
    that you intend to support, ensuring that your chosen encoding leaves
    at least one character before your supported range and one character after
    your selected range in the encoded version. Third, you need to use these
    characters to replace the back quote character ` and the left curly brace character
    { in our example.

    Thankfully, our algorithm doesn’t care about the native sort order of the characters,
    only the encodings. So you can pick UTF-8 or big-endian UTF-16 or
    UTF-32, use a null to replace the back quote, and use the maximum value that
    your encoding and language supports to replace the left curly brace. (Some language
    bindings are somewhat limited, allowing only up to Unicode code point
    U+ffff for UTF-16 and Unicode code point U+2ffff for UTF-32.)

    After we have the range of values that we’re looking for, we need to insert our ending
    points into the ZSET, find the rank of those newly added items, pull some number of
    items between them (we’ll fetch at most 10 to avoid overwhelming the user), and then
    remove our added items. To ensure that we’re not adding and removing the same
    items, as would be the case if two members of the same guild were trying to message
    the same user, we’ll also concatenate a 128-bit randomly generated UUID to our start
    and end members. To make sure that the ZSET isn’t being changed when we try to
    find and fetch our ranges, we’ll use WATCH with MULTI and EXEC after we’ve inserted
    our endpoints. The full autocomplete function is shown here.

    Listing 6.4The autocomplete_on_prefix() function
    def autocomplete_on_prefix(conn, guild, prefix):
    
       start, end = find_prefix_range(prefix)
       identifier = str(uuid.uuid4())
       start += identifier
       end += identifier
    

    Find the start/ end range for the prefix.

       zset_name = 'members:' + guild
    
    
       conn.zadd(zset_name, start, 0, end, 0)
    

    Add the start/ end range items to the ZSET.

       pipeline = conn.pipeline(True)
       while 1:
          try:
             pipeline.watch(zset_name)
    
             sindex = pipeline.zrank(zset_name, start)
             eindex = pipeline.zrank(zset_name, end)
             erange = min(sindex + 9, eindex - 2)
    

    Find the ranks of our end points.

             pipeline.multi()
    

    Most of this function is bookkeeping and setup. The first part is just getting our start
    and ending points, followed by adding them to the guild’s autocomplete ZSET. When
    we have everything in the ZSET, we WATCH the ZSET to ensure that we discover if someone
    has changed it, fetch the ranks of the start and end points in the ZSET, fetch items
    between the endpoints, and clean up after ourselves.

    To add and remove members from a guild is straightforward: we only need to ZADD
    and ZREM the user from the guild’s ZSET. Both of these functions are shown here.

    Listing 6.5The join_guild() and leave_guild() functions
    def join_guild(conn, guild, user):
       conn.zadd('members:' + guild, user, 0)
    
    def leave_guild(conn, guild, user):
       conn.zrem('members:' + guild, user)
    

    Joining or leaving a guild, at least when it comes to autocomplete, is straightforward.
    We only need to add or remove names from the ZSET.

    This method of adding items to a ZSET to create a range—fetching items in the range
    and then removing those added items—can be useful. Here we use it for autocomplete,
    but this technique can also be used for arbitrary sorted indexes. In chapter 7, we’ll talk
    about a technique for improving these kinds of operations for a few different types of
    range queries, which removes the need to add and remove items at the endpoints. We’ll
    wait to talk about the other method, because it only works on some types of data, whereas
    this method works on range queries over any type of data.

    When we added our endpoints to the ZSET, we needed to be careful about other
    users trying to autocomplete at the same time, which is why we use the WATCH command.
    As our load increases, we may need to retry our operations often, which can be
    wasteful. The next section talks about a way to avoid retries, improve performance,
    and sometimes simplify our code by reducing and/or replacing WATCH with locks.