Documentation - Redise Pack

A guide to Redise Pack installation, operation and administration

open all | close all

6.1.2 Address book autocomplete

In the previous example, Redis was used primarily to keep track of the contact list, not to actually perform the autocomplete. This is okay for short lists, but for longer lists, fetching thousands or millions of items to find just a handful would be a waste. Instead, for autocomplete lists with many items, we must find matches inside Redis.

Going back to Fake Game Company, the recent contacts chat autocomplete is one of the most-used social features of our game. Our number-two feature, in-game mailing, has been gaining momentum. To keep the momentum going, we’ll add an autocomplete for mailing. But in our game, we only allow users to send mail to other users that are in the same in-game social group as they are, which we call a guild. This helps to prevent abusive and unsolicited messages between users.

Guilds can grow to thousands of members, so we can’t use our old LIST-based autocomplete method. But because we only need one autocomplete list per guild, we can use more space per member. To minimize the amount of data to be transferred to clients who are autocompleting, we’ll perform the autocomplete prefix calculation inside Redis using ZSETs.

To store each autocomplete list will be different than other ZSET uses that you’ve seen before. Mostly, we’ll use ZSETs for their ability to quickly tell us whether an item is in the ZSET, what position (or index) a member occupies, and to quickly pull ranges of items from anywhere inside the ZSET. What makes this use different is that all of our scores will be zero. By setting our scores to zero, we use a secondary feature of ZSETs: ZSETs sort by member names when scores are equal. When all scores are zero, all members are sorted based on the binary ordering of the strings. In order to actually perform the autocomplete, we’ll insert lowercased contact names. Conveniently enough, we’ve only ever allowed users to have letters in their names, so we don’t need to worry about numbers or symbols.

What do we do? Let’s start by thinking of names as a sorted sequence of strings like abcabcaabcb, … abd. If we’re looking for words with a prefix of abc, we’re really looking for strings that are after abbz… and before abd. If we knew the rank of the first item that is before abbz… and the last item after abd, we could perform a ZRANGE call to fetch items between them. But, because we don’t know whether either of those items are there, we’re stuck. To become unstuck, all we really need to do is to insert items that we know are after abbz… and before abd, find their ranks, make our ZRANGE call, and then remove our start and end members.

The good news is that finding an item that’s before abd but still after all valid names with a prefix of abc is easy: we concatenate a { (left curly brace) character onto the end of our prefix, giving us abc{. Why {? Because it’s the next character in ASCII after z. To find the start of our range for abc, we could also concatenate { to abb, getting abb{, but what if our prefix was aba instead of abc? How do we find a character before a? We take a hint from our use of the curly brace, and note that the character that precedes a in ASCII is ` (back quote). So if our prefix is aba, our start member will be ab`, and our end member will be aba{.

Putting it all together, we’ll find the predecessor of our prefix by replacing the last character of our prefix with the character that came right before it. We’ll find the successor of our prefix by concatenating a curly brace. To prevent any issues with two prefix searches happening at the same time, we’ll concatenate a curly brace onto our prefix (for post-filtering out endpoint items if necessary). A function that will generate these types of ranges can be seen next.

Listing 6.3The find_prefix_range() function
valid_characters = '`abcdefghijklmnopqrstuvwxyz{'

Set up our list of characters that we know about.

def find_prefix_range(prefix):
   posn = bisect.bisect_left(valid_characters, prefix[-1:])

Find the position of prefix character in our list of characters.

   suffix = valid_characters[(posn or 1) - 1]

Find the predecessor character.

   return prefix[:-1] + suffix + '{', prefix + '{'

Return the range.

I know, it can be surprising to have spent so many paragraphs describing what we’re going to do, only to end up with just a few lines that actually implement it. But if we look at what we’re doing, we’re just finding the last character in the prefix in our presorted sequence of characters (using the bisect module), and then looking up the character that came just before it.

CHARACTER SETS AND INTERNATIONALIZATIONThis method of finding the preceding and following characters in ASCII works really well for languages with characters that only use characters a-z. But when confronted with characters that aren’t in this range, you’ll find a few new challenges.

First, you’ll have to find a method that turns all of your characters into bytes; three common encodings include UTF-8, UTF-16, and UTF-32 (big-endian and little-endian variants of UTF-16 and UTF-32 are used, but only big-endian
versions work in this situation). Second, you’ll need to find the range of characters that you intend to support, ensuring that your chosen encoding leaves at least one character before your supported range and one character after
your selected range in the encoded version. Third, you need to use these characters to replace the back quote character ` and the left curly brace character { in our example.

Thankfully, our algorithm doesn’t care about the native sort order of the characters, only the encodings. So you can pick UTF-8 or big-endian UTF-16 or UTF-32, use a null to replace the back quote, and use the maximum value that
your encoding and language supports to replace the left curly brace. (Some language bindings are somewhat limited, allowing only up to Unicode code point U+ffff for UTF-16 and Unicode code point U+2ffff for UTF-32.)

After we have the range of values that we’re looking for, we need to insert our ending points into the ZSET, find the rank of those newly added items, pull some number of items between them (we’ll fetch at most 10 to avoid overwhelming the user), and then remove our added items. To ensure that we’re not adding and removing the same items, as would be the case if two members of the same guild were trying to message the same user, we’ll also concatenate a 128-bit randomly generated UUID to our start and end members. To make sure that the ZSET isn’t being changed when we try to find and fetch our ranges, we’ll use WATCH with MULTI and EXEC after we’ve inserted
our endpoints. The full autocomplete function is shown here.

Listing 6.4 The autocomplete_on_prefix() function
def autocomplete_on_prefix(conn, guild, prefix):
   start, end = find_prefix_range(prefix)
   identifier = str(uuid.uuid4())
   start += identifier
   end += identifier

Find the start/ end range for the prefix.

   zset_name = 'members:' + guild

   conn.zadd(zset_name, start, 0, end, 0)

Add the start/ end range items to the ZSET.

   pipeline = conn.pipeline(True)
   while 1:
      try:
         pipeline.watch(zset_name)
         sindex = pipeline.zrank(zset_name, start)
         eindex = pipeline.zrank(zset_name, end)
         erange = min(sindex + 9, eindex - 2)

Find the ranks of our end points.

         pipeline.multi()

Most of this function is bookkeeping and setup. The first part is just getting our start and ending points, followed by adding them to the guild’s autocomplete ZSET. When we have everything in the ZSET, we WATCH the ZSET to ensure that we discover if someone has changed it, fetch the ranks of the start and end points in the ZSET, fetch items between the endpoints, and clean up after ourselves.

To add and remove members from a guild is straightforward: we only need to ZADD and ZREM the user from the guild’s ZSET. Both of these functions are shown here.

Listing 6.5The join_guild() and leave_guild() functions
def join_guild(conn, guild, user):
   conn.zadd('members:' + guild, user, 0)

def leave_guild(conn, guild, user):
   conn.zrem('members:' + guild, user)

Joining or leaving a guild, at least when it comes to autocomplete, is straightforward. We only need to add or remove names from the ZSET.

This method of adding items to a ZSET to create a range—fetching items in the range and then removing those added items—can be useful. Here we use it for autocomplete, but this technique can also be used for arbitrary sorted indexes. In chapter 7, we’ll talk about a technique for improving these kinds of operations for a few different types of range queries, which removes the need to add and remove items at the endpoints. We’ll wait to talk about the other method, because it only works on some types of data, whereas this method works on range queries over any type of data.

When we added our endpoints to the ZSET, we needed to be careful about other users trying to autocomplete at the same time, which is why we use the WATCH command. As our load increases, we may need to retry our operations often, which can be wasteful. The next section talks about a way to avoid retries, improve performance, and sometimes simplify our code by reducing and/or replacing WATCH with locks.