This book covers the use of Redis, an in-memory database/data structure server.

open all | close all

6.1.2 Address book autocomplete

In the previous example, Redis was used primarily to keep track of the contact list, not
to actually perform the autocomplete. This is okay for short lists, but for longer lists,
fetching thousands or millions of items to find just a handful would be a waste.
Instead, for autocomplete lists with many items, we must find matches inside Redis.

Going back to Fake Game Company, the recent contacts chat autocomplete is one
of the most-used social features of our game. Our number-two feature, in-game mailing,
has been gaining momentum. To keep the momentum going, we’ll add an autocomplete
for mailing. But in our game, we only allow users to send mail to other users
that are in the same in-game social group as they are, which we call a guild. This helps
to prevent abusive and unsolicited messages between users.

Guilds can grow to thousands of members, so we can’t use our old LIST-based autocomplete
method. But because we only need one autocomplete list per guild, we can
use more space per member. To minimize the amount of data to be transferred to clients
who are autocompleting, we’ll perform the autocomplete prefix calculation
inside Redis using ZSETs.

To store each autocomplete list will be different than other ZSET uses that you’ve
seen before. Mostly, we’ll use ZSETs for their ability to quickly tell us whether an item
is in the ZSET, what position (or index) a member occupies, and to quickly pull ranges
of items from anywhere inside the ZSET. What makes this use different is that all of our
scores will be zero. By setting our scores to zero, we use a secondary feature of ZSETs: ZSETs sort by member names when scores are equal. When all scores are zero, all
members are sorted based on the binary ordering of the strings. In order to actually
perform the autocomplete, we’ll insert lowercased contact names. Conveniently
enough, we’ve only ever allowed users to have letters in their names, so we don’t need
to worry about numbers or symbols.

What do we do? Let’s start by thinking of names as a sorted sequence of strings like
abc, abca, abcb, … abd. If we’re looking for words with a prefix of abc, we’re really
looking for strings that are after abbz… and before abd. If we knew the rank of the first
item that is before abbz… and the last item after abd, we could perform a ZRANGE call
to fetch items between them. But, because we don’t know whether either of those
items are there, we’re stuck. To become unstuck, all we really need to do is to insert
items that we know are after abbz… and before abd, find their ranks, make our ZRANGE
call, and then remove our start and end members.

The good news is that finding an item that’s before abd but still after all valid
names with a prefix of abc is easy: we concatenate a { (left curly brace) character onto
the end of our prefix, giving us abc{. Why {? Because it’s the next character in ASCII
after z. To find the start of our range for abc, we could also concatenate { to abb, getting
abb{, but what if our prefix was aba instead of abc? How do we find a character
before a? We take a hint from our use of the curly brace, and note that the character
that precedes a in ASCII is ` (back quote). So if our prefix is aba, our start member will
be ab`, and our end member will be aba{.

Putting it all together, we’ll find the predecessor of our prefix by replacing the last
character of our prefix with the character that came right before it. We’ll find the successor
of our prefix by concatenating a curly brace. To prevent any issues with two prefix
searches happening at the same time, we’ll concatenate a curly brace onto our
prefix (for post-filtering out endpoint items if necessary). A function that will generate
these types of ranges can be seen next.

Listing 6.3The find_prefix_range() function
valid_characters = '`abcdefghijklmnopqrstuvwxyz{'

Set up our list of characters that we know about.

def find_prefix_range(prefix):
   posn = bisect.bisect_left(valid_characters, prefix[-1:])

Find the position of prefix character in our list of characters.

   suffix = valid_characters[(posn or 1) - 1]

Find the predecessor character.

   return prefix[:-1] + suffix + '{', prefix + '{'

Return the range.

I know, it can be surprising to have spent so many paragraphs describing what we’re
going to do, only to end up with just a few lines that actually implement it. But if we
look at what we’re doing, we’re just finding the last character in the prefix in our presorted
sequence of characters (using the bisect module), and then looking up the
character that came just before it.

CHARACTER SETS AND INTERNATIONALIZATIONThis method of finding the preceding
and following characters in ASCII works really well for languages with
characters that only use characters a-z. But when confronted with characters
that aren’t in this range, you’ll find a few new challenges.

First, you’ll have to find a method that turns all of your characters into bytes;
three common encodings include UTF-8, UTF-16, and UTF-32 (big-endian
and little-endian variants of UTF-16 and UTF-32 are used, but only big-endian
versions work in this situation). Second, you’ll need to find the range of characters
that you intend to support, ensuring that your chosen encoding leaves
at least one character before your supported range and one character after
your selected range in the encoded version. Third, you need to use these
characters to replace the back quote character ` and the left curly brace character
{ in our example.

Thankfully, our algorithm doesn’t care about the native sort order of the characters,
only the encodings. So you can pick UTF-8 or big-endian UTF-16 or
UTF-32, use a null to replace the back quote, and use the maximum value that
your encoding and language supports to replace the left curly brace. (Some language
bindings are somewhat limited, allowing only up to Unicode code point
U+ffff for UTF-16 and Unicode code point U+2ffff for UTF-32.)

After we have the range of values that we’re looking for, we need to insert our ending
points into the ZSET, find the rank of those newly added items, pull some number of
items between them (we’ll fetch at most 10 to avoid overwhelming the user), and then
remove our added items. To ensure that we’re not adding and removing the same
items, as would be the case if two members of the same guild were trying to message
the same user, we’ll also concatenate a 128-bit randomly generated UUID to our start
and end members. To make sure that the ZSET isn’t being changed when we try to
find and fetch our ranges, we’ll use WATCH with MULTI and EXEC after we’ve inserted
our endpoints. The full autocomplete function is shown here.

Listing 6.4The autocomplete_on_prefix() function
def autocomplete_on_prefix(conn, guild, prefix):
   start, end = find_prefix_range(prefix)
   identifier = str(uuid.uuid4())
   start += identifier
   end += identifier

Find the start/ end range for the prefix.

   zset_name = 'members:' + guild

   conn.zadd(zset_name, start, 0, end, 0)

Add the start/ end range items to the ZSET.

   pipeline = conn.pipeline(True)
   while 1:
         sindex = pipeline.zrank(zset_name, start)
         eindex = pipeline.zrank(zset_name, end)
         erange = min(sindex + 9, eindex - 2)

Find the ranks of our end points.


Most of this function is bookkeeping and setup. The first part is just getting our start
and ending points, followed by adding them to the guild’s autocomplete ZSET. When
we have everything in the ZSET, we WATCH the ZSET to ensure that we discover if someone
has changed it, fetch the ranks of the start and end points in the ZSET, fetch items
between the endpoints, and clean up after ourselves.

To add and remove members from a guild is straightforward: we only need to ZADD
and ZREM the user from the guild’s ZSET. Both of these functions are shown here.

Listing 6.5The join_guild() and leave_guild() functions
def join_guild(conn, guild, user):
   conn.zadd('members:' + guild, user, 0)

def leave_guild(conn, guild, user):
   conn.zrem('members:' + guild, user)

Joining or leaving a guild, at least when it comes to autocomplete, is straightforward.
We only need to add or remove names from the ZSET.

This method of adding items to a ZSET to create a range—fetching items in the range
and then removing those added items—can be useful. Here we use it for autocomplete,
but this technique can also be used for arbitrary sorted indexes. In chapter 7, we’ll talk
about a technique for improving these kinds of operations for a few different types of
range queries, which removes the need to add and remove items at the endpoints. We’ll
wait to talk about the other method, because it only works on some types of data, whereas
this method works on range queries over any type of data.

When we added our endpoints to the ZSET, we needed to be careful about other
users trying to autocomplete at the same time, which is why we use the WATCH command.
As our load increases, we may need to retry our operations often, which can be
wasteful. The next section talks about a way to avoid retries, improve performance,
and sometimes simplify our code by reducing and/or replacing WATCH with locks.