This book covers the use of Redis, an in-memory database/data structure server.

open all | close all

7.2.2 Non-numeric sorting with ZSETs

Typical comparison operations between strings will examine two strings character by
character until one character is different, one string runs out of characters, or until
they’re found to be equal. In order to offer the same sort of functionality with string
data, we need to turn strings into numbers. In this section, we’ll talk about methods of
converting strings into numbers that can be used with Redis ZSETs in order to sort
based on string prefixes. After reading this section, you should be able to sort your
ZSET search results with strings.

Our first step in translating strings into numbers is understanding the limitations
of what we can do. Because Redis uses IEEE 754 double-precision floating-point values
to store scores, we’re limited to at most 64 bits worth of storage. Due to some subtleties
in the way doubles represent numbers, we can’t use all 64 bits. Technically, we
could use more than the equivalent of 63 bits, but that doesn’t buy us significantly more than 63 bits, and for our case, we’ll only use 48 bits for the sake of simplicity.
Using 48 bits limits us to prefixes of 6 bytes on our data, which is often sufficient.

To convert our string into an integer, we’ll trim our string down to six characters (as
necessary), converting each character into its ASCII value. We’ll then extend the values
to six entries for strings shorter than six characters. Finally, we’ll combine all of the values
into an integer. Our code for converting a string into a score can be seen next.

Listing 7.8A function to turn a string into a numeric score
def string_to_score(string, ignore_case=False):
    if ignore_case:
        string = string.lower()

We can handle optional caseinsensitive indexes easily, so we will.

    pieces = map(ord, string[:6])

Convert the first six characters of the string into their numeric values, null being 0, tab being 9, capital A being 65, and so on.

    while len(pieces) < 6:

For strings that aren’t at least six characters long, we’ll add placeholder values to represent that the strings are short.

    score = 0
    for piece in pieces:
        score = score * 257 + piece + 1

For each value in the converted string values, we add it to the score, taking into consideration that a null is different from a placeholder.

    return score * 2 + (len(string) > 6)

Because we have an extra bit, we can also signify whether the string is exactly six characters or more, allowing us to differentiate “robber” and “robbers”, though not “robbers” and “robbery”.

Most of our string_to_score() function should be straightforward, except for
maybe our use of -1 as a filler value for strings shorter than six characters, and our use
of 257 as a multiplier before adding each character value to the score. For many applications,
being able to differentiate between hello\0 and hello can be important, so
we take steps to differentiate the two, primarily by adding 1 to all ASCII values (making
null 1), and using 0 (-1 + 1) as a filler value for short strings. As a bonus, we use an
extra bit to tell us whether a string is more than six characters long, which helps us
with similar six-character prefixes.2

By mapping strings to scores, we’re able to get a prefix comparison of a little more
than the first six characters of our string. For non-numeric data, this is more or less
what we can reasonably do without performing extensive numeric gymnastics, and
without running into issues with how a non-Python library transfers large integers
(that may or may not have been converted to a double).

When using scores derived from strings, the scores aren’t always useful for combining
with other scores and are typically only useful for defining a single sort order.

Exercise: Autocompleting with strings as scores

Back in section 6.1.2, we used ZSETs with scores set to 0 to allow us to perform
prefix matching on user names. We had to add items to the ZSET and either use
WATCH/MULTI/EXEC or the lock that we covered in section 6.2 to make sure that we
fetched the correct range. But if instead we added names with scores being the result
of string_to_score() on the name itself, we could bypass the use of WATCH/
MULTI/EXEC and locks when someone is looking for a prefix of at most six characters
by using ZRANGEBYSCORE, with the endpoints we had calculated before being converted
into scores as just demonstrated. Try rewriting our find_prefix_range() and
autocomplete_on_prefix() functions to use ZRANGEBYSCORE instead.

Exercise: Autocompleting with longer strings

In this section and for the previous exercise, we converted arbitrary binary strings to
scores, which limited us to six-character prefixes. By reducing the number of valid
characters in our input strings, we don’t need to use a full 8+ bits per input character.
Try to come up with a method that would allow us to use more than a six-character
prefix if we only needed to autocomplete on lowercase alphabetic characters.

that this is because the score that we produced from the string doesn’t really mean
anything, aside from defining a sort order.

Now that we can sort on arbitrary data, and you’ve seen how to use weights to
adjust and combine numeric data, you’re ready to read about how we can use Redis
SETs and ZSETs to target ads.

2 If we didn’t care about differentiating between hello\0 and hello, then we wouldn’t need the filler. If we
didn’t need the filler, we could replace our multiplier of 257 with 256 and get rid of the +1 adjustment. But
with the filler, we actually use .0337 additional bits to let us differentiate short strings from strings that have
nulls. And when combined with the extra bit we used to distinguish whether we have strings longer than six
characters, we actually use 49.0337 total bits.