

7.1.1 Basic search theory

Searching for words in documents faster than scanning over them requires preprocessing
the documents in advance. This preprocessing step is generally known as
indexing, and the structures that we create are called inverted indexes. In the search
world, inverted indexes are well known and are the underlying structure for almost
every search engine that we’ve used on the internet. In a lot of ways, we can think of
this process as producing something similar to the index at the end of this book. We
create inverted indexes in Redis primarily because Redis has native structures that are
ideally suited for inverted indexes: the SET and ZSET.1

More specifically, an inverted index of a collection of documents will take the
words from each of the documents and create tables that say which documents contain
what words. So if we have two documents, docA and docB, containing just the
titles lord of the rings and lord of the dance, respectively, we’ll create a SET in Redis for lord
that contains both docA and docB. This signifies that both docA and docB contain the
word lord. The full inverted index for our two documents appears in figure 7.1.

Figure 7.1 The inverted index for docA and docB
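For concreteness, here is a minimal sketch of what those SETs look like when built by hand with redis-py. The idx: key prefix matches the convention used by the indexing code in listing 7.1, and the document IDs are the docA/docB examples above.

import redis

conn = redis.Redis()

# docA: "lord of the rings"    docB: "lord of the dance"
conn.sadd('idx:lord', 'docA', 'docB')   # both titles contain "lord"
conn.sadd('idx:of', 'docA', 'docB')
conn.sadd('idx:the', 'docA', 'docB')
conn.sadd('idx:rings', 'docA')          # only docA contains "rings"
conn.sadd('idx:dance', 'docB')          # only docB contains "dance"

conn.smembers('idx:lord')               # {b'docA', b'docB'}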

Knowing that the ultimate result of our index operation is a collection of Redis SETs is
helpful, but it’s also important to know how we get to those SETs.

BASIC INDEXING

Figure 7.2 The process of tokenizing text into words, then removing stop words, as run on a paragraph from an early version of this section

In order to construct our SETs of documents, we must first examine our documents for words. The process of extracting words from documents is known as parsing and tokenization; we’re producing a set of tokens (or words) that identify the document.

There are many different ways of producing tokens. The methods used for web pages could be different from methods used for rows in a relational database, or from documents from a document store. We’ll keep it simple and consider words of alphabetic characters and apostrophes (') that are at least two characters long. This accounts for the majority of words in the English language, except for I and a, which we’ll ignore.

One common addition to a tokenization process is the removal of words known as stop words. Stop words are words that occur so frequently in documents that they don’t offer a substantial amount of information, and searching for those words returns too many documents to be useful. By removing stop words, not only can we improve the performance of our searches, but we can also reduce the size of our index. Figure 7.2 shows the process of tokenization and stop word removal.
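At its core, stop word removal is just a set difference over the extracted tokens. A tiny illustrative sketch follows (listing 7.1 below does the real work, with a full stop word list):

tokens = {'lord', 'of', 'the', 'rings'}
stop_words = {'of', 'the'}      # a tiny illustrative stop word list
tokens - stop_words             # {'lord', 'rings'}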

One challenge in this process is coming up with a useful list of stop words. Every
group of documents will have different statistics on what words are most common,
which may or may not affect stop words. As part of listing 7.1, we include a list of stop
words (fetched from http://www.textfixer.com/resources/), as well as functions to
both tokenize and index a document, taking into consideration the stop words that
we want to remove.

Listing 7.1 Functions to tokenize and index a document
import re    # regular expressions, used by the WORDS_RE pattern below
STOP_WORDS = set('''able about across after all almost also am among an and any are as at be because been but by can cannot could dear did do does either else ever every for from get got had has have he her hers him his how however if in into is it its just least let like likely may me might most must my neither no nor not of off often on only or other our own rather said say says she should since so some than that the their them then there these they this tis to too twas us wants was we were what when where which while who whom why will with
would yet you your'''.split())

We predeclare our known stop words; these were fetched from http://www.textfixer.com/resources/.

WORDS_RE = re.compile("[a-z']{2,}")

A regular expression that extracts words as we defined them.

def tokenize(content):
    words = set()

Our Python set of words that we have found in the document content.

    for match in WORDS_RE.finditer(content.lower()):

Iterate over all of the words in the content.

        word = match.group().strip("'")

Strip any leading or trailing single-quote characters.

        if len(word) >= 2:
            words.add(word)

Keep any words that are still at least two characters long.

    return words - STOP_WORDS

Return the set of words that remain that are also not stop words.

def index_document(conn, docid, content):
    words = tokenize(content)

Get the tokenized words for the content.

    pipeline = conn.pipeline(True)
    for word in words:
        pipeline.sadd('idx:' + word, docid)

Add the documents to the appropriate inverted index entries.

    return len(pipeline.execute())

Return the number of unique non-stop words that were added for the document.

If we were to run our earlier docA and docB examples through the updated tokenization
and indexing step in listing 7.1, instead of having the five SETs corresponding to
lord, of, the, rings, and dance, we’d only have lord, rings, and dance, because of
and the are both stop words.
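Here is a quick interactive sketch of listing 7.1 in use, assuming a running Redis server reachable with the redis-py defaults; return values are shown as comments.

import redis

conn = redis.Redis()
index_document(conn, 'docA', 'lord of the rings')   # 2 (indexed under lord and rings)
index_document(conn, 'docB', 'lord of the dance')   # 2 (indexed under lord and dance)
conn.smembers('idx:lord')                            # {b'docA', b'docB'}
conn.smembers('idx:rings')                           # {b'docA'}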

REMOVING A DOCUMENT FROM THE INDEX If you’re in a situation where your document may have its content changed over time, you’ll want to add functionality to automatically remove the document from the index prior to reindexing the item, or a method to intelligently update only those inverted indexes that the document should be added to or removed from. A simple way of doing this would be to use the SET command to update a key with a JSON-encoded list of words that the document had been indexed under, along with a bit of code to un-index as necessary at the start of index_document().
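As one way of reading that note, the sketch below records the indexed words under a hypothetical doc:words: key and removes stale entries before reindexing; the key name and helper names are illustrative, not part of the book’s listings.

import json

def unindex_document(conn, docid):
    # Remove the document from every inverted index entry it was previously
    # indexed under, using the JSON-encoded word list stored at index time.
    old = conn.get('doc:words:' + docid)        # key name is illustrative
    if not old:
        return
    pipeline = conn.pipeline(True)
    for word in json.loads(old):
        pipeline.srem('idx:' + word, docid)
    pipeline.delete('doc:words:' + docid)
    pipeline.execute()

def reindex_document(conn, docid, content):
    # Un-index first, then index the new content and record the word list
    # so that the next reindex can clean up after this one.
    unindex_document(conn, docid)
    words = tokenize(content)
    pipeline = conn.pipeline(True)
    for word in words:
        pipeline.sadd('idx:' + word, docid)
    pipeline.set('doc:words:' + docid, json.dumps(sorted(words)))
    pipeline.execute()
    return len(words)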

Now that we have a way of generating inverted indexes for our knowledge base documents, let’s look at how we can search our documents.

BASIC SEARCHING

Searching the index for a single word is easy: we fetch the set of documents in that
word’s SET. But what if we wanted to find documents that contained two or more
words? We could fetch the SETs of documents for all of those words, and then try to
find those documents that are in all of the SETs, but as we discussed in chapter 3,
there are two commands in Redis that do this directly: SINTER and SINTERSTORE. Both
commands will discover the items that are in all of the provided SETs and, for our
example, will discover the SET of documents that contain all of the words.
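Continuing the docA/docB example, here is a minimal sketch of that intersection with redis-py (the idx:lord-rings destination key is illustrative):

conn.sinter('idx:lord', 'idx:rings')                          # {b'docA'}
conn.sinterstore('idx:lord-rings', 'idx:lord', 'idx:rings')   # stores the result, returns 1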

One of the amazing things about using inverted indexes with SET intersections is
not so much what we can find (the documents we’re looking for), and it’s not even
how quickly it can find the results—it’s how much information the search completely
ignores. When searching text the way a text editor does, a lot of useless data gets
examined. But with inverted indexes, we already know what documents have each
individual word, and it’s only a matter of filtering through the documents that have all
of the words we’re looking for.

Sometimes we want to search for items with similar meanings and have them considered
the same, which we call synonyms (at least in this context). To handle that situation,
we could again fetch all of the document SETs for those words and find all
of the unique documents, or we could use another built-in Redis operation: SUNION
or SUNIONSTORE.

Occasionally, there are times when we want to search for documents with certain
words or phrases, but we want to remove documents that have other words. There are
also Redis SET operations for that: SDIFF and SDIFFSTORE.
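The same pattern applies to unions and differences; a brief sketch of the raw commands with redis-py, again against the docA/docB index:

conn.sunion('idx:rings', 'idx:dance')   # {b'docA', b'docB'} (documents with either word)
conn.sdiff('idx:lord', 'idx:dance')     # {b'docA'} (contains lord, but not dance)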

With Redis SET operations and a bit of helper code, we can perform arbitrarily intricate
word queries over our documents. Listing 7.2 provides a group of helper functions
that will perform SET intersections, unions, and differences over the given words, storing
them in temporary SETs with an expiration time that defaults to 30 seconds.

Listing 7.2 SET intersection, union, and difference operation helpers
import uuid    # used to generate unique names for temporary result keys

def _set_common(conn, method, names, ttl=30, execute=True):
    id = str(uuid.uuid4())

Create a new temporary identifier.

    pipeline = conn.pipeline(True) if execute else conn

Set up a transactional pipeline so that we have consistent results for each individual call.

    names = ['idx:' + name for name in names]

Add the ‘idx:’ prefix to our terms.

    getattr(pipeline, method)('idx:' + id, *names)

Set up the call for one of the operations.

    pipeline.expire('idx:' + id, ttl)

Instruct Redis to expire the SET in the future.

    if execute:
        pipeline.execute()

Actually execute the operation.

    return id

Return the ID for the caller to process the results.

def intersect(conn, items, ttl=30, _execute=True):
    return _set_common(conn, 'sinterstore', items, ttl, _execute)

Helper function to perform SET intersections.

def union(conn, items, ttl=30, _execute=True):
    return _set_common(conn, 'sunionstore', items, ttl, _execute)

Helper function to perform SET unions.

def difference(conn, items, ttl=30, _execute=True):
    return _set_common(conn, 'sdiffstore', items, ttl, _execute)

Helper function to perform SET differences.

Each of the intersect(), union(), and difference() functions calls another helper
function that actually does all of the heavy lifting. This is because they all essentially
do the same thing: set up the keys, make the appropriate SET call, update the expiration,
and return the new SET’s ID. Another way of visualizing what happens when we
perform the three different SET operations SINTER, SUNION, and SDIFF can be seen in
figure 7.3, which shows the equivalent operations on Venn diagrams.
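Here is a brief sketch of the listing 7.2 helpers in use, assuming the docA/docB index built earlier; each call returns the ID of a temporary SET that expires after the default 30 seconds.

result_id = intersect(conn, ['lord', 'rings'])    # documents containing both words
conn.smembers('idx:' + result_id)                 # {b'docA'}

either_id = union(conn, ['rings', 'dance'])       # documents containing either word
without_id = difference(conn, ['lord', 'dance'])  # lord, but not dance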

This is everything necessary for programming the search engine, but what about
parsing a search query?

PARSING AND EXECUTING A SEARCH

We almost have all of the pieces necessary to perform indexing and search. We have
tokenization, indexing, and the basic functions for intersection, union, and differences.
The remaining piece is to take a text query and turn it into a search operation.
Like our earlier tokenization of documents, there are many ways to tokenize search
queries. We’ll use a method that allows for searching for documents that contain all of
the provided words, supporting both synonyms and unwanted words.

A basic search will be about finding documents that contain all of the provided
words. If we have just a list of words, that’s a simple intersect() call. In the case
where we want to remove unwanted words, we’ll say that any word with a leading
minus character (-) will be removed with difference(). To handle synonyms, we
need a way of saying "This word is a synonym," which we’ll denote by the use of the
plus (+) character prefix on a word. If we see a plus character leading a word, it’s
considered a synonym of the word that came just before it (skipping over any unwanted
words), and we’ll group synonyms together to perform a union() prior to the
higher-level intersect() call.

Figure 7.3 The SET intersection, union, and difference calls as if they were operating on Venn diagrams

Putting it all together, where + denotes synonyms and - denotes unwanted words, the
next listing shows the code for parsing a query into a Python list of lists of words
that should be considered the same, along with a list of words that are unwanted.

Listing 7.3 A function for parsing a search query
QUERY_RE = re.compile("[+-]?[a-z']{2,}")

Our regular expression for finding wanted, unwanted, and synonym words.

def parse(query):
    unwanted = set()

A unique set of unwanted words.

    all = []

Our final result of words that we’re looking to intersect.

    current = set()

The current unique set of words to consider as synonyms.

    for match in QUERY_RE.finditer(query.lower()):

Iterate over all words in the search query.

        word = match.group()
        prefix = word[:1]
        if prefix in '+-':
            word = word[1:]
        else:
            prefix = None

Discover +/- prefixes, if any.

        word = word.strip("'")
        if len(word) < 2 or word in STOP_WORDS:
            continue

Strip any leading or trailing single quotes, and skip anything that’s a stop word.

        if prefix == '-':
            unwanted.add(word)
            continue

If the word is unwanted, add it to the unwanted set.

        if current and not prefix:
            all.append(list(current))
            current = set()

Set up a new synonym set if we have no synonym prefix and we already have words.

        current.add(word)

Add the current word to the current set.

    if current:
        all.append(list(current))

Add any remaining words to the final intersection.

    return all, list(unwanted)

To give this parsing function a bit of exercise, let’s say that we wanted to search our
knowledge base for chat connection issues. What we really want to search for is an article
with connect, connection, disconnect, or disconnection, along with chat, but
because we aren’t using a proxy, we want to skip any documents that include proxy or
proxies. Here’s an example interaction that shows the query (formatted into groups
for easier reading):

>>> parse('''
connect +connection +disconnect +disconnection
chat
-proxy -proxies''')
([['disconnection', 'connection', 'disconnect', 'connect'], ['chat']],
['proxies', 'proxy'])
>>>

Our parse function properly extracted the synonyms for connect/disconnect, kept
chat separate, and discovered our unwanted proxy and proxies. We aren’t going to
be passing that parsed result around (except for maybe debugging as necessary), but
instead are going to be calling our parse() function as part of a parse_and_search()
function that union()s the individual synonym lists as necessary, intersect()ing the
final list, and removing the unwanted words with difference() as necessary. The full
parse_and_search() function is shown in the next listing.

Listing 7.4 A function to parse a query and search documents
def parse_and_search(conn, query, ttl=30):
    all, unwanted = parse(query)

Parse the query.

    if not all:
        return None

If there are only stop words, we don’t have a result.

    to_intersect = []
    for syn in all:

Iterate over each list of synonyms.

        if len(syn) > 1:
            to_intersect.append(union(conn, syn, ttl=ttl))

If the synonym list is more than one word long, perform the union operation.

        else:
            to_intersect.append(syn[0])

Otherwise, use the individual word directly.

    if len(to_intersect) > 1:
        intersect_result = intersect(conn, to_intersect, ttl=ttl)

If we have more than one word/result to intersect, intersect them.

    else:
        intersect_result = to_intersect[0]

Otherwise, use the individual word/result directly.

    if unwanted:
        unwanted.insert(0, intersect_result)
        return difference(conn, unwanted, ttl=ttl)

If we have any unwanted words, remove them from our earlier result and return it.

    return intersect_result

Otherwise, return the intersection result.

Like before, the final result will be the ID of a SET that includes the documents that
match the parameters of our search. Assuming that Fake Garage Startup has properly
indexed all of their documents using our earlier index_document() function,
parse_and_search() will perform the requested search.
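For example, here is a minimal sketch of running the earlier chat query end to end, assuming the documents have already been indexed; the result is the ID of a temporary SET holding the matching document IDs.

query = 'connect +connection +disconnect +disconnection chat -proxy -proxies'
result_id = parse_and_search(conn, query)
if result_id:
    matching_docs = conn.smembers('idx:' + result_id)   # documents matching the query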

We now have a method that’s able to search for documents with a given set of criteria.
But ultimately, when there are a large number of documents, we want to see them
in a specific order. To do that, we need to learn how to sort the results of our searches.

1 Though SETs and ZSETs could be emulated with a properly structured table and unique index in a relational
database, emulating a SET or ZSET intersection, union, or difference using SQL for more than a few terms is
cumbersome.