7.1.1 Basic search theory
Searching for words in documents faster than scanning over them requires preprocessing
the documents in advance. This preprocessing step is generally known as
indexing, and the structures that we create are called inverted indexes. In the search
world, inverted indexes are well known and are the underlying structure for almost
every search engine that we’ve used on the internet. In a lot of ways, we can think of
this process as producing something similar to the index at the end of this book. We
create inverted indexes in Redis primarily because Redis has native structures that are
ideally suited for inverted indexes: the SET and ZSET.1
More specifically, an inverted index of a collection of documents will take the
words from each of the documents and create tables that say which documents contain
what words. So if we have two documents, docA and docB, containing just the
titles lord of the rings and lord of the dance, respectively, we’ll create a SET in Redis for lord
that contains both docA and docB. This signifies that both docA and docB contain the
word lord. The full inverted index for our two documents appears in figure 7.1.
Knowing that the ultimate result of our index operation is a collection of Redis SETs is
helpful, but it’s also important to know how we get to those SETs.
In order to construct our SETs of documents,
we must first examine our documents for words. The process of extracting words from documents is known as parsing and tokenization; we’re producing a set of tokens (or words) that identify the document.
There are many different ways of producing tokens. The methods used for web pages could be different from methods used for rows in a relational database, or from documents from a document store. We’ll keep it simple and consider words of alphabetic characters and apostrophes (‘) that are at least two characters long. This accounts for the majority of words in the English language, except for I and a, which we’ll ignore.
One common addition to a tokenization process is the removal of words known as stop words. Stop words are words that occur so frequently in documents that they don’t offer a substantial amount of information, and searching for those words returns too many documents to be useful. By removing stop words, not only can we improve the performance of our searches, but we can also reduce the size of our index. Figure 7.2 shows the process of tokenization and stop word removal.
One challenge in this process is coming up with a useful list of stop words. Every
group of documents will have different statistics on what words are most common,
which may or may not affect stop words. As part of listing 7.1, we include a list of stop
words (fetched from http://www.textfixer.com/resources/), as well as functions to
both tokenize and index a document, taking into consideration the stop words that
we want to remove.
If we were to run our earlier docA and docB examples through the updated tokenization
and indexing step in listing 7.1, instead of having the five SETs corresponding to
lord, of, the, rings, and dance, we’d only have lord, rings, and dance, because of
and the are both stop words.
REMOVING A DOCUMENT FROM THE INDEXIf you’re in a situation where your
document may have its content changed over time, you’ll want to add functionality
to automatically remove the document from the index prior to reindexing
the item, or a method to intelligently update only those inverted
indexes that the document should be added or removed from. A simple way
of doing this would be to use the SET command to update a key with a JSONencoded
list of words that the document had been indexed under, along with
a bit of code to un-index as necessary at the start of index_document().
Now that we have a way of generating inverted indexes for our knowledge base documents, let’s look at how we can search our documents.
Searching the index for a single word is easy: we fetch the set of documents in that
word’s SET. But what if we wanted to find documents that contained two or more
words? We could fetch the SETs of documents for all of those words, and then try to
find those documents that are in all of the SETs, but as we discussed in chapter 3,
there are two commands in Redis that do this directly: SINTER and SINTERSTORE. Both
commands will discover the items that are in all of the provided SETs and, for our
example, will discover the SET of documents that contain all of the words.
One of the amazing things about using inverted indexes with SET intersections is
not so much what we can find (the documents we’re looking for), and it’s not even
how quickly it can find the results—it’s how much information the search completely
ignores. When searching text the way a text editor does, a lot of useless data gets
examined. But with inverted indexes, we already know what documents have each
individual word, and it’s only a matter of filtering through the documents that have all
of the words we’re looking for.
Sometimes we want to search for items with similar meanings and have them considered
the same, which we call synonyms (at least in this context). To handle that situation,
we could again fetch all of the document SETs for those words and find all
of the unique documents, or we could use another built-in Redis operation: SUNION
Occasionally, there are times when we want to search for documents with certain
words or phrases, but we want to remove documents that have other words. There are
also Redis SET operations for that: SDIFF and SDIFFSTORE.
With Redis SET operations and a bit of helper code, we can perform arbitrarily intricate
word queries over our documents. Listing 7.2 provides a group of helper functions
that will perform SET intersections, unions, and differences over the given words, storing
them in temporary SETs with an expiration time that defaults to 30 seconds.
Each of the intersect(), union(), and difference() functions calls another helper
function that actually does all of the heavy lifting. This is because they all essentially
do the same thing: set up the keys, make the appropriate SET call, update the expiration,
and return the new SET’s ID. Another way of visualizing what happens when we
perform the three different SET operations SINTER, SUNION, and SDIFF can be seen in
figure 7.3, which shows the equivalent operations on Venn diagrams.
This is everything necessary for programming the search engine, but what about
parsing a search query?
PARSING AND EXECUTING A SEARCH
We almost have all of the pieces necessary to perform indexing and search. We have
tokenization, indexing, and the basic functions for intersection, union, and differences.
The remaining piece is to take a text query and turn it into a search operation.
Like our earlier tokenization of documents, there are many ways to tokenize search
queries. We’ll use a method that allows for searching for documents that contain all of
the provided words, supporting both synonyms and unwanted words.
A basic search will be about finding documents that contain all of the provided
words. If we have just a list of words, that’s a simple intersect() call. In the case
where we want to remove unwanted words, we’ll say that any word with a leading
minus character (-) will be removed with difference(). To handle synonyms, we
need a way of saying “This word is a synonym,” which we’ll denote by the use of the
plus (+) character prefix on a word. If we see a plus character leading a word, it’s considered
a synonym of the word that came just before it (skipping over any unwanted
words), and we’ll group synonyms together to perform a union() prior to the higherlevel
Putting it all together where + denotes synonyms and – denotes unwanted words, the
next listing shows the code for parsing a query into a Python list of lists that describes
words that should be considered the same, and a list of words that are unwanted.
To give this parsing function a bit of exercise, let’s say that we wanted to search our
knowledge base for chat connection issues. What we really want to search for is an article
with connect, connection, disconnect, or disconnection, along with chat, but
because we aren’t using a proxy, we want to skip any documents that include proxy or
proxies. Here’s an example interaction that shows the query (formatted into groups
for easier reading):
Our parse function properly extracted the synonyms for connect/disconnect, kept
chat separate, and discovered our unwanted proxy and proxies. We aren’t going to
be passing that parsed result around (except for maybe debugging as necessary), but
instead are going to be calling our parse() function as part of a parse_and_search()
function that union()s the individual synonym lists as necessary, intersect()ing the
final list, and removing the unwanted words with difference() as necessary. The full
parse_and_search() function is shown in the next listing.
Like before, the final result will be the ID of a SET that includes the documents that
match the parameters of our search. Assuming that Fake Garage Startup has properly
indexed all of their documents using our earlier index_document() function,
parse_and_search() will perform the requested search.
We now have a method that’s able to search for documents with a given set of criteria.
But ultimately, when there are a large number of documents, we want to see them
in a specific order. To do that, we need to learn how to sort the results of our searches.
1 Though SETs and ZSETs could be emulated with a properly structured table and unique index in a relational
database, emulating a SET or ZSET intersection, union, or difference using SQL for more than a few terms is