    7.1.1 Basic search theory

    Searching for words in documents faster than scanning over them requires preprocessing
    the documents in advance. This preprocessing step is generally known as
    indexing, and the structures that we create are called inverted indexes. In the search
    world, inverted indexes are well known and are the underlying structure for almost
    every search engine that we’ve used on the internet. In a lot of ways, we can think of
    this process as producing something similar to the index at the end of this book. We
    create inverted indexes in Redis primarily because Redis has native structures that are
    ideally suited for inverted indexes: the SET and ZSET.[1]

    More specifically, an inverted index of a collection of documents will take the
    words from each of the documents and create tables that say which documents contain
    what words. So if we have two documents, docA and docB, containing just the
    titles lord of the rings and lord of the dance, respectively, we’ll create a SET in Redis for lord
    that contains both docA and docB. This signifies that both docA and docB contain the
    word lord. The full inverted index for our two documents appears in figure 7.1.

    Figure 7.1 The inverted index for docA and docB
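
    For concreteness, here's a minimal sketch of building figure 7.1's index by
    hand, assuming a local Redis server, the redis-py client, and the idx: key
    prefix that listing 7.1 will introduce:

    import redis

    conn = redis.Redis()                            # Assumes a Redis server on localhost:6379

    for word in ('lord', 'of', 'the', 'rings'):     # Index docA, "lord of the rings"
        conn.sadd('idx:' + word, 'docA')
    for word in ('lord', 'of', 'the', 'dance'):     # Index docB, "lord of the dance"
        conn.sadd('idx:' + word, 'docB')

    print(conn.smembers('idx:lord'))                # {b'docA', b'docB'}: both titles contain "lord"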

    Knowing that the ultimate result of our index operation is a collection of Redis SETs is
    helpful, but it’s also important to know how we get to those SETs.

    BASIC INDEXING

    Figure 7.2 The process of tokenizing text into words, then removing stop words, as run on a paragraph from an early version of this section

    To construct our SETs of documents, we must first examine our documents for
    words. The process of extracting words from documents is known as parsing
    and tokenization; we're producing a set of tokens (or words) that identify
    the document.

    There are many different ways of producing tokens. The methods used for web pages could be different from methods used for rows in a relational database, or from documents from a document store. We’ll keep it simple and consider words of alphabetic characters and apostrophes (‘) that are at least two characters long. This accounts for the majority of words in the English language, except for I and a, which we’ll ignore.

    One common addition to a tokenization process is the removal of words known as stop words. Stop words are words that occur so frequently in documents that they don’t offer a substantial amount of information, and searching for those words returns too many documents to be useful. By removing stop words, not only can we improve the performance of our searches, but we can also reduce the size of our index. Figure 7.2 shows the process of tokenization and stop word removal.

    One challenge in this process is coming up with a useful list of stop words. Every
    group of documents will have different statistics on what words are most common,
    which may or may not affect stop words. As part of listing 7.1, we include a list of stop
    words (fetched from http://www.textfixer.com/resources/), as well as functions to
    both tokenize and index a document, taking into consideration the stop words that
    we want to remove.

    Listing 7.1 Functions to tokenize and index a document

    import re

    # Our predeclared stop words; these were fetched from
    # http://www.textfixer.com/resources/.
    STOP_WORDS = set('''able about across after all almost also am among
    an and any are as at be because been but by can cannot could dear did
    do does either else ever every for from get got had has have he her
    hers him his how however if in into is it its just least let like
    likely may me might most must my neither no nor not of off often on
    only or other our own rather said say says she should since so some
    than that the their them then there these they this tis to too twas
    us wants was we were what when where which while who whom why will
    with would yet you your'''.split())

    # A regular expression that extracts words as we defined them.
    WORDS_RE = re.compile("[a-z']{2,}")

    def tokenize(content):
        words = set()                                       # The Python set of words found in the document content
        for match in WORDS_RE.finditer(content.lower()):    # Iterate over all of the words in the content
            word = match.group().strip("'")                 # Strip any leading or trailing single-quote characters
            if len(word) >= 2:                              # Keep any words that are still at least two characters long
                words.add(word)
        return words - STOP_WORDS                           # Return the set of words that remain that are also not stop words

    def index_document(conn, docid, content):
        words = tokenize(content)                           # Get the tokenized words for the content
        pipeline = conn.pipeline(True)
        for word in words:
            pipeline.sadd('idx:' + word, docid)             # Add the document to the appropriate inverted index entries
        return len(pipeline.execute())                      # Return the number of unique non-stop words that were added for the document

    If we were to run our earlier docA and docB examples through the updated tokenization
    and indexing step in listing 7.1, instead of having the five SETs corresponding to
    lord, of, the, rings, and dance, we’d only have lord, rings, and dance, because of
    and the are both stop words.
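
    As a quick check, an interaction like the following (with a local connection
    and our illustrative document names) shows the stop words falling away:

    conn = redis.Redis()
    index_document(conn, 'docA', 'lord of the rings')   # Returns 2: only lord and rings survive tokenization
    index_document(conn, 'docB', 'lord of the dance')   # Returns 2: only lord and dance survive
    conn.smembers('idx:lord')                           # {b'docA', b'docB'}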

    REMOVING A DOCUMENT FROM THE INDEX If you're in a situation where your
    document may have its content changed over time, you'll want to add
    functionality to automatically remove the document from the index prior to
    reindexing the item, or a method to intelligently update only those inverted
    indexes that the document should be added to or removed from. A simple way
    of doing this would be to use the SET command to update a key with a
    JSON-encoded list of words that the document had been indexed under, along
    with a bit of code to un-index as necessary at the start of
    index_document().
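
    As one possible shape for that idea, here's a sketch; the reindex_document()
    name and the docwords: key prefix are hypothetical, not part of the book's
    code, and it assumes Python 3:

    import json

    def reindex_document(conn, docid, content):
        words = tokenize(content)                           # The words the document should now be indexed under
        old_words = set(json.loads(conn.get('docwords:' + docid) or '[]'))
        pipeline = conn.pipeline(True)
        for word in old_words - words:                      # Un-index words the document no longer contains
            pipeline.srem('idx:' + word, docid)
        for word in words - old_words:                      # Index only the words that are new
            pipeline.sadd('idx:' + word, docid)
        pipeline.set('docwords:' + docid, json.dumps(sorted(words)))   # Remember the current word list with SET
        pipeline.execute()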

    Now that we have a way of generating inverted indexes for our knowledge base documents, let’s look at how we can search our documents.

    BASIC SEARCHING

    Searching the index for a single word is easy: we fetch the set of documents in that
    word’s SET. But what if we wanted to find documents that contained two or more
    words? We could fetch the SETs of documents for all of those words, and then try to
    find those documents that are in all of the SETs, but as we discussed in chapter 3,
    there are two commands in Redis that do this directly: SINTER and SINTERSTORE. Both
    commands will discover the items that are in all of the provided SETs and, for our
    example, will discover the SET of documents that contain all of the words.
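
    On the toy docA/docB index from earlier, that's a single call; this sketch
    assumes the idx: key prefix from listing 7.1:

    conn.sinter('idx:lord', 'idx:rings')                # {b'docA'}: the only document containing both words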

    One of the amazing things about using inverted indexes with SET intersections is
    not so much what we can find (the documents we’re looking for), and it’s not even
    how quickly it can find the results—it’s how much information the search completely
    ignores. When searching text the way a text editor does, a lot of useless data gets
    examined. But with inverted indexes, we already know what documents have each
    individual word, and it’s only a matter of filtering through the documents that have all
    of the words we’re looking for.

    Sometimes we want to search for items with similar meanings and have them considered
    the same, which we call synonyms (at least in this context). To handle that situation,
    we could again fetch all of the document SETs for those words and find all
    of the unique documents, or we could use another built-in Redis operation: SUNION
    or SUNIONSTORE.

    Occasionally, we want to search for documents with certain words or phrases
    while excluding documents that contain other words. Redis has SET operations
    for that too: SDIFF and SDIFFSTORE.
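
    Continuing with the same toy index, rough sketches of both operations:

    conn.sunion('idx:rings', 'idx:dance')               # {b'docA', b'docB'}: documents with either word
    conn.sdiff('idx:lord', 'idx:dance')                 # {b'docA'}: contains "lord" but not "dance"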

    With Redis SET operations and a bit of helper code, we can perform arbitrarily intricate
    word queries over our documents. Listing 7.2 provides a group of helper functions
    that will perform SET intersections, unions, and differences over the given words, storing
    them in temporary SETs with an expiration time that defaults to 30 seconds.

    Listing 7.2 SET intersection, union, and difference operation helpers

    import uuid

    def _set_common(conn, method, names, ttl=30, execute=True):
        id = str(uuid.uuid4())                              # Create a new temporary identifier
        pipeline = conn.pipeline(True) if execute else conn # Set up a transactional pipeline so that each individual call has consistent results
        names = ['idx:' + name for name in names]           # Add the 'idx:' prefix to our terms
        getattr(pipeline, method)('idx:' + id, *names)      # Set up the call for one of the operations
        pipeline.expire('idx:' + id, ttl)                   # Instruct Redis to expire the SET in the future
        if execute:
            pipeline.execute()                              # Actually execute the operation
        return id                                           # Return the ID for the caller to process the results

    def intersect(conn, items, ttl=30, _execute=True):      # Helper function to perform SET intersections
        return _set_common(conn, 'sinterstore', items, ttl, _execute)

    def union(conn, items, ttl=30, _execute=True):          # Helper function to perform SET unions
        return _set_common(conn, 'sunionstore', items, ttl, _execute)

    def difference(conn, items, ttl=30, _execute=True):     # Helper function to perform SET differences
        return _set_common(conn, 'sdiffstore', items, ttl, _execute)

    Each of the intersect(), union(), and difference() functions calls another helper
    function that actually does all of the heavy lifting. This is because they all essentially
    do the same thing: set up the keys, make the appropriate SET call, update the expiration,
    and return the new SET’s ID. Another way of visualizing what happens when we
    perform the three different SET operations SINTER, SUNION, and SDIFF can be seen in
    figure 7.3, which shows the equivalent operations on Venn diagrams.
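
    On the toy index, a call to one of these helpers might look roughly like
    this (words are passed bare, because the helpers add the idx: prefix
    themselves):

    result_id = intersect(conn, ['lord', 'rings'])      # Returns a UUID string naming the temporary result SET
    conn.smembers('idx:' + result_id)                   # {b'docA'}; the key expires after the default 30-second TTL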

    This is everything necessary for programming the search engine, but what about
    parsing a search query?

    PARSING AND EXECUTING A SEARCH

    We almost have all of the pieces necessary to perform indexing and search. We have
    tokenization, indexing, and the basic functions for intersection, union, and differences.
    The remaining piece is to take a text query and turn it into a search operation.
    As with our earlier tokenization of documents, there are many ways to
    tokenize search queries. We'll use a method that allows for searching for
    documents that contain all of the provided words, with support for both
    synonyms and unwanted words.

    A basic search will be about finding documents that contain all of the
    provided words. If we have just a list of words, that's a simple
    intersect() call. In the case where we want to remove unwanted words, we'll
    say that any word with a leading minus character (-) will be removed with
    difference(). To handle synonyms, we need a way of saying "This word is a
    synonym," which we'll denote by the use of the plus (+) character prefix on
    a word. If we see a plus character leading a word, it's considered a synonym
    of the word that came just before it (skipping over any unwanted words), and
    we'll group synonyms together to perform a union() prior to the higher-level
    intersect() call.

    Figure 7.3 The SET intersection, union, and difference calls as if they were operating on Venn diagrams

    Putting it all together, where + denotes synonyms and - denotes unwanted
    words, the next listing shows the code for parsing a query into a Python
    list of lists of words that should be considered the same, along with a
    list of words that are unwanted.

    Listing 7.3 A function for parsing a search query

    # Our regular expression for finding wanted, unwanted, and synonym words.
    QUERY_RE = re.compile("[+-]?[a-z']{2,}")

    def parse(query):
        unwanted = set()                                    # A unique set of unwanted words
        all = []                                            # Our final result of words that we're looking to intersect
        current = set()                                     # The current unique set of words to consider as synonyms
        for match in QUERY_RE.finditer(query.lower()):      # Iterate over all words in the search query
            word = match.group()
            prefix = word[:1]                               # Discover +/- prefixes, if any
            if prefix in '+-':
                word = word[1:]
            else:
                prefix = None
            word = word.strip("'")                          # Strip any leading or trailing single quotes,
            if len(word) < 2 or word in STOP_WORDS:         # and skip anything that's a stop word
                continue
            if prefix == '-':                               # If the word is unwanted, add it to the unwanted set
                unwanted.add(word)
                continue
            if current and not prefix:                      # Set up a new synonym set if we have no synonym prefix and we already have words
                all.append(list(current))
                current = set()
            current.add(word)                               # Add the current word to the current set
        if current:                                         # Add any remaining words to the final intersection
            all.append(list(current))
        return all, list(unwanted)

    To give this parsing function a bit of exercise, let’s say that we wanted to search our
    knowledge base for chat connection issues. What we really want to search for is an article
    with connect, connection, disconnect, or disconnection, along with chat, but
    because we aren’t using a proxy, we want to skip any documents that include proxy or
    proxies. Here’s an example interaction that shows the query (formatted into groups
    for easier reading):

    >>> parse('''
    connect +connection +disconnect +disconnection
    chat
    -proxy -proxies''')
    ([['disconnection', 'connection', 'disconnect', 'connect'], ['chat']],
    ['proxies', 'proxy'])
    >>>
    

    Our parse function properly extracted the synonyms for connect/disconnect,
    kept chat separate, and discovered our unwanted proxy and proxies. We aren't
    going to be passing that parsed result around (except maybe for debugging);
    instead, we'll call our parse() function as part of a parse_and_search()
    function that union()s the individual synonym lists as necessary,
    intersect()s the final list, and removes the unwanted words with
    difference() as necessary. The full parse_and_search() function is shown in
    the next listing.

    Listing 7.4 A function to parse a query and search documents

    def parse_and_search(conn, query, ttl=30):
        all, unwanted = parse(query)                        # Parse the query
        if not all:                                         # If there are only stop words, we don't have a result
            return None
        to_intersect = []
        for syn in all:                                     # Iterate over each list of synonyms
            if len(syn) > 1:                                # If the synonym list is more than one word long, perform the union operation
                to_intersect.append(union(conn, syn, ttl=ttl))
            else:                                           # Otherwise, use the individual word directly
                to_intersect.append(syn[0])
        if len(to_intersect) > 1:                           # If we have more than one word/result to intersect, intersect them
            intersect_result = intersect(conn, to_intersect, ttl=ttl)
        else:                                               # Otherwise, use the individual word/result directly
            intersect_result = to_intersect[0]
        if unwanted:                                        # If we have any unwanted words, remove them from our earlier result and return it
            unwanted.insert(0, intersect_result)
            return difference(conn, unwanted, ttl=ttl)
        return intersect_result                             # Otherwise, return the intersection result

    Like before, the final result will be the ID of a SET that includes the documents that
    match the parameters of our search. Assuming that Fake Garage Startup has properly
    indexed all of their documents using our earlier index_document() function,
    parse_and_search() will perform the requested search.
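
    Putting the pieces together, an end-to-end run might look roughly like the
    following; the query string is just an example:

    result_id = parse_and_search(conn, 'connect +connection chat -proxy')
    if result_id:
        matches = conn.smembers('idx:' + result_id)     # The IDs of documents matching the query
                                                        # (the temporary result SET expires after ttl seconds)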

    We now have a method that’s able to search for documents with a given set of criteria.
    But ultimately, when there are a large number of documents, we want to see them
    in a specific order. To do that, we need to learn how to sort the results of our searches.

    [1] Though SETs and ZSETs could be emulated with a properly structured table
    and unique index in a relational database, emulating a SET or ZSET
    intersection, union, or difference using SQL for more than a few terms is
    cumbersome.