Redis Enterprise

Technology

A Foundational Look at Redis Enterprise

RediSearch

RediSearch is a powerful text search and secondary indexing engine, built on top of Redis as a Redis Module.

Unlike Redis search libraries, it does not use Redis’ internal data structures. Using its own highly optimized data structures and algorithms, it allows for advanced search features, with high performance and a low memory footprint. It can perform simple text searches as well as complex structured queries, filtering by numeric properties and geographical distances.

RediSearch supports continuous indexing with no performance degradation, maintaining concurrent loads of both querying and indexing. This makes it ideal for searching frequently updated databases, without the need for batch indexing and service interrupts.

The Enterprise version of RediSearch supports scaling across many servers, allowing it to easily grow to billions of documents on hundreds of servers.

All of this is done while taking advantage of Redis’ robust architecture and infrastructure, utilizing Redis’ protocol, replication, persistence and clustering. RediSearch is powerful yet simple to manage and maintain, and can serve as a standalone database, or augment existing Redis databases with advanced, powerful indexing capabilities.

RediSearch is extremely fast compared to other open-source search engines. The chart below shows one result from a benchmark we conducted using Wikipedia’s pages, containing about 5.1 million short abstracts. More info about this benchmark can be found in the RediSearch whitepaper:

 

RediSearch is an open-source project under the AGPL license. The RediSearch repository can be found on GitHub, detailed documentation can be found on RedisSearch.io and a quickstart tutorial can be found in the Redis Enterprise documentation.

Main Features

  • Full-text indexing of multiple fields in document, including:
    • Exact phrase matching
    • Stemming in many languages
    • Chinese tokenization support
    • Prefix queries
    • Optional, negative and union queries
  • Distributed search on billions of documents
  • Numeric property indexing
  • Geographical indexing and radius filters
  • Incremental indexing without performance loss
  • A structured query language for advanced queries:
    • Unions and intersections
    • Optional and negative queries
    • Tag filtering
    • Prefix matching
  • A powerful auto-complete engine with fuzzy matching
  • Multiple scoring models and sorting by values
  • Concurrent low-latency insertion and updates of documents
  • Concurrent searches allowing long running queries without blocking Redis
  • An extension mechanism allowing custom scoring models and query extension
  • Support for indexing existing Hash objects in Redis databases
  • Support clustering through a coordinator entity
    Note: clustering is only available in RediSearch’s Enterprise version  

 

Indexing Documents

In order to search effectively, RediSearch needs to know how to index documents. A document may have several fields, each with its own weight (e.g. a title is usually more important than the text itself). The engine can also use numeric or geographical fields for filtering. Hence, the first step is to create the index definition, which tells RediSearch how to treat the documents we will add. For example, to define an index of products, indexing their title, description, brand and price, the index creation would look like:

 

FT.CREATE my_index SCHEMA
title TEXT WEIGHT 5
description TEXT
brand TEXT
price NUMERIC

 

Adding a document to the index would be accomplished by executing a command like:

 

FT.ADD my_index doc1 1.0 FIELDS
title "Acme 42 inch LCD TV"
description "42 inch Full-HD tv with smart tv capabilities"
brand "Acme"
price 300

 

This code tells RediSearch to take the document, break each field into its terms (“tokenization”) and index it, marking the index for each of the terms in the index as contained in this document. The product is added to the index immediately and can now be found in future searches.

Searching

Now that we have added products to our index, searching is very simple:

FT.SEARCH my_index "full hd tv"

 

This will tell RediSearch to intersect the lists of documents for each term, and return all documents containing the three terms. Of course more complex queries can be performed, and the full syntax of the query language is summarized below.

Data Structures

RediSearch uses its own custom data structures, and uses Redis’ native structures only for storing the actual document content (with Hashes).

Using specialized data structures allows for faster searching and a more memory-effective storage for index records, utilizing compression techniques like delta encoding.

Query Language

RediSearch supports a simple syntax for complex queries that can be combined together to express rich filtering and matching rules. The query is combined as a text string in the `FT.SEARCH` request and is parsed using a complex query parser.

The query language includes the following expressions:

  • Multi-word phrases are simply a list of tokens (e.g. foo bar baz) and imply intersection (AND) of the terms.
  • Exact phrases are wrapped in quotes (e.g “hello world”).
  • OR Unions (i.e word1 OR word2) are expressed with a pipe (|) (e.g. hello|hallo|shalom|hola).
  • Negation (i.e. word1 NOT word2) of expressions or sub-queries, (e.g. hello -world).
  • Prefix matches (all terms starting with a prefix) are expressed with a star (*), following a 3-letter or longer prefix.
  • Selection of specific fields using the syntax @field:hello world.
  • Numeric Range matches on numeric fields with the syntax @field:[{min} {max}].
  • Geo-radius matches on geo fields with the syntax @field:[{lon} {lat} {radius} {m|km|mi|ft}].
  • Tag field filters with the syntax @field:{tag | tag | …}.
  • Optional terms or clauses, e.g. foo ~bar, meaning “bar” is optional but documents with “bar” in them will rank higher.

Complex Query Examples

Expressions can be combined together to express complex rules. For example, let’s assume we have a database of products, where each entity has the fields `title`, `brand`, `tags` and `price`.

Expressing a generic search would be simply:

lcd tv

This would return documents containing these terms in any field. Limiting the search to specific fields (title only in this case) is expressed as:

@title:(lcd tv)

Numeric filters can be combined to filter price within a price range:

@title:(lcd tv)
@price:[100 500.2]

Multiple text fields can be accessed in different query clauses, for example to select products of multiple brands:

@title:(lcd tv)
@brand:(sony | samsung | lg)
@price:[100 500.2]

Tag fields can be used to index multi-term properties without actual full-text tokenization:

@title:(lcd tv)
@brand:(sony | samsung | lg)
@tags:{42 inch | smart tv}
@price:[100 500.2]

And negative clauses can also be added, in this example to filter out plasma and CRT TVs:

@title:(lcd tv)
@brand:(sony | samsung | lg)
@tags:{42 inch | smart tv}
@price:[100 500.2]
-@tags:{plasma | crt}

Scoring Model

RediSearch comes with a few very basic scoring functions with which to evaluate document relevance. They are all based on document scores and term frequency, regardless of the ability to use sortable fields (see below). Scoring functions are specified by adding the `SCORER {scorer_name}` argument to a search request.

If you prefer a custom scoring function, it is possible to add more functions using the [Extension API](/Extensions).

These are the pre-bundled scoring functions available in RediSearch:

  • TFIDF (Default) — Basic TF-IDF scoring with document score and proximity boosting factored in.
  • TFIDF.DOCNORM — Identical to the default TFIDF scorer, with one important distinction. Term frequencies are normalized by the length of the document (in number of terms) and not by the most frequent term. This is useful when the documents are very short.   
  • BM25 — A variation on the basic TF-IDF scorer.
  • DISMAX — A simple scorer that sums up the frequencies of the matched terms. In the case of union clauses, it gives the maximum value of those matches.
  • DOCSCORE — A scoring function that just returns the priority score of the document without applying any calculations to it. Since document scores can be updated, this is useful if you’d like to use an external score and nothing further.

Sortable Fields

It is possible to bypass the scoring function mechanism and order search results by the value of different document properties (fields) directly, even if the sorting field is not used by the query. For example, you can search for first name and sort by last name.

When creating the index with FT.CREATE, you can declare TEXT and NUMERIC properties to be SORTABLE. When a property is sortable, we can later decide to order the results by its values. For example, the following schema:

FT.CREATE users
   SCHEMA
      first_name TEXT
      last_name TEXT SORTABLE
      age NUMERIC SORTABLE

 

…would allow the following query:

 

FT.SEARCH users "john lennon" SORTBY age DESC

 

Result Highlighting and Summarization

Highlighting allows users to return only the relevant portions of a document matching a search query. This allows users to quickly see how a document relates to their query, with the search terms highlighted, usually in bold letters.

RediSearch implements high performance highlighting and summarization algorithms using the following syntax:

FT.SEARCH ...
  SUMMARIZE [FIELDS {num} {field}] [FRAGS {numFrags}]
            [LEN {fragLen}] [SEPARATOR {separator}]
  HIGHLIGHT [FIELDS {num} {field}] [TAGS {openTag} {closeTag}]

 

Summarization will fragment the text into smaller sized snippets; each snippet will contain the found term(s) and some additional surrounding context.

Highlighting will highlight the searched term (and its variants) with a user-defined tag. This may be used to display the matched text in a different typeface using a markup language, or to otherwise make the text appear differently.

Auto Completion

Another important feature for RediSearch is its auto-complete engine, which allows users to create dictionaries of weighted terms, and then query them for suggested completions to a given user prefix. Completions can have “payloads,” a user provided piece of data that can be used for display. For example, when completing the names of users, it is possible to add extra metadata about users to be displayed— if a user starts to enter the term “lcd tv” into a search, sending the prefix “lc” will return the full term as a result. The dictionary is modeled as a compact Trie (prefix tree) with weights, which is traversed to find the top suffixes of a prefix.

RediSearch also allows for fuzzy suggestions, meaning you can get suggestions to prefixes even if the user makes a typo in their prefix. This is enabled using a Levenshtein Automaton, allowing efficient searching of the dictionary for all terms within a maximal Levenshtein Distance of a term or prefix. Then suggestions are weighted based on both their original score and their distance from the prefix typed by the user. However, searching for fuzzy prefixes (especially very short ones) will traverse an enormous number of suggestions. In fact, fuzzy suggestions for any single letter will traverse the entire dictionary, so we recommend using this feature carefully and considering the performance penalty it incurs.

Note: RediSearch’s auto-complete supports unicode, allowing for fuzzy matches in non-latin languages as well.

 

Extension Model

RediSearch supports an extension mechanism, much like Redis supports modules. The API is very minimal at the time of writing, and it does not yet support dynamic loading of extensions in runtime. Instead, extensions must be written in C (or a language that has an interface with C) and compiled into dynamic libraries that will be loaded at runtime.

There are currently two kinds of extension APIs:

  1. Query Expanders, whose role is to expand query tokens (i.e. stemmers)
  2. Scoring Functions, whose role is to rank search results at query time

Extensions are compiled into dynamic libraries, and loaded into RediSearch on initialization of the module. In fact, the mechanism is based on the code of Redis’ own module system, albeit far simpler.

 

Scalable Distributed Search

While RediSearch is very fast and memory efficient, if an index is big enough, at some point it will be too slow or consume too much memory for a single machine. Then it will have to be scaled out and partitioned over several machines, each of which will hold a small part of the complete search index.

Traditional clusters map different keys to different “shards” to achieve this. However, in search indexes this approach is not practical. If we mapped each word’s index to a different shard, we would end up needing to intersect records from different servers for multi-term queries.

The way to address this challenge is to employ a technique called index partitioning, which is very simple at its core:

  • The index is split across many machines/partitions by document ID. Every such partition has a complete index of all the documents mapped to it.
  • We query all shards concurrently and merge the results from all of them into a single result.

To enable that, a new component called a “Coordinator” is added to the cluster. When searching for documents, the Coordinator receives the query and sends it to N partitions, each holding a sub index of 1/N documents. Since we’re only interested in the top K results of all partitions, each partition returns just its own top K results. We then merge the N lists of K elements and extract the top K elements from the merged list.

The following figure illustrates the Coordinator operation:

Note: This feature is currently only available in the Enterprise version of RediSearch.


Next Section  ►  Integrated Modules