Streaming Analytics with Probabilistic Data Structures

Loris Cro by Loris Cro

Despite the complex-sounding name, probabilistic data structures are relatively simple data structures that can be very useful for solving streaming-analytics problems at scale.

Probabilistic data structures make heavy use of concepts from information theory (such as hashing) to give you an approximate view of some characteristics of your data. The reason why you might be fine with an approximation versus a complete and precise description is because in exchange they save an enormous amount of storage space and deliver great performance compared to traditional data structures.

Streaming analytics

One situation where performance and space savings are extremely important is streaming analytics. Streaming analytics, as opposed to batch analytics, are about providing insight on a dataset of unbounded size that gets constantly streamed into the analytics engine. The unbounded nature of those datasets precludes the usage of some traditional approaches, making probabilistic data structures an invaluable tool in a data engineer’s toolbelt.

Let’s take a look at some practical examples where Redis can help you perform analytics at a huge scale using probabilistic data structures. In the following examples we will see HyperLogLog, which is part of Redis, and other data structures (Bloom filters, TopK, Count-min Sketch) which are available in RedisBloom, a Redis module developed by Redis Labs.

HyperLogLog

HyperLogLog (HLL) counts unique items in a stream, without having to remember the whole history. In fact, an HLL counter takes up to 12KB of memory regardless of how many items you put into it. Lastly, an HLL counter will have a standard error rate of 0.81%, which is a perfectly sustainable error rate for most streaming analytics use cases.

Adding new items

To add new items to an HLL counter you must use PFADD

PFADD home-uniques "73.23.4.3" "185.23.54.8" "user1@foo.com" "user2@bar.com" 
In the above example, we created an HLL counter to count unique views of a page on our website and added four entries to it: two anonymous users identified by IP and two known users identified by their email address.

To the HLL counter, all elements are strings without any specific meaning, so it’s up to you to give them the right structure. As an example, we might want to count unique pageviews per day, in which case we could structure our elements like this:

PFADD home-uniques "2020-01-01:185.23.54.8" "2020-01-01:user1@foo.com"

This way, two views from the same user on different days will produce different elements. Keep in mind that the length of each element won’t affect the amount of space consumed by the counter because each entry will get hashed before insertion. The same applies for the data structures presented next.

Counting

To get the count from an HLL counter you must use PFCOUNT:

PFCOUNT home-uniques

This command will return the total number of unique items present in the filter. More interestingly, when called on multiple counters at once, PFCOUNT will perform set union before counting, meaning that any element present in more than one counter will be counted only once.

PFCOUNT blogpost1-uniques blogpost2-uniques

This command will return the unique views counted from the union of the two counters. This is useful because by dividing that number by the sum of the individual unique pageviews you will also get a measure of correlation between the two pages.

HLL counters also support an operation called PFMERGE, which is basically the same operation that PFCOUNT performs when called on multiple counters (i.e., set union).

Bloom filters

Bloom filters answer set membership questions (i.e. “is this element part of the set?”). They also deliver extreme space savings but, unlike HLL, the amount of space they require depends on the number of elements you plan to add to them and the error rate you are willing to accept. To give you a rough idea, a Bloom filter sized for 1 million items (of any size) with a 1% error rate takes up approximately 1MB of storage space.

About the error rate

In HLL, the error rate is manifested in the total count being slightly off compared to the actual number. In Bloom filters, the error probability affects positive answers. In other words, when you ask a Bloom filter if it contains a given element, the answer is either “Probably yes” or “Definitely no.” 

Usage examples

Bloom filters can be used to answer such questions as:

  • Is this URL malicious?
  • Is this word contained in the document?
  • Has this URL already been crawled?
  • Was this entry already present in the stream?

Creating a Bloom filter

To create a Bloom filter with specific settings you must use BF.RESERVE:

BF.RESERVE banned-urls 0.001 100000

This command creates a filter with a 0.1% error rate for 100K elements. The command takes a few more optional arguments about auto-scaling. Auto-scaling is a RedisBloom feature that adds more memory to the filter automatically when the capacity limit is reached. We still have to specify a target capacity because there are performance implications behind the auto-scaling feature, meaning that you want to get the initial estimate right whenever possible and rely on auto-scaling only if necessary.

Adding elements

To add elements to a Bloom filter you must use BF.ADD or BF.MADD (to add multiple elements at once):

BF.ADD crawled-urls google.com facebook.com 
Very straightforward. If you want to add multiple elements at once, take a look at BF.MADD, and if you want to skip the BF.RESERVE step, you can either configure the default size for all filters, or use BF.INSERT.

Testing set membership

To test if an element is part of the set, you must use BF.EXISTS or  BF.MEXISTS (to test multiple elements at once):

BF.MEXISTS crawled-urls google.com reddit.com

Testing membership is very fast, but this is where the auto scaling functionality can have a negative impact if overused. Every time the filter is extended, it needs to look for the item in more alternative locations. Each check still happens very quickly, but choosing a bad base size might require the filter to scale up enough times to impact the performance of this command.

Deleting elements

Bloom filters don’t support deleting an element once it’s added to a filter, but the good news is that RedisBloom also includes a Cuckoo filter, an in-place replacement for the Bloom filter that also supports item deletion.

Count-min Sketch

Count-min Sketch (CM sketch) answers item frequency questions. In some ways CM sketch is similar to HyperLogLog as they both count elements, with the crucial difference that CM sketch doesn’t count unique elements, but rather how many copies of each element you added to it.

Examples of questions that CM sketch can answer:

  • Is this user making too many requests?
  • How common is this word in the document?
  • And more generally, is this element a “heavy hitter”?

As in the previous examples, there is some imprecision involved. In the case of CM sketch, the issue is that it always overestimates the counts. The degree of overestimation depends on the options that you specify when creating the counter.

Usage

Similarly to Bloom filters, CM sketch requires you to specify some settings when creating each filter. After that it’s just a matter of adding elements to it and querying the counts. Like with HLL, CM sketch also supports merging multiple counters into one.

Here’s a quick rundown of how a CM sketch usage session looks like in redis-cli.

> CMS.INITBYPROB word-freqs 0.001 0.001
OK

> CMS.INCRBY word-freqs banana 10 pear 5 apple 17
1) (integer) 10
2) (integer) 5
3) (integer) 17

> CMS.QUERY word-freqs banana ananas
1) (integer) 10
2) (integer) 0

TopK

TopK is basically a regular heap attached to a probabilistic frequency counter like Count-min Sketch. The result is that TopK will use the frequency estimates to keep in memory only the “top K” elements, for a configurable value of K.

If CM sketch was able to tell you the frequency of a given element, TopK is also able to return the elements themselves, if they’re frequent enough. Of all the data structures described here, this is the only one able to return (some of) the elements that you put in it. As such, the size of those elements matters in terms of space usage.

Here you can find the complete list of commands supported by TopK in RedisBloom.

Here’s a quick rundown of how a TopK usage session looks like in redis-cli.

> TOPK.RESERVE best-sellers 2
OK

> TOPK.INCRBY best-sellers HP 5 LOTR 10 WITCHER 140
1) (nil)
2) (nil)
3) HP

> TOPK.INCRBY best-sellers HP 50
1) LOTR

> TOPK.COUNT best-sellers HP LOTR WITCHER 50GRAY
1) (integer) 55
2) (integer) 10
3) (integer) 140
4) (integer) 0

> TOPK.LIST best-sellers
1) HP
2) WITCHER

Next steps

To learn more about streaming analytics with probabilistic data structures, check out RedisBloom, read the documentation, and spin up a Docker container to play with it. If you’re looking for ways to integrate Redis it on your streaming analytics pipelines, take a look at Redis Streams.