How to Get Started with Probabilistic Data Structures: Count-Min Sketch

Ariel Shtul by Ariel Shtul

Count-min sketch (also called CM sketch) is a probabilistic data structure that’s extremely useful once you grasp how it works and, more importantly, how to use it.

Fortunately, CM sketch’s simple characteristics make it relatively easy for novices to understand (turns out many of my friends were unable to follow along with last month’s Top-K blog).

CM sketch has been a Redis module for several years and was recently rewritten as part of the RedisBloom module v2.0. But before we dive into CM sketch, it is important to understand why you’d use any probabilistic data structure. In the triangle of speed, space, and accuracy, probabilistic data structures sacrifice some accuracy to gain space—potentially a lot of space! The effect on speed varies based on algorithms and set sizes.

  • Accuracy: By definition, keeping only part of your data and allowing collisions in storage reduces accuracy. Yet you can set the maximum error rate based on your use case.
  • Memory space: In a world of big data where billions of events are recorded, storing only partial data can tremendously reduce storage space requirements and costs.
  • Speed: Certain traditional data structures operate relatively inefficiently, slowing response time. (For example, a Sorted Set maintains the order of all elements in it, but you might be interested in just the top elements. Because probabilistic algorithms maintain only a partial list, they are more efficient and can often answer queries much faster.

The right probabilistic data structure allows you to keep only part of the information in your dataset in exchange for reduced accuracy. Of course, in many cases (bank accounts, for example), reduced accuracy is unacceptable. But for recommending a movie or showing ads to website users, the cost of a relatively rare mistake is low and the space savings could be substantial.

Basically, CM sketch works by aggregating the count of all items in your dataset into several counter arrays. Upon a query, it returns the item’s minimum count of all arrays, which promises to minimize the count inflation caused by collisions. Items with a low appearance rate or score (“mouse flows”) have a count below the error rate, so you lose any data about their real count and they are treated as noise. Items with a high appearance rate or score (“elephant flows”), simply use the count received. Considering CM sketch’s size is constant and it can be used for an infinite number of items, you can see the potential for huge savings in storage space.

For background, sketches are a class of data structures and their accompanying algorithms. They capture the nature of your data in order to answer questions about it, while using constant or sublinear space. CM sketch was described by Graham Cormode and S. Muthu Muthukrishnan in a 2005 paper called An Improved Data Stream Summary: The Count-Min Sketch and its Applications.”

CM sketch: What and how

CM sketch uses several arrays of counters to support its primary use cases. Let’s call the number of arrays “depth” and the number of counters in each array “width.”

Whenever we receive an item, we use a hash function (which turns an element—a word, sentence, number, or binary—into a number that can be used as a location in the set/array or as a fingerprint)  to calculate the location of the item and increase that counter for each array. Each of the associated counters has a value equal to or higher than  the real value of the item. When we make an inquiry, we go through all the arrays with the same hash functions and fetch the counter associated with our item. We then return the minimum value we encountered, since we know our values are inflated (or equal).

Even though we know that many items contribute to most counters, because of natural collisions (when different items receive the same location), we accept the ‘noise’ as long as it remains within our desired error rate.

Example

The math dictates that with a depth of 10 and a width of 2,000, the probability of not having an error is 99.9% and the error rate is 0.1%. (This is the percentage of total increments, not of unique items).

In real numbers, that means if you add 1 million unique items, on average, each item will receive a value of 500 (1M/2K). Though that seems disproportionate, this falls well within our error rate of 0.1%, which is 1,000 on 1 million items.

Similarly, if 10 elephants appear 10,000 times each, their value on all sets would be 10,000 or more. Whenever we count them in the future, we’d see an elephant in front of us. For all other numbers (i.e., all mice whose real count is 1), they are unlikely to collide with an elephant on all sets (since CM sketch considers only the minimum count) since the probability of this happening is slim and diminishes further if you increase the depth as you initialized CM sketch.

What is CM sketch good for?

Now that we understand the behavior of CM sketch, what can we do with this little beast? Here are some common use cases:

  • Query two numbers and compare their count.
  • Set a percentage of incoming items, let’s say 1%. Whenever an item’s min-count is higher than 1% of the total count, we return true. This can be used to determine the top player of an online game, for example.
  • Add a min-heap to CM sketch and create a Top-K data structure. Whenever we increase an item, we check if the new min-count is higher than the minimum in the heap and update it accordingly. Unlike the Top-K module in RedisBloom, which decays gradually over time, CM sketch never forgets and therefore its behavior is slightly different than the HeavyKeeper-based Top-K.

In RedisBloom, the API for CM sketch is simple and easy:

  • To initialize a CM sketch data structure: INITBYDIM {key} {width} {depth} or CMS.INITBYPROB {key} {error} {probability}
  • To increase the counters of an item: CMS.INCRBY {key} {item} {increment}
  • To get the minimum count found in the item’s counters: CMS.QUERY {key} {item}

The following commands were used to create the animated example at the top of this post:

As you can see, the value of ‘Redis’ is 4 instead of 3. This behavior is expected, since in CM sketch, the count of an item is likely to be inflated.

If you enjoyed this post or want to know more about probabilistic data structures, please feel free to contact me.

[Abacus photo by Crissy Jarvis on Unsplash.]