Skip to content

Count-min sketch overview

Functions for estimating value counts using the count-min sketch data structure

Count the number of times a value appears in a column, using the probabilistic count-min sketch data structure and its associated algorithms. For applications where a small error rate is tolerable, this can result in huge savings in both CPU time and memory, especially for large datasets.

The count-min sketch produces a biased estimator of frequency. It might overestimate the item count, but it can’t underestimate. You can control the relative error and the probability that the estimate falls outside the error bounds.

This group of functions uses the two-step aggregation pattern.

Rather than calculating the final result in one step, you first create an intermediate aggregate by using the aggregate function.

Then, use any of the accessors on the intermediate aggregate to calculate a final result. You can also roll up multiple intermediate aggregates with the rollup functions.

The two-step aggregation pattern has several advantages:

  1. More efficient because multiple accessors can reuse the same aggregate
  2. Easier to reason about performance, because aggregation is separate from final computation
  3. Easier to understand when calculations can be rolled up into larger intervals, especially in window functions and continuous aggregates
  4. Perform retrospective analysis even when underlying data is dropped, because the intermediate aggregate stores extra information not available in the final result

To learn more, see the blog post on two-step aggregates.

Create a count-min sketch and estimate how many times specific user IDs appear:

WITH sketch AS (
SELECT toolkit_experimental.count_min_sketch(
user_id::text,
0.01, -- 1% error tolerance
0.01 -- 1% probability of exceeding error bounds
) AS cms
FROM user_events
)
SELECT
toolkit_experimental.approx_count(cms, 'user123') AS user123_count,
toolkit_experimental.approx_count(cms, 'user456') AS user456_count
FROM sketch;
  • approx_count(): estimate the number of times a value appears in a count-min sketch