Count-min sketch overview
Functions for estimating value counts using the count-min sketch data structure
Count the number of times a value appears in a column, using the probabilistic count-min sketch data structure and its associated algorithms. For applications where a small error rate is tolerable, this can result in huge savings in both CPU time and memory, especially for large datasets.
The count-min sketch produces a biased estimator of frequency. It might overestimate the item count, but it can’t underestimate. You can control the relative error and the probability that the estimate falls outside the error bounds.
Two-step aggregation
Section titled “Two-step aggregation”This group of functions uses the two-step aggregation pattern.
Rather than calculating the final result in one step, you first create an intermediate aggregate by using the aggregate function.
Then, use any of the accessors on the intermediate aggregate to calculate a final result. You can also roll up multiple intermediate aggregates with the rollup functions.
The two-step aggregation pattern has several advantages:
- More efficient because multiple accessors can reuse the same aggregate
- Easier to reason about performance, because aggregation is separate from final computation
- Easier to understand when calculations can be rolled up into larger intervals, especially in window functions and continuous aggregates
- Perform retrospective analysis even when underlying data is dropped, because the intermediate aggregate stores extra information not available in the final result
To learn more, see the blog post on two-step aggregates.
Samples
Section titled “Samples”Estimate counts for specific values
Section titled “Estimate counts for specific values”Create a count-min sketch and estimate how many times specific user IDs appear:
WITH sketch AS ( SELECT toolkit_experimental.count_min_sketch( user_id::text, 0.01, -- 1% error tolerance 0.01 -- 1% probability of exceeding error bounds ) AS cms FROM user_events)SELECT toolkit_experimental.approx_count(cms, 'user123') AS user123_count, toolkit_experimental.approx_count(cms, 'user456') AS user456_countFROM sketch;Available functions
Section titled “Available functions”Aggregate
Section titled “Aggregate”count_min_sketch(): aggregate data into a count-min sketch for approximate counting
Accessor
Section titled “Accessor”approx_count(): estimate the number of times a value appears in a count-min sketch