Heartbeat aggregation overview
Determine system liveness from timestamped heartbeats with heartbeat_agg functions
Determine the overall liveness of a system from a series of timestamped heartbeats and a liveness interval. Use this aggregate to report total uptime or downtime, or to list the time ranges during which the system was live or dead.
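To make the idea concrete, here is a minimal Python sketch of how live ranges and uptime could be derived from heartbeats. It is not the Toolkit implementation. It assumes that each heartbeat keeps the system live for one liveness interval after its timestamp and that ranges are clipped to the aggregated window. The names `live_ranges` and `uptime` are hypothetical helpers that only echo the accessor names.

```python
from datetime import datetime, timedelta


def live_ranges(heartbeats, start, end, liveness):
    """Merge heartbeats into live ranges clipped to [start, end).

    Assumption: a heartbeat at time t marks the system live for
    [t, t + liveness).
    """
    ranges = []
    for beat in sorted(heartbeats):
        lo = max(beat, start)
        hi = min(beat + liveness, end)
        if lo >= hi:
            continue  # this heartbeat covers nothing inside the window
        if ranges and lo <= ranges[-1][1]:
            # Overlaps or touches the previous range, so extend it.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], hi))
        else:
            ranges.append((lo, hi))
    return ranges


def uptime(ranges):
    """Total time covered by the live ranges."""
    return sum((hi - lo for lo, hi in ranges), timedelta())


start = datetime(2024, 1, 1, 0, 0)
end = start + timedelta(hours=1)
beats = [start + timedelta(minutes=m) for m in (0, 5, 10, 30, 35)]

ranges = live_ranges(beats, start, end, timedelta(minutes=10))
total_up = uptime(ranges)               # 35 minutes
total_down = (end - start) - total_up   # 25 minutes
```

In this example the heartbeats at minutes 0, 5, and 10 merge into one live range, and the heartbeats at minutes 30 and 35 merge into a second one, leaving a gap between them.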
You can combine multiple heartbeat aggregates to determine the overall health of a service. For example, the heartbeat aggregates from a primary and standby server could be combined to see if there was ever a window where both machines were down at the same time.
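The primary and standby scenario can be sketched in Python as a union of live ranges followed by a search for gaps. This is only an illustration of the idea, not how the rollup is implemented. Times are plain minute offsets for brevity, and `merge_ranges` and `dead_ranges` are hypothetical helper names.

```python
def merge_ranges(*range_lists):
    """Union several lists of (start, end) live ranges."""
    merged = []
    for lo, hi in sorted(r for ranges in range_lists for r in ranges):
        if merged and lo <= merged[-1][1]:
            # Overlaps or touches the previous range, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def dead_ranges(live, start, end):
    """Gaps in [start, end) that no live range covers."""
    gaps, cursor = [], start
    for lo, hi in live:
        if lo > cursor:
            gaps.append((cursor, lo))
        cursor = max(cursor, hi)
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


# Live ranges per server, as minute offsets within a 60-minute window.
primary = [(0, 20), (30, 60)]
standby = [(10, 25), (28, 40)]

service_live = merge_ranges(primary, standby)  # [(0, 25), (28, 60)]
both_down = dead_ranges(service_live, 0, 60)   # [(25, 28)]
```

The service counts as live whenever at least one server is live, so the only window with both machines down is the gap that remains after the union.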
Two-step aggregation
This group of functions uses the two-step aggregation pattern.
Rather than calculating the final result in one step, you first create an intermediate aggregate by using the aggregate function.
Then, use any of the accessors on the intermediate aggregate to calculate a final result. You can also roll up multiple intermediate aggregates with the rollup functions.
The two-step aggregation pattern has several advantages:
- More efficient because multiple accessors can reuse the same aggregate
- Easier to reason about performance, because aggregation is separate from final computation
- Easier to understand when calculations can be rolled up into larger intervals, especially in window functions and continuous aggregates
- Can perform retrospective analysis even when underlying data is dropped, because the intermediate aggregate stores extra information not available in the final result
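As a rough illustration of the pattern, the Python sketch below builds one intermediate summary and then answers several questions from it without rescanning the raw heartbeats. The `HeartbeatSummary` class is hypothetical. Its method names mirror the accessors for readability only, and times are minute offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatSummary:
    """Hypothetical intermediate aggregate: a window plus its live ranges."""
    start: int
    end: int
    live: tuple  # sorted, non-overlapping (start, end) pairs

    def uptime(self):
        return sum(hi - lo for lo, hi in self.live)

    def downtime(self):
        return (self.end - self.start) - self.uptime()

    def num_live_ranges(self):
        return len(self.live)

    def live_at(self, t):
        return any(lo <= t < hi for lo, hi in self.live)


# Step 1: the intermediate form is built once (given directly here).
summary = HeartbeatSummary(start=0, end=60, live=((0, 20), (30, 45)))

# Step 2: several accessors reuse the same intermediate value.
report = {
    "uptime": summary.uptime(),
    "downtime": summary.downtime(),
    "live_ranges": summary.num_live_ranges(),
    "live_at_25": summary.live_at(25),
}
```

Because the summary keeps the live ranges rather than a single number, new questions can be answered later even if the raw heartbeats are no longer available.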
To learn more, see the blog post on two-step aggregates.
Functions in this group
Aggregate
- heartbeat_agg(): aggregate heartbeat data into an intermediate form for further computation
Accessors
- uptime(): get the total uptime from the aggregate
- downtime(): get the total downtime from the aggregate
- interpolated_uptime(): get the total uptime, interpolating values at the boundary
- interpolated_downtime(): get the total downtime, interpolating values at the boundary
- live_at(): determine if the system was live at a given time
- live_ranges(): get all time ranges when the system was live
- dead_ranges(): get all time ranges when the system was dead
- num_live_ranges(): count the number of live ranges
- num_gaps(): count the number of gaps (downtime periods)
- trim_to(): trim the aggregate to a specific time range
Mutator
- interpolate(): interpolate the state at interval boundaries
Rollup
- rollup(): combine multiple intermediate aggregates