stats_agg (two variables) overview

Statistical analysis and linear regression functions for two-dimensional data

Perform linear regression analysis, for example to calculate correlation coefficient and covariance, on two-dimensional data. You can also calculate common statistics, such as average and standard deviation, on each dimension separately. These functions are similar to the PostgreSQL statistical aggregates, but they include more features and are easier to use in continuous aggregates and window functions. The linear regressions are based on the standard least-squares fitting method.

These functions work on two-dimensional data. To work with one-dimensional data, for example to calculate the average and standard deviation of a single variable, see the one-dimensional stats_agg functions.

Two-step aggregation

This group of functions uses the two-step aggregation pattern.

Rather than calculating the final result in one step, you first create an intermediate aggregate by using the aggregate function.

Then, use any of the accessors on the intermediate aggregate to calculate a final result. You can also roll up multiple intermediate aggregates with the rollup functions.

The two-step aggregation pattern has several advantages:

More efficient because multiple accessors can reuse the same aggregate
Easier to reason about performance, because aggregation is separate from final computation
Easier to understand when calculations can be rolled up into larger intervals, especially in window functions and continuous aggregates
Perform retrospective analysis even when underlying data is dropped, because the intermediate aggregate stores extra information not available in the final result

To learn more, see the blog post on two-step aggregates.

Samples

Calculate regression and statistical properties

Create a statistical aggregate that summarizes daily statistical data about two variables, val2 and val1, where val2 is the dependent variable and val1 is the independent variable. Use the statistical aggregate to calculate the average of the dependent variable and the slope of the linear-regression fit:

WITH t AS (
    SELECT
        time_bucket('1 day'::interval, ts) AS dt,
        stats_agg(val2, val1) AS stats2D
    FROM foo
    WHERE id = 'bar'
    GROUP BY time_bucket('1 day'::interval, ts)
)
SELECT
    average_x(stats2D),
    slope(stats2D)
FROM t;

Available functions

Aggregate

stats_agg(): aggregate data into an intermediate statistical aggregate form for further calculation

Accessors for y variable statistics

average_y(): calculate the average of the dependent variable from a statistical aggregate
stddev_y(): calculate the standard deviation of the dependent variable from a statistical aggregate
variance_y(): calculate the variance of the dependent variable from a statistical aggregate
skewness_y(): calculate the skewness of the dependent variable from a statistical aggregate
kurtosis_y(): calculate the kurtosis of the dependent variable from a statistical aggregate
sum_y(): calculate the sum of the dependent variable from a statistical aggregate

Accessors for regression analysis

corr(): calculate the correlation coefficient from a statistical aggregate
covariance(): calculate the covariance from a statistical aggregate
determination_coeff(): calculate the coefficient of determination (R²) from a statistical aggregate
slope(): calculate the slope of the linear regression line from a statistical aggregate
intercept(): calculate the y-intercept of the linear regression line from a statistical aggregate
x_intercept(): calculate the x-intercept of the linear regression line from a statistical aggregate

Accessors for aggregate information

num_vals(): get the number of values contained in a statistical aggregate

Rollup

rollup(): combine multiple two-dimensional statistical aggregates

Mutator

rolling(): create a rolling window aggregate for use in window functions