Aggregate Operations

Standard aggregate operations

A SELECT expression in the SQL grammar can contain one or more aggregation functions. Aggregate functions are specified using the following grammar:

aggregateCall:
      agg '(' [ ALL | DISTINCT ] value [, value ]* ')'
      [ WITHIN DISTINCT '(' expression [, expression ]* ')' ]
      [ FILTER '(' WHERE condition ')' ]
  |   agg '(' '*' ')' [ FILTER (WHERE condition) ]

where agg is one of the operators in the following table.

If FILTER is present, the aggregate function only considers rows for which condition evaluates to TRUE.

If DISTINCT is present, duplicate argument values are eliminated before being passed to the aggregate function.

If WITHIN DISTINCT is present, argument values are made distinct within each value of specified keys before being passed to the aggregate function.

Most aggregation functions produce results of the same type as the input data, but compute using higher precision intermediate data types; aggregation of UNSIGNED values uses signed types for intermediate results. If you expect the result to require a higher precision than the aggregated data type, we recommend converting the data to a wider data type, e.g.: instead of SELECT SUM(col), you should write SELECT SUM(CAST col AS BIGINT).

Aggregate	Description
`ARRAY_AGG([ ALL \| DISTINCT ] value [ RESPECT NULLS \| IGNORE NULLS ] [ORDER BY orderItem [, orderItem]*] )`	Gathers all values in an array. If `ORDER BY` is not present, the order of the values in the array is unspecified (but it is deterministic).
`AVG( [ ALL \| DISTINCT ] numeric)`	Returns the average (arithmetic mean) of numeric across all input values
`ARG_MAX(value, compared)`	Returns a `value` for one of the rows containing the maximum value of `compared` in the group. The rule for selecting the value is not specified if there are multiple rows with the same maximum value.
`ARG_MIN(value, compared)`	Returns a `value` for one of the rows containing the minimum value of `compared` in the group. This rule for selecting the value is not specified if there are multiple rows with the same minimum value.
`BIT_AND( [ ALL \| DISTINCT ] value)`	Returns the bitwise AND of all non-null input values, or null if none; integer and binary types are supported
`BIT_OR( [ ALL \| DISTINCT ] value)`	Returns the bitwise OR of all non-null input values, or null if none; integer and binary types are supported
`BIT_XOR( [ ALL \| DISTINCT ] value)`	Returns the bitwise XOR of all non-null input values, or null if none; integer and binary types are supported
`COUNT(*)`	Returns the number of input rows
`COUNT( [ ALL \| DISTINCT ] value [, value ]*)`	Returns the number of input rows for which value is not null. If the argument contains multiple expressions, it counts only expressions where all fields are non-null.
`COUNTIF( boolean )`	Returns the number of input rows for which the condition is true.
`EVERY(condition)`	Returns `TRUE` if all of the values of condition are `TRUE`
`LOGICAL_AND` or `BOOL_AND`	Same as `EVERY`
`LOGICAL_OR` or `BOOL_OR`	Same as `SOME`
`MAX( [ ALL \| DISTINCT ] value)`	Returns the maximum value of value across all input values
`MIN( [ ALL \| DISTINCT ] value)`	Returns the minimum value of value across all input values
`SOME(condition)`	Returns `TRUE` if one or more of the values of condition is `TRUE`
`SUM( [ ALL \| DISTINCT ] numeric)`	Returns the sum of numeric across all input values
`STDDEV( [ ALL \| DISTINCT ] value)`	Synonym for `STDDEV_SAMP`
`STDDEV_POP( [ ALL \| DISTINCT ] value)`	Returns the population standard deviation of numeric across all input values
`STDDEV_SAMP( [ ALL \| DISTINCT ] value)`	Returns the sample standard deviation of numeric across all input values

Comparisons like MAX, MIN, ARG_MIN, and ARG_MAX are defined for all data types, and they use the standard comparison operations.

If FILTER is specified, then only the input rows for which the filter_clause evaluates to true are fed to the aggregate function; other rows are discarded. For example:

SELECT
    count(*) AS unfiltered,
    count(*) FILTER (WHERE i < 5) AS filtered
FROM TABLE

In addition, the following two constructors act as aggregates:

Constructor	Description	Example
`ARRAY(sub-query)`	Creates an array from the result of a sub-query. If the subquery returns a tuple, the array will be an array of tuples.	`SELECT ARRAY(SELECT empno FROM emp)` or `SELECT ARRAY(SELECT empno, dept FROM emp)`
`MAP(sub-query)`	Creates a map from the result of a sub-query that returns two columns. If multiple entries have the same key, the largest value wins.	`SELECT MAP(SELECT empno, deptno FROM emp)`

Window aggregate functions

A SELECT expression in the SQL grammar can also contain a window aggregate function. The following window aggregate functions are supported:

Aggregate	Description
`AVG(`numeric`)`	Returns the average (arithmetic mean) of numeric across all values in window
`COUNT(`value [, value ]*`)`	Returns the number of rows in window for which value is not null
`COUNT(*)`	Returns the number of rows in window
`DENSE_RANK()`	Returns the rank of the current row without gaps.
`FIRST_VALUE(expression)`	Returns the value of `expression` at the first row of the window frame. Currently only supported for windows with `UNLIMITED RANGE`.
`LAG(`expression, [offset, [ default ] ]`)`	Returns expression evaluated at the row that is offset rows before the current row within the partition; if there is no such row, instead returns default. Both offset and default are evaluated with respect to the current row. If omitted, offset defaults to 1 and default to `NULL`.
`LAST_VALUE(expression)`	Returns the value of `expression` at the last row of the window frame. Currently only supported for windows with `UNLIMITED RANGE`
`LEAD(`expression, [offset, [ default ] ]`)`	Returns expression evaluated at the row that is offset rows after the current row within the partition; if there is no such row, instead returns default. Both offset and default are evaluated with respect to the current row. If omitted, offset defaults to 1 and default to `NULL`.
`MAX`(expression)	Returns the maximum value of expression across all values in window
`MIN`(expression)	Returns the minimum value of expression across all values in window
`RANK()`	Returns the rank of the current row with gaps.
`ROW_NUMBER()`	Returns the number of the current row within its partition, counting from 1.
`SUM(`numeric`)`	Returns the sum of numeric across all values in window

Pivots

The SQL PIVOT operation can be used to turn rows into columns. It usually replaces a GROUP-BY operation when the group keys are known in advance. Instead of producing one row for each group, PIVOT can produce one column for each group.

Syntax

PIVOT ( { aggregate_expression [ AS aggregate_expression_alias ] } [ , ... ]
    FOR column_with_data IN ( column_list ) )

Parameters

aggregate_expression Specifies an aggregate expression (SUM, COUNT(DISTINCT ), etc.).
aggregate_expression_alias Specifies a column name for the aggregate expression.
column_with_data A column that produces all the values that will become new column names.
column_list Columns that show the pivoted data.

Example

CREATE TABLE FURNITURE (
   type VARCHAR,
   year INTEGER,
   count INTEGER
);
INSERT INTO FURNITURE VALUES
  ('chair', 2020, 4),
  ('table', 2021, 3),
  ('chair', 2021, 4),
  ('desk', 2023, 1),
  ('table', 2023, 2);

SELECT year, type, SUM(count) FROM FURNITURE GROUP BY year,type;
year | type  | sum
-------------------
2020 | chair | 4
2021 | table | 3
2021 | chair | 4
2023 | desk  | 1
2023 | table | 2
(5 rows)

SELECT * FROM FURNITURE
PIVOT (
    SUM(count) AS ct
    FOR type IN ('desk' AS desks, 'table' AS tables, 'chair' as chairs)
);

year | desks | tables | chairs
------------------------------
2020 |       |        |    4
2021 |       |     3  |    4
2023 |     1 |     2  |
(3 rows)

Notice how the same information is presented in a tabular form where we have a column for each type of object. PIVOTs require all the possible "type"s to be specified when the query is written. Notice that if we add an additional type, the GROUP BY query will produce a correct result, while the PIVOT query will produce the same result.

INSERT INTO FURNITURE VALUES ('bed', 2020, 5);
SELECT year, type, SUM(count) FROM FURNITURE GROUP BY year,type;
year | type  | sum
-------------------
2020 | chair | 4
2020 | bed   | 5
2021 | table | 3
2021 | chair | 4
2023 | desk  | 1
2023 | table | 2
(6 rows)

SELECT * FROM FURNITURE
PIVOT (
    SUM(count) AS ct
    FOR type IN ('desk' AS desks, 'table' AS tables, 'chair' as chairs)
);

year | desks | tables | chairs
------------------------------
2020 |       |        |    4
2021 |       |     3  |    4
2023 |     1 |     2  |
(3 rows)

On the efficiency of aggregates computations

Computing aggregates incrementally is very different from standard aggregate evaluation in typical SQL engines. Some of the observations in this section pertain to the current state of the implementation, and may change in the future as the implementation improves. Let us assume that the size of the collection aggregated is N, the size of the current change is D, the total number of groups is G, and the total number of elements in the modified groups is M. Always N > D, and N > M > G.

All aggregation functions need to store the result of the aggregation internally -- one value per group, so their space overhead is at least O(G), but it may be more.

Window aggregates

Window aggregates (e.g., using OVER) are incrementally evaluated for each window which changes when a new change is ingested, but are otherwise insensitive to the choice of aggregation function or the data type.

Window aggregation functions need to store the entire collection that is being aggregated -- the space overhead is thus O(N). The work performed is expected to be O(D log N).

The window aggregate functions RANK, DENSE_RANK, and ROW_NUMBER may be very expensive to evaluate incrementally, because it's possible for a very small input change to produce a very large output change: inserting or deleting a single row can change the numbering of all subsequent rows in the same group. These functions can have a reasonable cost in three circumstances:

each modified group (created by PARTITION BY) is relatively small in size
new insertions and deletions feature rows that appear towards the end of the order produced by the ORDER BY clause
they are used in a TopK pattern with a small limit. The topK is expressed in SQL with the following structure:

SELECT * FROM (
   SELECT empno,
          row_number() OVER (ORDER BY empno) rn
   FROM empsalary) emp
WHERE rn < 3

Window aggregation functions involving ROWS BETWEEN are implemented behind the scenes by computing ROW_NUMBER for each PARTITION BY group, so they inherit the cost of ROW_NUMBER as described above.

`DISTINCT`

The DISTINCT operation can be used with an aggregation or in a SELECT statement; in both cases the cost of DISTINCT is O(N) in space and O(D log M) in work.

Linear aggregation functions

A linear aggregation function can compute the change in an aggregate only by looking at the new change -- irrespective of the previous value of the aggregate. Linear aggregation functions comprise:

COUNT
SUM for all integer, unsigned, and DECIMAL data types
AVG for all integer, unsigned, and DECIMAL data types
STDDEV, STDDEV_SAMP, STDDEV_POP for all integer, unsigned, and DECIMAL data types

The space overhead for linear functions is O(G). The work performed for each change is O(D).

Non-linear aggregation functions

Using a FILTER with an aggregation function in general makes it non-linear, so using WHERE is preferred to using FILTER. Sometimes the compiler can automatically decompose such an aggregate into a filter followed by a standard aggregate.

COUNTIF is the same as COUNT ... FILTER.

The following functions are non-linear, and require O(N) space and O(M) work:

BIT_OR, BIT_XOR, BIT_AND

Any of the functions listed above as linear are actually non-linear when applied to DOUBLE or FLOAT values.

Efficient aggregation functions

The following aggregation functions require O(N) space but perform only O(D log M) work.

MAX, MIN, ARG_MAX, ARG_MIN
LOGICAL_AND, BOOL_AND, LOGICAL_OR, BOOL_OR, EVERY, SOME

Append only collections

Some aggregates can have more efficient implementations when applied to append-only collections. A table property can be used to indicate whether a table is append-only. Operations such as SELECT, WHERE, JOIN, UNNEST applied to append-only collections produce append-only results. Note the results of aggregation are essentially never append-only.

MAX, MIN, ARG_MAX, ARG_MIN are significantly more efficient for append-only collections. They require O(G) space and O(D) work.

Expensive aggregation functions

ARRAY_AGG is very expensive, both in terms of space and time. Space cost is O(N), while work performed is proportional O(M).
The two constructors ARRAY and MAP with subqueries as arguments have similar costs.

Standard aggregate operations​

Window aggregate functions​

Pivots​

Syntax​

Parameters​

Example​

On the efficiency of aggregates computations​

Window aggregates​

DISTINCT​

Linear aggregation functions​

Non-linear aggregation functions​

Efficient aggregation functions​

Append only collections​

Expensive aggregation functions​