Aggregate Operations
Standard aggregate operations
A SELECT expression in the SQL grammar can contain one
or more aggregation functions. Aggregate functions are specified
using the following grammar:
aggregateCall:
agg '(' [ ALL | DISTINCT ] value [, value ]* ')'
[ WITHIN DISTINCT '(' expression [, expression ]* ')' ]
[ FILTER '(' WHERE condition ')' ]
| agg '(' '*' ')' [ FILTER (WHERE condition) ]
where agg is one of the operators in the following table.
If FILTER is present, the aggregate function only considers rows for
which condition evaluates to TRUE.
If DISTINCT is present, duplicate argument values are eliminated
before being passed to the aggregate function.
If WITHIN DISTINCT is present, argument values are made distinct
within each value of specified keys before being passed to the
aggregate function.
Most aggregation functions produce results of the same type as the
input data, but compute using higher precision intermediate data
types; aggregation of UNSIGNED values uses signed types for
intermediate results. If you expect the result to require a higher
precision than the aggregated data type, we recommend converting the
data to a wider data type, e.g.: instead of SELECT SUM(col), you
should write SELECT SUM(CAST col AS BIGINT).
Comparisons like MAX, MIN, ARG_MIN, and ARG_MAX are defined
for all data types, and they use the standard comparison
operations.
If FILTER is specified, then only the input rows for which the
filter_clause evaluates to true are fed to the aggregate function;
other rows are discarded. For example:
SELECT
count(*) AS unfiltered,
count(*) FILTER (WHERE i < 5) AS filtered
FROM TABLE
In addition, the following two constructors act as aggregates:
| Constructor | Description | Example |
|---|---|---|
ARRAY(sub-query) | Creates an array from the result of a sub-query. If the subquery returns a tuple, the array will be an array of tuples. | SELECT ARRAY(SELECT empno FROM emp) or SELECT ARRAY(SELECT empno, dept FROM emp) |
MAP(sub-query) | Creates a map from the result of a sub-query that returns two columns. If multiple entries have the same key, the largest value wins. | SELECT MAP(SELECT empno, deptno FROM emp) |
Window aggregate functions
A SELECT expression in the SQL grammar can also
contain a window aggregate function.
The following window aggregate functions are supported:
Currently, the window aggregate functions RANK, DENSE_RANK and
ROW_NUMBER are only supported if the compiler detects that they are
being used to implement a TopK pattern. This pattern is expressed in
SQL with the following structure:
SELECT * FROM (
SELECT empno,
row_number() OVER (ORDER BY empno) rn
FROM empsalary) emp
WHERE rn < 3
Pivots
The SQL PIVOT operation can be used to turn rows into columns. It
usually replaces a GROUP-BY operation when the group keys are known
in advance. Instead of producing one row for each group, PIVOT can
produce one column for each group.
Syntax
PIVOT ( { aggregate_expression [ AS aggregate_expression_alias ] } [ , ... ]
FOR column_with_data IN ( column_list ) )
Parameters
-
aggregate_expression Specifies an aggregate expression (
SUM,COUNT(DISTINCT ), etc.). -
aggregate_expression_alias Specifies a column name for the aggregate expression.
-
column_with_data A column that produces all the values that will become new column names.
-
column_list Columns that show the pivoted data.
Example
CREATE TABLE FURNITURE (
type VARCHAR,
year INTEGER,
count INTEGER
);
INSERT INTO FURNITURE VALUES
('chair', 2020, 4),
('table', 2021, 3),
('chair', 2021, 4),
('desk', 2023, 1),
('table', 2023, 2);
SELECT year, type, SUM(count) FROM FURNITURE GROUP BY year,type;
year | type | sum
-------------------
2020 | chair | 4
2021 | table | 3
2021 | chair | 4
2023 | desk | 1
2023 | table | 2
(5 rows)
SELECT * FROM FURNITURE
PIVOT (
SUM(count) AS ct
FOR type IN ('desk' AS desks, 'table' AS tables, 'chair' as chairs)
);
year | desks | tables | chairs
------------------------------
2020 | | | 4
2021 | | 3 | 4
2023 | 1 | 2 |
(3 rows)
Notice how the same information is presented in a tabular form where
we have a column for each type of object. PIVOTs require all the
possible "type"s to be specified when the query is written. Notice
that if we add an additional type, the GROUP BY query will produce a
correct result, while the PIVOT query will produce the same result.
INSERT INTO FURNITURE VALUES ('bed', 2020, 5);
SELECT year, type, SUM(count) FROM FURNITURE GROUP BY year,type;
year | type | sum
-------------------
2020 | chair | 4
2020 | bed | 5
2021 | table | 3
2021 | chair | 4
2023 | desk | 1
2023 | table | 2
(6 rows)
SELECT * FROM FURNITURE
PIVOT (
SUM(count) AS ct
FOR type IN ('desk' AS desks, 'table' AS tables, 'chair' as chairs)
);
year | desks | tables | chairs
------------------------------
2020 | | | 4
2021 | | 3 | 4
2023 | 1 | 2 |
(3 rows)
On the efficiency of aggregates computations
Computing aggregates incrementally is very different from standard aggregate evaluation in typical SQL engines. Some of the observations in this section pertain to the current state of the implementation, and may change in the future as the implementation improves. Let us assume that the size of the collection aggregated is N, the size of the current change is D, the total number of groups is G, and the total number of elements in the modified groups is M. Always N > D, and N > M > G.
All aggregation functions need to store the result of the aggregation internally -- one value per group, so their space overhead is at least O(G), but it may be more.
Window aggregates
Window aggregates (e.g., using OVER) are incrementally evaluated for
each window which changes when a new change is ingested, but are
otherwise insensitive to the choice of aggregation function or the
data type.
Window aggregation functions need to store the entire collection that is being aggregated -- the space overhead is thus O(N). The work performed is expected to be O(D log N).
DISTINCT
The DISTINCT operation can be used with an aggregation or in a
SELECT statement; in both cases the cost of DISTINCT is O(N) in
space and O(D log M) in work.
Linear aggregation functions
A linear aggregation function can compute the change in an aggregate only by looking at the new change -- irrespective of the previous value of the aggregate. Linear aggregation functions comprise:
COUNTSUMfor all integer, unsigned, andDECIMALdata typesAVGfor all integer, unsigned, andDECIMALdata typesSTDDEV,STDDEV_SAMP,STDDEV_POPfor all integer, unsigned, andDECIMALdata types (note: our current implementation ofSTDDEVis not as efficient as possible)
The space overhead for linear functions is O(G). The work performed for each change is O(D).
Non-linear aggregation functions
Using a FILTER with an aggregation function in general makes it
non-linear, so using WHERE is preferred to using FILTER.
Sometimes the compiler can automatically decompose such an aggregate
into a filter followed by a standard aggregate.
COUNTIF is the same as COUNT ... FILTER.
The following functions are non-linear, and require O(N) space and O(M) work:
BIT_OR,BIT_XOR,BIT_AND
Any of the functions listed above as linear are actually non-linear
when applied to DOUBLE or FLOAT values.
Efficient aggregation functions
The following aggregation functions require O(N) space but perform only O(D log M) work.
MAX,MIN,ARG_MAX,ARG_MINLOGICAL_AND,BOOL_AND,LOGICAL_OR,BOOL_OR,EVERY,SOME
Append only collections
Some aggregates can have more efficient implementations when applied to append-only collections. A table property can be used to indicate whether a table is append-only. Operations such as SELECT, WHERE, JOIN, UNNEST applied to append-only collections produce append-only results. Note the results of aggregation are essentially never append-only.
MAX,MIN,ARG_MAX,ARG_MINare significantly more efficient for append-only collections. They require O(G) space and O(D) work.
Expensive aggregation functions
-
ARRAY_AGGis very expensive, both in terms of space and time. Space cost is O(N), while work performed is proportional O(M). -
The two constructors
ARRAYandMAPwith subqueries as arguments have similar costs.