DateModifiedViewsCategory

Common table expression (CTE) helps organize your code and make it more efficient utilizing the same data multiple times. It is how it works in relational database management systems (RDBMSs). Trino handles CTEs differently because it is a query engine and it can't store temporary named result sets. CTEs are inlined before sending for execution by Trino. This limitation makes usage of CTE in Trino inefficient. Moreover, CTE can produce incorrect results as named result set is non-deterministic and the results may be different each time.

Instead of using a CTE in Trino, it can be created a temporary table. This way, you make your code more efficient retrieving data only once if a named result set is going to be used more than one time in your CTE statement. Otherwise, Trino executes the same named result set as many times as the named result set is referred in your statement.

Some products rely on CTEs heavily, for example, dbt and it impacts on usage of those products with Trino.

CTE explanation

filtered_cte and deduped_cte are inlined when executed by Trino.

Submitted to Trino statement

WITH 
filtered_cte AS (
    SELECT 
        region
        ,city
        ,population
        ,create_date
    FROM  
        data_source   
    WHERE  
        create_date>timestamp '2024-12-31')
deduped_cte AS (
    SELECT 
        region
        ,city
        ,population
        ,ROW_NUMBER() OVER(PARTITION BY region,city ORDER BY create_date DESC) AS dedupe_group
    FROM filtered_CTE)
SELECT
    *
FROM 
    deduped_cte
WHERE 
    dedupe_group = 1;

Executed Trino statement

SELECT
    *
FROM 
    (SELECT 
        region
        ,city
        ,population
        ,ROW_NUMBER() OVER(PARTITION BY region,city ORDER BY create_date DESC) AS dedupe_group
    FROM 
        (SELECT 
            region
            ,city
            ,population
            ,create_date
        FROM  
            data_source   
        WHERE  
            create_date>timestamp '2024-12-31') a
    ) b
WHERE 
    dedupe_group = 1;

CTE inefficiency

filtered_cte named result set is executed 2 times by Trino.

Submitted to Trino statement

WITH 
filter_cte AS (
    SELECT 
        region_id
        ,return_id
    FROM 
        data_source   
    WHERE  
        invoice_status='completed'
calc_cte1 AS (
    SELECT 
        region_id
        ,MAX(return_id) AS return_id
        ,COUNT(*) AS region_count
    FROM 
        filter_CTE
    GROUP BY
        region_id),
calc_cte2 AS (
    SELECT 
        return_id
        ,MAX(region_id) AS region_id
        ,COUNT(*) AS return_count
    FROM 
        filter_CTE
    GROUP BY
        return_id),
SELECT
    c1.region_id
    ,c1.region_count
    ,c2.return_id
    ,c2.return_count
FROM 
    calc_cte1 c1 
JOIN calc_cte2 c2 ON 
    c1.region_id=c2.region_id
    AND c1.return_id=c2.return_id;

Executed Trino statement

SELECT
    c1.region_id
    ,c1.region_count
    ,c2.return_id
    ,c2.return_count
FROM 
    (SELECT 
        region_id
        ,MAX(return_id) AS return_id
        ,COUNT(*) AS return_count
    FROM 
        (SELECT 
            region_id
            ,return_id 
        FROM
            data_source   
        WHERE
            invoice_status='completed') filter_cte
    GROUP BY 
        region_id) c1 
JOIN 
    (SELECT 
        return_id
        ,MAX(region_id) AS region_id
        ,COUNT(*) AS return_count
    FROM 
        (SELECT 
            region_id
            ,return_id 
        FROM  
            data_source   
        WHERE 
            invoice_status='completed') filter_cte
    GROUP BY
        return_id) c2 
    ON c1.region_id=c2.region_id
       AND c1.return_id=c2.return_id;

CTE with incorrect results

id column calculated 2 times, so it can't be used as a join column.

Submitted to Trino statement

WITH 
cte1 AS (
    SELECT
        region_id
        ,return_id
        ,CAST(UUID() AS VARCHAR) AS id
    FROM
        data_source),
cte21 AS (
    SELECT
        region_id
        ,id
    FROM
        cte1),
cte22 AS (
    SELECT
        return_id
        ,id
    FROM
        cte1)
SELECT
    cte21region_id
    ,cte22.return_id
    ,cte21.id
FROM
    cte21
JOIN
    cte22 
    ON cte21.id=cte22.id;

Executed by Trino statement

SELECT
    cte21.region_id
    ,cte22.return_id
    ,cte21.id
FROM
    (SELECT
        region_id
        ,id
    FROM
        (SELECT
            region_id
            ,return_id
            ,CAST(UUID() AS VARCHAR) AS id
        FROM
            data_source) cte1
    ) cte21
JOIN
    (SELECT
        return_id
        ,id
    FROM
        (SELECT
            region_id
            ,return_id
            ,CAST(UUID() AS VARCHAR) AS id
        FROM
            data_source) cte1
    ) cte22
    ON cte21.id=cte22.id;

Resources


Comments

comments powered by Disqus