Tiger Data cookbook
Browse code examples from the Tiger community that resolve common issues and provide solutions for non-standard tasks
The Tiger Data Community shared these snippets for odd problems and non-standard setups. Treat them as patterns to adapt, not as fully supported product recipes.
Prerequisites for this tutorial
To follow the procedure on this page you need to:
-
Create a target Tiger Cloud service.
This procedure also works for self-hosted TimescaleDB.
Hypertable tutorials
This section contains short tutorials about hypertables.
Remove duplicates from an existing hypertable
Looking to remove duplicates from an existing hypertable? One method is to run a PARTITION BY query to get
ROW_NUMBER() and then the ctid of rows where row_number>1. You then delete these rows. However,
you need to check tableoid and ctid. This is because ctid is not unique and might be duplicated in
different chunks. The following code example took 17 hours to process a table with 40 million rows:
CREATE OR REPLACE FUNCTION deduplicate_chunks(ht_name TEXT, partition_columns TEXT, bot_id INT DEFAULT NULL) RETURNS TABLE ( chunk_schema name, chunk_name name, deleted_count INT )AS$$DECLARE chunk RECORD; where_clause TEXT := ''; deleted_count INT;BEGIN IF bot_id IS NOT NULL THEN where_clause := FORMAT('WHERE bot_id = %s', bot_id); END IF;
FOR chunk IN SELECT c.chunk_schema, c.chunk_name FROM timescaledb_information.chunks c WHERE c.hypertable_name = ht_name LOOP EXECUTE FORMAT(' WITH cte AS ( SELECT ctid, ROW_NUMBER() OVER (PARTITION BY %s ORDER BY %s ASC) AS row_num, * FROM %I.%I %s ) DELETE FROM %I.%I WHERE ctid IN ( SELECT ctid FROM cte WHERE row_num > 1 ) RETURNING 1; ', partition_columns, partition_columns, chunk.chunk_schema, chunk.chunk_name, where_clause, chunk.chunk_schema, chunk.chunk_name) INTO deleted_count;
RETURN QUERY SELECT chunk.chunk_schema, chunk.chunk_name, COALESCE(deleted_count, 0); END LOOP;END$$ LANGUAGE plpgsql;
SELECT *FROM deduplicate_chunks('nudge_events', 'bot_id, session_id, nudge_id, time', 2540);Shoutout to Mathias Ose and Christopher Piggott for this tutorial.
Get faster JOIN queries with Common Table Expressions
Imagine there is a query that joins a hypertable to another table on a shared key:
SELECT timestamp, FROM hypertable as h JOIN related_table as rt ON rt.id = h.related_table_id WHERE h.timestamp BETWEEN '2024-10-10 00:00:00' AND '2024-10-17 00:00:00'If you run EXPLAIN on this query, you see that the query planner performs a NestedJoin between these two tables, which means querying the hypertable multiple times. Even if the hypertable is well indexed, if it is also large, the query will be slow. How do you force a once-only lookup? Use materialized Common Table Expressions (CTEs).
If you split the query into two parts using CTEs, you can materialize the hypertable lookup and force PostgreSQL to perform it only once.
WITH cached_query AS materialized ( SELECT * FROM hypertable WHERE BETWEEN '2024-10-10 00:00:00' AND '2024-10-17 00:00:00') SELECT * FROM cached_query as c JOIN related_table as rt ON rt.id = h.related_table_idNow if you run EXPLAIN once again, you see that this query performs only one lookup. Depending on the size of your hypertable, this could result in a multi-hour query taking mere seconds.
Shoutout to Rowan Molony for this tutorial.
IoT tutorials
This section contains short tutorials for common IoT scenarios:
Work with columnar IoT data
Narrow and medium width tables are a great way to store IoT data. A lot of reasons are outlined in Designing Your Database Schema: Wide vs. Narrow PostgreSQL Tables.
One of the key advantages of narrow tables is that the schema does not have to change when you add new sensors. Another big advantage is that each sensor can sample at different rates and times. This helps support things like hysteresis, where new values are written infrequently unless the value changes by a certain amount.
Narrow table format example
Working with narrow table data structures presents a few challenges. In the IoT world one concern is that many data analysis approaches - including machine learning as well as more traditional data analysis - require that your data is resampled and synchronized to a common time basis. Fortunately, TimescaleDB provides you with hyperfunctions and other tools to help you work with this data.
An example of a narrow table format is:
| ts | sensor_id | value |
|---|---|---|
| 2024-10-31 11:17:30.000 | 1007 | 23.45 |
Typically you would couple this with a sensor table:
| sensor_id | sensor_name | units |
|---|---|---|
| 1007 | temperature | degreesC |
| 1012 | heat_mode | on/off |
| 1013 | cooling_mode | on/off |
| 1041 | occupancy | number of people in room |
A medium table retains the generic structure but adds columns of various types so that you can use the same table to store float, int, bool, or even JSON (jsonb) data:
| ts | sensor_id | d | i | b | t | j |
|---|---|---|---|---|---|---|
| 2024-10-31 11:17:30.000 | 1007 | 23.45 | null | null | null | null |
| 2024-10-31 11:17:47.000 | 1012 | null | null | TRUE | null | null |
| 2024-10-31 11:18:01.000 | 1041 | null | 4 | null | null | null |
To remove all-null entries, use an optional constraint such as:
CONSTRAINT at_least_one_not_null CHECK ((d IS NOT NULL) OR (i IS NOT NULL) OR (b IS NOT NULL) OR (j IS NOT NULL) OR (t IS NOT NULL))Get the last value of every sensor
There are several ways to get the latest value of every sensor. The following examples use the structure defined in Narrow table format example as a reference
SELECT DISTINCT ON
If you have a list of sensors, the easy way to get the latest value of every sensor is to use
SELECT DISTINCT ON:
WITH latest_data AS ( SELECT DISTINCT ON (sensor_id) ts, sensor_id, d FROM iot_data WHERE d is not null AND ts > CURRENT_TIMESTAMP - INTERVAL '1 week' -- important ORDER BY sensor_id, ts DESC)SELECT sensor_id, sensors.name, ts, dFROM latest_dataLEFT OUTER JOIN sensors ON latest_data.sensor_id = sensors.idWHERE latest_data.d is not nullORDER BY sensor_id, ts; -- Optional, for displaying results ordered by sensor_idThe common table expression (CTE) used above is not strictly necessary. However, it is an elegant way to join to the sensor list to get a sensor name in the output. If this is not something you care about, you can leave it out:
SELECT DISTINCT ON (sensor_id) ts, sensor_id, d FROM iot_data WHERE d is not null AND ts > CURRENT_TIMESTAMP - INTERVAL '1 week' -- important ORDER BY sensor_id, ts DESCIt is important to take care when down-selecting this data. In the previous examples,
the time that the query would scan back was limited. However, if there any sensors that have either
not reported in a long time or in the worst case, never reported, this query devolves to a full table scan.
In a database with 1000+ sensors and 41 million rows, an unconstrained query takes over an hour.
JOIN LATERAL
An alternative to SELECT DISTINCT ON is to use a JOIN LATERAL. By selecting your entire
sensor list from the sensors table rather than pulling the IDs out using SELECT DISTINCT, JOIN LATERAL can offer
some improvements in performance:
SELECT sensor_list.id, latest_data.ts, latest_data.dFROM sensors sensor_list -- Add a WHERE clause here to downselect the sensor list, if you wishLEFT JOIN LATERAL ( SELECT ts, d FROM iot_data raw_data WHERE sensor_id = sensor_list.id ORDER BY ts DESC LIMIT 1) latest_data ON trueWHERE latest_data.d is not null -- only pulling out float values ("d" column) in this example AND latest_data.ts > CURRENT_TIMESTAMP - interval '1 week' -- importantORDER BY sensor_list.id, latest_data.ts;Limiting the time range is important, especially if you have a lot of data. Best practice is to use these kinds of queries for dashboards and quick status checks. To query over a much larger time range, encapsulate the previous example into a materialized query that refreshes infrequently, perhaps once a day.
Shoutout to Christopher Piggott for this tutorial.