From c958890d4f2d7f9089aab62f27a98a5d4c0b8095 Mon Sep 17 00:00:00 2001
From: Nick Woolmer <29717167+nwoolmer@users.noreply.github.com>
Date: Fri, 29 Nov 2024 14:39:13 +0000
Subject: [PATCH] clarify dedup sorting behaviour

---
 documentation/concept/deduplication.md | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/documentation/concept/deduplication.md b/documentation/concept/deduplication.md
index d35fad91..fedaace7 100644
--- a/documentation/concept/deduplication.md
+++ b/documentation/concept/deduplication.md
@@ -42,6 +42,29 @@ precisely, deduplicating the data based on the device ID can be expensive.
 However, in cases where CPU metrics are sent at random and typically have unique
 timestamps, the cost of deduplication is negligible.
 
+:::note
+
+The on-disk order of rows with duplicate timestamps differs when deduplication is enabled.
+
+- Without deduplication:
+  - the insertion order is preserved for rows with the same timestamp
+- With deduplication:
+  - rows with the same timestamp are stored sorted by the `DEDUP UPSERT` keys
+
+For example:
+
+```questdb-sql
+DEDUP UPSERT KEYS(timestamp, symbol, price)
+
+-- will be sorted like this on disk:
+
+ORDER BY timestamp, symbol, price
+```
+
+This is the natural order of data returned in plain queries, without any grouping, filtering or ordering.
+
+:::
+
 ## Configuration
 
 Create a WAL-enabled table with deduplication using