MergeTree
MergeTree
is the foundational engine in ClickHouse, optimized for high-volume data ingestion and fast analytical queries. It supports background merging of data parts, primary key indexing, and partitioning.
Key Features:
- Stores data sorted by a primary key, enabling fast range scans
- Breaks data into granules (~8192 rows) for efficient index pruning
- Supports partitioning (e.g., by day/month) for retention and query optimization
- Merges data parts in the background to reduce small parts and reclaim space
- Ideal for append-only data like web events, logs, and telemetry
- Event tracking (e.g.,
web_events
,page_views
) - Session reconstruction from time-series data
- Time-partitioned storage with retention logic
ReplacingMergeTree
ReplacingMergeTree
extends MergeTree
by adding deduplication capabilities during background merges. It is useful when duplicate or outdated rows are expected, such as from retries or updates.
Key Features:
- Removes duplicates based on the
ORDER BY
key during merges - If a
version
column is specified, retains only the most recent version - Deduplication happens only during background merges, not on insert
- Cleaning up duplicates from at-least-once ingestion
- Storing dimension tables that receive frequent corrections
- Handling data updates with soft upserts
Deduplication works on the sorting key, not the primary key. Requires correct ordering and optional versioning to work effectively.
AggregatingMergeTree
AggregatingMergeTree
is designed to store pre-aggregated data using aggregate function states. It is tailored for analytics workloads where aggregation results are computed once and queried frequently.
Key Features:
- Stores aggregate function states like
sumState()
,uniqState()
, etc. - Merges rows with the same key using merge functions (e.g.,
sumMerge()
) - Reduces query latency and compute cost for recurring aggregations
- Materialized views with pre-aggregated metrics (e.g., DAU, conversions)
- Storing rollup tables for dashboards or reports
- Efficient support for time-bucketed aggregations
️ Not suitable for raw event storage. Must use aggregate functions when inserting data.
Choosing the Right Engine
Use Case | Recommended Engine | Rationale |
---|---|---|
Raw event ingestion | MergeTree | Fast inserts, no deduplication, full event fidelity |
Deduplicated or updated records | ReplacingMergeTree | Handles retries and corrections via background deduplication |
Pre-aggregated summaries | AggregatingMergeTree | Efficient storage and querying of aggregate function states |