Skip to content

Improve primary keys of clickhouse tables #410

@hellais

Description

@hellais

Some of the DB tables could have their PRIMARY KEYs changed to improve query performance.

For example the PRIMARY KEY of the obs_web table is this:

ENGINE = ReplicatedReplacingMergeTree('/clickhouse/{cluster}/tables/ooni/obs_web_repl/{shard}', '{replica}')
PARTITION BY concat(substring(bucket_date, 1, 4), substring(bucket_date, 6, 2))
PRIMARY KEY (measurement_uid, observation_idx)

while the analysis table is this:

ENGINE = ReplicatedReplacingMergeTree('/clickhouse/{cluster}/tables/ooni/analysis_web_measurement/{shard}', '{replica}')
PARTITION BY substring(measurement_uid, 1, 6)
PRIMARY KEY measurement_uid
ORDER BY (measurement_uid, measurement_start_time, probe_cc, probe_asn, domain)

What this means is that it's triggering very large scans (effectively a full scan), when performing very common filtering operations over the measurement_start_time, probe_cc, probe_asn and domain.

Unfortunately it's not possible to change these with an ALTER query, so instead we are going to have to create the new table and copy the data over.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions