The use of JSON struct in clickhouse results in high storage space consumption #481
Comments
@shenqidebaozi do you have any performance comparison of the traces search procedure?
@akvlad So far, in researching the different products, I have not compared trace search performance. But I think storage costs are also important.
@shenqidebaozi in your opinion, how many GB of HDD would be an equal trade-off for 1 CPU core?
This is a good question, I don't know how to measure it.
If there's anything really unused it can be avoided, but I'm not sure that's the case. Compression and codec choices might also play a vital role and should be carefully reviewed.
This could be because Uptrace uses zstd compression by default with ClickHouse. Does qryn allow specifying compression? qryn seems to be using zstd in only 3-4 fields, which explains the difference in size. Having an option to use zstd wherever possible would reduce disk usage substantially. It would be useful to have an ENV for specifying the compression algorithm and level in qryn. For example, the default zstd level is 1, compared to 3 when using the zstd CLI.
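A hedged sketch of what such an option could look like, reusing the {{DB}}/{{{OnCluster}}} placeholder style from qryn's schema scripts; the {{Codec}} placeholder and the CLICKHOUSE_CODEC variable are hypothetical, not existing qryn settings. The column definitions would then become something like:
-- {{Codec}} substituted at startup from an environment variable,
-- e.g. CLICKHOUSE_CODEC="ZSTD(3)" (hypothetical name), falling back to "LZ4" when unset
trace_id String CODEC({{Codec}}),
payload  String CODEC({{Codec}}),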
This query shows that uptrace lets you configure the compression type/level, which gets appended to the ClickHouse schema: https://github.com/search?q=repo%3Auptrace%2Fuptrace%20ch_schema&type=code
qryn/lib/db/maintain/scripts.js Lines 158 to 170 in b4cda9e
What is the specific purpose of the code at Line 18 in b4cda9e?
For the second question, should we define
The only downside of using Nested is that it makes the field more strict than a string. Also worth mentioning fields like:
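For reference, a sketch of the two shapes being discussed for the tags column (types only, names illustrative; qryn's actual definition is in scripts.js):
tags Array(Tuple(String, String))      -- current shape: free-form key/value pairs
tags Nested(key String, value String)  -- stricter alternative, stored as parallel arrays tags.key / tags.value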
@gaby we absolutely want compression choices to be as open as possible for experimenting. We could work on a set of ALTER statements we can use to experiment with.
@lmangani That would be a good starting point, or updating the CREATE TABLE and testing with a big data set to see the difference in size/performance. Compression will add latency and reduce throughput, which is why it should be configurable.
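A minimal sketch of such an experiment, assuming a destination MergeTree table (the table and column names are placeholders, not qryn's actual schema). ClickHouse can change a column's codec in place; new parts are written with the new codec, and existing parts are only recompressed when they get rewritten by merges:
ALTER TABLE {{DB}}.traces MODIFY COLUMN payload String CODEC(ZSTD(3));
-- optionally rewrite existing parts so they are recompressed with the new codec
OPTIMIZE TABLE {{DB}}.traces FINAL;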
@gaby It would also mean JSON is marshalled/unmarshalled only once, which is helpful for bulk writes and queries.
According to ChatGPT, the same CREATE TABLE SQL would look like:
CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} (
oid String DEFAULT '0' CODEC(ZSTD),
trace_id String CODEC(ZSTD),
span_id String CODEC(ZSTD),
parent_id String CODEC(ZSTD),
name String CODEC(ZSTD),
timestamp_ns Int64 CODEC(DoubleDelta, ZSTD),
duration_ns Int64 CODEC(ZSTD),
service_name String CODEC(ZSTD),
payload_type Int8 CODEC(ZSTD),
payload String CODEC(ZSTD),
tags Array(Tuple(String, String)) CODEC(ZSTD)
) Engine=Null

When asked to add levels based on field type, it produces the following:
CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} (
oid String DEFAULT '0' CODEC(ZSTD(1)),
trace_id String CODEC(ZSTD(3)), -- Likely to benefit from more compression
span_id String CODEC(ZSTD(3)), -- Likely to benefit from more compression
parent_id String CODEC(ZSTD(1)),
name String CODEC(ZSTD(1)),
timestamp_ns Int64 CODEC(DoubleDelta, ZSTD(3)), -- Larger data size, benefits from more compression
duration_ns Int64 CODEC(ZSTD(3)), -- Larger data size, benefits from more compression
service_name String CODEC(ZSTD(1)),
payload_type Int8 CODEC(ZSTD(1)),
payload String CODEC(ZSTD(3)), -- Assuming payloads can be large/structured, they might benefit more
tags Array(Tuple(String, String)) CODEC(ZSTD(1))
) Engine=Null
@gaby let us know how this plays out, and if it produces a visible effect we can most definitely implement options to trigger it.
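One way to check whether a codec change has a visible effect, offered as a sketch: traces_input above is Engine=Null and stores nothing itself, so per-column sizes have to be read from the destination MergeTree table (the table name below is a placeholder). system.columns reports compressed and uncompressed bytes per column:
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'traces'             -- placeholder: use the actual storage table name
  AND data_compressed_bytes > 0
ORDER BY data_compressed_bytes DESC;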
For the same 10 million traces, qryn requires 18GB of storage while uptrace only requires 4GB, which seems to be due to the payload being stored as JSON that cannot be optimized.
qryn/lib/db/maintain/scripts.js, lines 158 to 170 in b4cda9e