Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CFR and WTF diagnostics programs #88

Merged
merged 15 commits into from
Jun 13, 2024
Merged

Add CFR and WTF diagnostics programs #88

merged 15 commits into from
Jun 13, 2024

Conversation

amotl
Copy link
Member

@amotl amotl commented Dec 1, 2023

About

WTF

There are many smart queries to leverage information from CrateDB's internal sys.* tables. This subsystem aims to bundle and collect them, in order to unlock easy access from CLI and HTTP interfaces, and to be able to use them as building blocks for other software components.

CFR

Export and import CrateDB system tables conveniently.

Status

It is still a work in progress, and needs more attention and love. What can be done right now is outlined within the preview section. Any kind of help to expand this is much appreciated.

Documentation

Rendered in preview mode.

Both documents include directives to outline how operations work using Docker. However, the program can also be installed natively, also while not on PyPI yet. Please use one of those commands to acquire the software. When using Docker, make sure to use the cratedb-toolkit:pr-88 image on GHCR for preview purposes.

# Native
pip install --upgrade 'cratedb-wtf @ git+https://github.com/crate-workbench/cratedb-toolkit@cratedb-wtf'
# OCI
docker pull ghcr.io/crate-workbench/cratedb-toolkit:pr-88

Help

ctk cfr --help
ctk wtf --help

Backlog

This comment was marked as outdated.

@amotl amotl changed the title cratedb-wtf: Add cratedb-wtf diagnostics program Add cratedb-wtf diagnostics program Dec 1, 2023
Comment on lines 46 to 269
class Logs:
# TODO: Implement `tail` in one way or another. -- https://stackoverflow.com/q/4714975
# SELECT * FROM sys.jobs_log OFFSET -10;
# SELECT * FROM sys.jobs_log OFFSET (SELECT count(*) FROM sys.jobs_log)-10;
# https://cratedb.com/docs/crate/reference/en/latest/general/builtins/scalar-functions.html#to-char-expression-format-string
# https://cratedb.com/docs/crate/reference/en/latest/general/builtins/scalar-functions.html#date-format-format-string-timezone-timestamp
user_queries = """
SELECT
DATE_FORMAT('%Y-%m-%dT%H:%i:%s.%f', started) AS started,
DATE_FORMAT('%Y-%m-%dT%H:%i:%s.%f', ended) AS ended,
classification, stmt, username, node
FROM
sys.jobs_log
WHERE
stmt NOT LIKE '%sys.%' AND
stmt NOT LIKE '%information_schema.%'
"""
Copy link
Member Author

@amotl amotl Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dear @hlcianfagna, @hammerhead, and @WalBeh,

I would like to implement tailing the sys.jobs_log table in one way or another. Actually, tail --follow. Did you manage to do that yet, using any kind of OFFSET/LIMIT magic?

SELECT * FROM sys.jobs_log OFFSET -10;
SELECT * FROM sys.jobs_log OFFSET (SELECT count(*) FROM sys.jobs_log)-10;

Those statements outlined above failed for me. Is there any way to get the number of total records out of the subselect into the OFFSET parameter? Most probably, I am only too silly to make it work, so I am humbly asking for your support.

Cheers,
Andreas.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 SELECT CURRENT_TIMESTAMP AS last_timestamp,
                            (ended / 10000) * 10000 + 5000 AS ended_time,
                            COUNT(*) / 10.0 AS qps,
                            TRUNC(AVG(ended::bigint - started::bigint), 2) AS duration,
                            UPPER(regexp_matches(stmt, '^\s*(\w+).*') [1]) AS query_type
FROM sys.jobs_log
WHERE ended > now() - ('15 minutes')::interval
GROUP BY 1,
         2,
         5
ORDER BY ended_time ASC

This is the query used in the panel to show the latest 15 minutes QPS, perhaps it can serve as inspiration :P

Copy link
Member Author

@amotl amotl Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. So this is synthesizing the limiting by timestamp, compared to what is stored within the ended column? Hmm. Naturally I'd favor a more generic solution, but in this case, it could make an acceptable workaround. Sweet.

What's stored in ended when, well, the query has not finished yet? For emulating a tail -f, I'd probably better rely on started instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems empty when it doesn't have a value.

Something like

Query ran at 2023-12-02T11:02:47.288Z on CrateDB 5.5.0
SELECT
  DATE_FORMAT(started)
FROM
  "sys"."jobs_log"
ORDER BY
  started DESC
LIMIT
  10
QUERY OK, 10 record(s) returned in 0.0045s
date_format(started)
"2023-12-02T11:02:40.890000Z"
"2023-12-02T11:02:40.825000Z"
"2023-12-02T11:02:40.822000Z"
"2023-12-02T11:02:40.819000Z"
"2023-12-02T11:02:40.819000Z"
"2023-12-02T11:02:40.819000Z"
"2023-12-02T11:02:40.819000Z"
"2023-12-02T11:02:35.892000Z"
"2023-12-02T11:02:35.821000Z"
"2023-12-02T11:02:35.821000Z"

Seem to work just fine as tail -f

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To follow what is new I would keep track of the last started timestamp observed and query what is new

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to get this patch merged, we postponed this by adding it as an item to the backlog.

https://github.com/crate-workbench/cratedb-toolkit/blob/1c437cad2533c2e47423e7c32a6f15e79f5adcd1/doc/wtf/backlog.md?plain=1#L5-L6

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We created a dedicated issue, because it's a prominent feature yet missing.

@amotl amotl force-pushed the cratedb-wtf branch 2 times, most recently from 0e3ca31 to 002a255 Compare December 4, 2023 04:04
@amotl amotl force-pushed the cratedb-wtf branch 3 times, most recently from 3f53a92 to 2b2e8ff Compare January 2, 2024 10:51
@amotl

This comment was marked as off-topic.

@codecov-commenter
Copy link

codecov-commenter commented Apr 16, 2024

Codecov Report

Attention: Patch coverage is 84.11215% with 119 lines in your changes missing coverage. Please review.

Project coverage is 80.59%. Comparing base (f353b7f) to head (5a9ef0e).

Files Patch % Lines
cratedb_toolkit/wtf/query_collector.py 58.26% 53 Missing ⚠️
cratedb_toolkit/cfr/systable.py 90.90% 12 Missing ⚠️
cratedb_toolkit/wtf/recorder.py 69.69% 10 Missing ⚠️
cratedb_toolkit/util/cli.py 52.94% 8 Missing ⚠️
cratedb_toolkit/util/data.py 46.15% 7 Missing ⚠️
cratedb_toolkit/util/database.py 56.25% 7 Missing ⚠️
cratedb_toolkit/wtf/cli.py 94.00% 6 Missing ⚠️
cratedb_toolkit/sqlalchemy/patch.py 85.29% 5 Missing ⚠️
cratedb_toolkit/util/platform.py 88.57% 4 Missing ⚠️
cratedb_toolkit/wtf/model.py 95.45% 3 Missing ⚠️
... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #88      +/-   ##
==========================================
+ Coverage   79.01%   80.59%   +1.57%     
==========================================
  Files          56       68      +12     
  Lines        1973     2705     +732     
==========================================
+ Hits         1559     2180     +621     
- Misses        414      525     +111     
Flag Coverage Δ
influxdb 36.96% <40.18%> (+1.18%) ⬆️
main 68.13% <83.57%> (+6.04%) ⬆️
mongodb 46.43% <39.51%> (-2.58%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@amotl amotl force-pushed the cratedb-wtf branch 6 times, most recently from 2f96848 to f0902b2 Compare April 17, 2024 00:42
@amotl amotl changed the title Add cratedb-wtf diagnostics program Add CFR and WTF diagnostics programs Apr 17, 2024
@amotl amotl force-pushed the cratedb-wtf branch 2 times, most recently from 5f62d1a to c3222b3 Compare April 17, 2024 08:15
@amotl amotl force-pushed the cratedb-wtf branch 2 times, most recently from 4d3b461 to 0b3a47c Compare May 8, 2024 09:40
Comment on lines +457 to +486
not_started = InfoElement(
name="shard_not_started",
label="Shards not started",
sql="""
SELECT *
FROM sys.allocations
WHERE current_state != 'STARTED';
""",
description="Information about shards which have not been started.",
)
not_started_count = InfoElement(
name="shard_not_started_count",
label="Number of shards not started",
description="Total number of shards which have not been started.",
sql="""
SELECT COUNT(*) AS not_started_count
FROM sys.allocations
WHERE current_state != 'STARTED';
""",
transform=get_single_value("not_started_count"),
)
Copy link
Member Author

@amotl amotl May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hlcianfagna just shared this snippet gem which might be sensible in this context. Thanks.

Regarding checking which shards are being relocated and any allocation issues as well as reasons for it, this information can be found by querying sys.shards and sys.allocations for all shards that are not started.

SELECT * FROM sys.shards WHERE state <> 'STARTED';
SELECT * FROM sys.allocations WHERE current_state <> 'STARTED';

@seut seut force-pushed the cratedb-wtf branch 2 times, most recently from 73b821a to c16ae9d Compare May 27, 2024 09:07
@seut
Copy link
Member

seut commented Jun 3, 2024

Is this already ready to reviewed or is this still in a draft mode? I'd be very fine this out of draft mode.

@amotl
Copy link
Member Author

amotl commented Jun 5, 2024

Thank you for your excellent and thorough review.

Many quirks you outlined have been originating from blatantly copying together valuable-looking queries from different sources. Trying to bring them into the shape outlined through wtf/library.py in a hurry, admittedly not knowing much about many details inside of them, the patch dearly needed support by experts in CrateDB SQL.

I like your suggestions, and will improve each item through corresponding fixups, or defer relevant items to later iterations.

@amotl
Copy link
Member Author

amotl commented Jun 13, 2024

I like your suggestions, and will improve each item through corresponding fixups, or defer relevant items to later iterations.

Thanks to your diligent review, this process yielded a whopping number of five items to be improved. I am deliberately postponing them to a later iteration, in order to get the PR merged soon, so we can start shipping the program. I hope you agree with that decision.

@amotl
Copy link
Member Author

amotl commented Jun 13, 2024

Re-issuing this PR for another review now. If you agree with all other resolvements, fixes, deferrals, or not, please let me also know which improvement you suggest at #88 (comment).

Copy link
Member

@seut seut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice stuff, thanks!
Probably a few commits could be squashed...

amotl and others added 15 commits June 13, 2024 15:38
Add basic implementation for `sys-export` and `sys-import` subcommands.

It is about exporting system tables of CrateDB into SQL DDL and JSONL
files, and re-importing them for later analysis.
- Forward CLI options `--listen` and `--reload` to implementation
- Improve documentation about the `--listen` and `--reload` options
- Assign random port when `--listen` is supplied without port number
> This looks unreasonable complicated to just translate the primary boolean
> into a string.

Co-authored-by: Sebastian Utz <su@rtme.net>
@amotl
Copy link
Member Author

amotl commented Jun 13, 2024

Thanks for the acknowledgement. While I agree, I don't know which commits exactly to squash best, and how to shape corresponding scopes and meaningful commit messages. So, I will be merging as-is, in order not to delay this patch any longer.

Being still in its infancy, I think it is good enough to have more detailed commit granularity in CrateDB Toolkit, until it will reach 0.1.x, or such.

@amotl amotl merged commit d191cb9 into main Jun 13, 2024
21 checks passed
@amotl amotl deleted the cratedb-wtf branch June 13, 2024 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants