Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Structured metadata stream stats #15427

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

salvacorts
Copy link
Contributor

@salvacorts salvacorts commented Dec 16, 2024

What this PR does / why we need it:

This PR adds structured metadata (SM) stats to the TSDB index. For each stream we save a list of available structured metadata field names. We then use these stats to:

  • Return a list of available SM fields for the /api/labels endpoint (both on the ingester and the index).
  • On the query path, skip all series not having a given SM field we are filtering by (here).

On ingesters, for each stream we keep (here):

  • A closed set of stats: has the stats up to the most recent closed chunk.
  • An open set of stats: these are updated as we receive push requests (closed set + newer data).
    On the flush loop, the Ingester updates the TSDB with the closed set of stats.

We store these stats next to the label set of each stream on TSDB. As for regular labels, SM names are referenced using the symbols table.
image

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@salvacorts salvacorts changed the title Structured metadata indexing feat: Structured metadata stream stats Dec 16, 2024
@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Dec 16, 2024
Copy link
Contributor

github-actions bot commented Dec 16, 2024

@salvacorts salvacorts marked this pull request as ready for review December 17, 2024 11:09
@salvacorts salvacorts requested a review from a team as a code owner December 17, 2024 11:09
@chaudum
Copy link
Contributor

chaudum commented Dec 18, 2024

Return a list of available SM fields for the /api/labels endpoint (both on the ingester and the index).

IMO we should have a dedicated API endpoint, such as /api/v1/fields rather than packing structured metadata into the "labels" term.
Having an API endpoint for both labels, fields (structured metadata) and "detected fields" (aka fields from parsed JSON/logfmt logs) would be quite nice for several reasons in Grafana.

@trevorwhitney
Copy link
Collaborator

trevorwhitney commented Dec 18, 2024

@chaudum we already have a /api/v1/detected_fields endpoint, and it already returns structured metadata. it's slower than this approach as it just scans log lines up to a limit, since it's extracting both fields and structured metadata, the former of which we can't index.

@salvacorts I haven't looked at all the changes here, but I love the idea of putting structured metadata keys in the index! I think we should refactor /api/v1/detected_fields to return the structured metadata key names, and keep the logic of /api/v1/detected_field/{name}/values for getting the values. That to me feels like a less disruptive change than adding them to /api/v1/labels, which would complicate some UIs (since structured metadata needs to be added to a query as a filter, more similar to fields than labels)

@salvacorts
Copy link
Contributor Author

Thank you @trevorwhitney and @chaudum

We have been discussing this issue in Slack but I wanted to share my answer here as well.

re. using /api/v1/detected_fields
AFAICT, detected fields performs a Loki log selector query right? That’s expensive compared to using the labels endpoint.

I think the workflow in Grafana should look sth like:

  1. As the user types the stream selector:
    • Suggest stream labels autocompletion
    • Suggest structured metadata autocompletion
  2. As the user types the filtering expr after the stream selector
    • Suggest structured metadata fields (got during stream selection, i.e. no extra calls to Loki)
    • Suggest detected fields (via /detected_fields)

re. implementing a new /fields endpoint
Grafana would hit both the /labels and /fields. For /labels we are already iterating the index pretty much the same way /fields would. Is it worth all the extra compute work due to hitting two endpoints instead of piggybacking the SM fields on the /labels response?


On a separate note, I'm going on PTO for some weeks. I think @shantanualsi may be able to take over this PR if we want to continue working on this one during holidays season.

@trevorwhitney
Copy link
Collaborator

AFAICT, detected fields performs a Loki log selector query right? That’s expensive compared to using the labels endpoint.

@salvacorts yes, which is why I think we should refactor it to get structured metadata keys from the index (but still use the log selector query to get structured metadata values).I agree getting fields + structured metadata would still be slow, so maybe we add a query parameter for just getting the structured metadata?

My concern with adding this to /labels is how do we expose the structured metadata values (which will still have to do a log selector query). Do you think we would weave that into /label/{name}/values? My original thinking for keeping that in the same path as /detected_fields rather than /labels is the performance of getting a label value vs a structured metadata one would be vastly different.

If we do keep this in /labels, then we need to remove it from /detected_fields and /detected_field/{name}/values. Furthermore, we'll have to think of how this affects the information architecture in Explore Logs (cc: @svennergr @matyax @gtk-grafana). For example, does structured metadata move to the labels tab? Stay in the Detected Fields tab? Get it's own tab (an idea we discussed and decided against previously).

I do like your auto-complete use case though, for having this either in /labels or something else that is very fast.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/XXL type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants