Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support split align and caching for instant metric query results #11814

Merged
merged 33 commits into from
Feb 20, 2024

Conversation

kavirajk
Copy link
Contributor

@kavirajk kavirajk commented Jan 29, 2024

What this PR does / why we need it:
Follow up to metadata results caching, this PR adds support for instant metric queries. It also adds support to enable aligning the subquery for more reusability.

Config changes:

  1. cache_instant_metric_results - Enable/disable (default disable) - Boolean
  2. instant_metric_results_cache - CacheConfig to tweak (usually not needed, have sane defaults)
  3. instant_metric_query_split_align - Enable/disable (default disable) - Boolean

How it works (without split align)

Consider following query
Query: sum(rate({foo="bar"}[3h])) @ 12:34:00
SplitInterval: 1h

So we need results from 09:34:00 to 12:34:00. (3h total)

Currently, After range mapper, it splits into

  1. sum(rate({foo="bar"}[1h])) @ 12:34:00
  2. sum(rate({foo="bar"}[1h] offset 1h)) @ 12:34:00
  3. sum(rate({foo="bar"}[1h] offset 2h)) @ 12:34:00

Even if we remove the offset it turns into

  1. sum(rate({foo="bar"}[1h])) @ 12:34:00
  2. sum(rate({foo="bar"}[1h])) @ 11:34:00
  3. sum(rate({foo="bar"}[1h])) @ 10:34:00

But the problem is now eval time is not aligned. And it's mostly unlikely these subqueries are reused.

How it works (with split align)

Now consider the same exact query
Query: sum(rate({foo="bar"}[3h])) @ 12:34:00
SplitInterval: 1h

After range mapper, it splits into

  1. sum(rate({foo="bar"}[34m])) @ 12:34:00
  2. sum(rate({foo="bar"}[1h] offset 34m)) @ 12:34:00
  3. sum(rate({foo="bar"}[1h] offset 1h 34m)) @ 12:34:00
  4. sum(rate({foo="bar"}[26m] offset 2h 34m)) @ 12:34:00

And after removing the offset it tuns into (properly eval time aligned)

  1. sum(rate({foo="bar"}[34m])) @ 12:34:00
  2. sum(rate({foo="bar"}[1h])) @ 12:00:00
  3. sum(rate({foo="bar"}[1h])) @ 11:00:00
  4. sum(rate({foo="bar"}[26m])) @ 10:00:00

Now we have (2) and (3) subqueries properly aligned and highly likely be reused.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
Will have some follow up PRs (tried doing it in single PR, but turns out really hard to review and big change) to

  1. Refactor configs to unify and simplify all the results cache configs
  2. Refactor results cache metrics, to avoid lots of duplicates
  3. Simplify some protobuf definitions (particularly stats.proto)

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
    • If the change is worth mentioning in the release notes, add add-to-release-notes label
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
@kavirajk kavirajk force-pushed the kavirajk/cache-instant-queries2 branch from bb83a3a to 57d77a9 Compare January 30, 2024 10:55
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
@pull-request-size pull-request-size bot added size/XL and removed size/L labels Feb 2, 2024
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Fix test cases that failed with this changes

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
@kavirajk kavirajk changed the title feat(caching): Support caching for instant metric query results feat(caching): Support split align and caching for instant metric query results Feb 13, 2024
@kavirajk kavirajk marked this pull request as ready for review February 13, 2024 08:48
@kavirajk kavirajk requested a review from a team as a code owner February 13, 2024 08:48
@kavirajk kavirajk changed the title feat(caching): Support split align and caching for instant metric query results feat: Support split align and caching for instant metric query results Feb 13, 2024
pkg/querier/queryrange/downstreamer.go Outdated Show resolved Hide resolved
pkg/logql/rangemapper.go Outdated Show resolved Hide resolved
1. Update both start and end when removing offset
2. Unify subqueries generation in splitalign method

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Feb 16, 2024
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
pkg/validation/limits.go Outdated Show resolved Hide resolved
docs/sources/configure/_index.md Outdated Show resolved Hide resolved
pkg/querier/queryrange/instant_metric_cache.go Outdated Show resolved Hide resolved
pkg/querier/queryrange/instant_metric_cache.go Outdated Show resolved Hide resolved
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the detailed description and lots of tests!
Added a few comments, mostly nits. Good job Kavi!

docs/sources/configure/_index.md Show resolved Hide resolved
pkg/logql/rangemapper.go Outdated Show resolved Hide resolved
pkg/logql/rangemapper_test.go Show resolved Hide resolved
pkg/logqlmodel/stats/stats.proto Show resolved Hide resolved
pkg/querier/queryrange/codec.go Outdated Show resolved Hide resolved
docs/sources/configure/_index.md Show resolved Hide resolved
pkg/querier/queryrange/roundtrip.go Show resolved Hide resolved
pkg/logql/rangemapper.go Outdated Show resolved Hide resolved
pkg/logql/rangemapper.go Show resolved Hide resolved
@ashwanthgoli
Copy link
Contributor

I need to do another pass to review the tests, rest lgtm. Nice one @kavirajk ❤️

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
@kavirajk
Copy link
Contributor Author

Caching and split align in action

Do instant query for 3h range (first time)

-bash-5.2$ ./cmd/logcli/logcli instant-query 'sum(rate({job="varlogs"}[3h]))'
2024/02/20 08:18:37 http://localhost:3100/loki/api/v1/query?direction=BACKWARD&limit=30&query=sum%28rate%28%7Bjob%3D%22varlogs%22%7D%5B3h%5D%29%29&time=1708413517893995000
[
  {
    "metric": {},
    "value": [
      1708413517.893,
      "0.21814814814814815"
    ]
  }

How the query is split and cache reqs and cache hits (these logs lines are from metrics.go and engine.go on query-frontend and queriers)

Quey-frontend (saying 4 requests were actually made for cache, and none of those got hit)

latency=fast query="sum(rate({job=\"varlogs\"}[3h]))" query_hash=1737987035 cache_result_req=4 cache_result_hit=0

How the above query got split and run

msg="executing query" type=instant query="sum(count_over_time({job=\"varlogs\"}[1h]))" 
msg="executing query" type=instant query="sum(count_over_time({job=\"varlogs\"}[1h]))"
msg="executing query" type=instant query="sum(count_over_time({job=\"varlogs\"}[41m22s106ms]))" msg="executing query" type=instant query="sum(count_over_time({job=\"varlogs\"}[18m37s893ms]))" 

Do the instant query for 3h range (second time)

-bash-5.2$ ./cmd/logcli/logcli instant-query 'sum(rate({job="varlogs"}[3h]))'
2024/02/20 08:19:11 http://localhost:3100/loki/api/v1/query?direction=BACKWARD&limit=30&query=sum%28rate%28%7Bjob%3D%22varlogs%22%7D%5B3h%5D%29%29&time=1708413551869792000
[
  {
    "metric": {},
    "value": [
      1708413551.869,
      "0.02935185185185185"
    ]
  }

How the query is split and cache reqs and cache hits (these logs lines are from metrics.go and engine.go on query-frontend and queriers)

Quey-frontend (saying 4 requests were actually made for cache, 2 of those got hit)

latency=fast query="sum(rate({job=\"varlogs\"}[3h]))" query_hash=1737987035 cache_result_req=4 cache_result_hit=2

How the above query got split and run (only two queries that miss cache hit, got run this time)

msg="executing query" type=instant query="sum(count_over_time({job=\"varlogs\"}[19m11s869ms]))" 
msg="executing query" type=instant query="sum(count_over_time({job=\"varlogs\"}[40m48s130ms]))"

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
Use right split duration (new InstantSplitDuration) for instant queries

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
@kavirajk kavirajk merged commit fac5997 into main Feb 20, 2024
9 checks passed
@kavirajk kavirajk deleted the kavirajk/cache-instant-queries2 branch February 20, 2024 10:09
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/XXL type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants