Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenizer tests and TokenizeLine updates #11133

Merged
merged 16 commits into from
Nov 8, 2023
Merged

Tokenizer tests and TokenizeLine updates #11133

merged 16 commits into from
Nov 8, 2023

Conversation

paul1r
Copy link
Collaborator

@paul1r paul1r commented Nov 3, 2023

What this PR does / why we need it:
The thrust of this PR is to ensure we have tests for each major function of the Bloom Tokenizer. In addition, there was some cleanup, in that constants are used to set some common parameters.

Lastly, the TokenizeLine() call was updated to correctly tokenize a line when a "skip tokenizer" is utilized.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
    • If the change is worth mentioning in the release notes, add add-to-release-notes label
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@paul1r paul1r requested a review from a team as a code owner November 3, 2023 18:05
Copy link
Contributor

@poyzannur poyzannur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a minor question on the use of the two methods you added.
Am i correct to assume TokenizeLineWithChunkPrefix should be used when writing blooms, whereas TokenizeLine should be used to tokenize search queries before querying the blooms?

@paul1r
Copy link
Collaborator Author

paul1r commented Nov 8, 2023

LGTM with a minor question on the use of the two methods you added. Am i correct to assume TokenizeLineWithChunkPrefix should be used when writing blooms, whereas TokenizeLine should be used to tokenize search queries before querying the blooms?

PopulateSeriesWithBloom would be used when writing blooms. TokenizeLine is just to get a quick set of tokens, for the quick sniff test if we need to dig into a chunk or not. The TokenizeLineWithChunkPrefix is to validate those tokens exist for a specific chunk. Should be more obvious once we start wiring this together

owen-d
owen-d previously requested changes Nov 8, 2023
pkg/storage/bloom/v1/bloom_tokenizer.go Outdated Show resolved Hide resolved
pkg/storage/bloom/v1/bloom_tokenizer.go Outdated Show resolved Hide resolved
pkg/storage/bloom/v1/bloom_tokenizer.go Outdated Show resolved Hide resolved
pkg/storage/bloom/v1/bloom_tokenizer.go Outdated Show resolved Hide resolved
pkg/storage/bloom/v1/bloom_tokenizer_test.go Outdated Show resolved Hide resolved
@owen-d owen-d dismissed their stale review November 8, 2023 19:23

Realized this is for reads, not writes.

// If the tokenizer has a skip value, then the line will be tokenized multiple times,
// starting at the beginning of the line, with "skip" number of iterations, offset by one each time
// Each offset is kept as a separate slice of tokens, and all are returned in a slice of slices
func (bt *BloomTokenizer) TokenizeLineWithChunkPrefix(line string, chk logproto.ChunkRef) [][]Token {
Copy link
Member

@owen-d owen-d Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need for two functions here -- you can just use something like the following and apply it to any tokenizer (chunk_prefix_tokenizer or regular)

func SearchesForTokenizerAndLine(t Tokenizer, line string) (res [][]Token) {
  for i := 0; i < t.Skip()+1; i++ {
    res = append(res, t.Tokens(line[i:])) // this needs to account for runes vs bytes, but you get the idea 
  }
  return
}

func (bt *BloomTokenizer) TokenizeLineWithChunkPrefix(line string, chk logproto.ChunkRef) [][]Token {
allTokens := make([][]Token, 0, 10)

if len(line) >= bt.chunkIDTokenizer.GetMin() && len(line) >= bt.chunkIDTokenizer.GetSkip() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  • This actually needs to ensure the length is >= min + skip.
  • len(str) doesn't return the number of runes, but the number of bytes. We need to account for runes since that's how we index. See this for more detail.

Comment on lines 124 to 125
// This is a multi-dimensional slice where the first slice is the offset into the line, and the
// second slice is the tokens for that offset.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only true if all of the skip offsets return at least one token. Otherwise, the length of the result will be less than the number of skips+1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, I've updated the doc accordingly, good catch

// The offset is used if the Tokenizer has a skip value being utilized.
func SearchesForTokenizerAndLine(t Tokenizer, line string) (res [][]Token) {
res = make([][]Token, 0, 10)
for i := range line { // iterate by runes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unnecessarily iterates all runes in the line, including offsets beyond skip+1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, added a break clause

@owen-d owen-d merged commit c4ed0d0 into main Nov 8, 2023
5 checks passed
@owen-d owen-d deleted the paul1r/bloom_updates branch November 8, 2023 22:03
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024
**What this PR does / why we need it**:
The thrust of this PR is to ensure we have tests for each major function
of the Bloom Tokenizer. In addition, there was some cleanup, in that
constants are used to set some common parameters.

Lastly, the TokenizeLine() call was updated to correctly tokenize a line
when a "skip tokenizer" is utilized.

**Which issue(s) this PR fixes**:
Fixes #<issue number>

**Special notes for your reviewer**:

**Checklist**
- [ ] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] If the change is worth mentioning in the release notes, add
`add-to-release-notes` label
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/setup/upgrade/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](grafana@d10549e)
- [ ] If the change is deprecating or removing a configuration option,
update the `deprecated-config.yaml` and `deleted-config.yaml` files
respectively in the `tools/deprecated-config-checker` directory.
[Example
PR](grafana@0d4416a)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants