-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tokenizer tests and TokenizeLine updates #11133
Conversation
…. Add test cases for chunkID prefix work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a minor question on the use of the two methods you added.
Am i correct to assume TokenizeLineWithChunkPrefix
should be used when writing blooms, whereas TokenizeLine
should be used to tokenize search queries before querying the blooms?
PopulateSeriesWithBloom would be used when writing blooms. TokenizeLine is just to get a quick set of tokens, for the quick sniff test if we need to dig into a chunk or not. The TokenizeLineWithChunkPrefix is to validate those tokens exist for a specific chunk. Should be more obvious once we start wiring this together |
// If the tokenizer has a skip value, then the line will be tokenized multiple times, | ||
// starting at the beginning of the line, with "skip" number of iterations, offset by one each time | ||
// Each offset is kept as a separate slice of tokens, and all are returned in a slice of slices | ||
func (bt *BloomTokenizer) TokenizeLineWithChunkPrefix(line string, chk logproto.ChunkRef) [][]Token { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need for two functions here -- you can just use something like the following and apply it to any tokenizer (chunk_prefix_tokenizer or regular)
func SearchesForTokenizerAndLine(t Tokenizer, line string) (res [][]Token) {
for i := 0; i < t.Skip()+1; i++ {
res = append(res, t.Tokens(line[i:])) // this needs to account for runes vs bytes, but you get the idea
}
return
}
func (bt *BloomTokenizer) TokenizeLineWithChunkPrefix(line string, chk logproto.ChunkRef) [][]Token { | ||
allTokens := make([][]Token, 0, 10) | ||
|
||
if len(line) >= bt.chunkIDTokenizer.GetMin() && len(line) >= bt.chunkIDTokenizer.GetSkip() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things:
- This actually needs to ensure the length is
>= min + skip
. - len(str) doesn't return the number of
runes
, but the number of bytes. We need to account forrunes
since that's how we index. See this for more detail.
// This is a multi-dimensional slice where the first slice is the offset into the line, and the | ||
// second slice is the tokens for that offset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only true if all of the skip offsets return at least one token. Otherwise, the length of the result will be less than the number of skips+1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, I've updated the doc accordingly, good catch
// The offset is used if the Tokenizer has a skip value being utilized. | ||
func SearchesForTokenizerAndLine(t Tokenizer, line string) (res [][]Token) { | ||
res = make([][]Token, 0, 10) | ||
for i := range line { // iterate by runes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This unnecessarily iterates all runes in the line, including offsets beyond skip+1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack, added a break clause
**What this PR does / why we need it**: The thrust of this PR is to ensure we have tests for each major function of the Bloom Tokenizer. In addition, there was some cleanup, in that constants are used to set some common parameters. Lastly, the TokenizeLine() call was updated to correctly tokenize a line when a "skip tokenizer" is utilized. **Which issue(s) this PR fixes**: Fixes #<issue number> **Special notes for your reviewer**: **Checklist** - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [ ] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] If the change is worth mentioning in the release notes, add `add-to-release-notes` label - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/setup/upgrade/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](grafana@d10549e) - [ ] If the change is deprecating or removing a configuration option, update the `deprecated-config.yaml` and `deleted-config.yaml` files respectively in the `tools/deprecated-config-checker` directory. [Example PR](grafana@0d4416a)
What this PR does / why we need it:
The thrust of this PR is to ensure we have tests for each major function of the Bloom Tokenizer. In addition, there was some cleanup, in that constants are used to set some common parameters.
Lastly, the TokenizeLine() call was updated to correctly tokenize a line when a "skip tokenizer" is utilized.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Checklist
CONTRIBUTING.md
guide (required)CHANGELOG.md
updatedadd-to-release-notes
labeldocs/sources/setup/upgrade/_index.md
production/helm/loki/Chart.yaml
and updateproduction/helm/loki/CHANGELOG.md
andproduction/helm/loki/README.md
. Example PRdeprecated-config.yaml
anddeleted-config.yaml
files respectively in thetools/deprecated-config-checker
directory. Example PR