-
Notifications
You must be signed in to change notification settings - Fork 350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lakectl log: add option to filter merge commits #8142
lakectl log: add option to filter merge commits #8142
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fearful PR in one of the most key areas of Graveler - well done.
In order for me to effectively review this PR, I'm going to request 2 things:
- Document the new logic in the commit iterator. It's hard to follow this delicate area logic change without this.
- Tests. First it seems that some tests fail in our CI (not sure if these are related or not). Also -
commit_iterator
needs some testing for the new functionality (perhaps more testing for existing functionality too), lakectl requires tests for the new flag, probably in esti.
pkg/graveler/ref/commit_iterator.go
Outdated
for ci.queue.Len() > 0 { | ||
ci.value = heap.Pop(&ci.queue).(*graveler.CommitRecord) | ||
|
||
// as long as we have something in the queue we will | ||
// set it as the current value and push the current commits parents to the queue | ||
ci.value = heap.Pop(&ci.queue).(*graveler.CommitRecord) | ||
parents := ci.value.Parents | ||
if ci.firstParent && len(parents) > 1 { | ||
parents = parents[:1] | ||
} | ||
for _, p := range parents { | ||
// skip commits we already visited | ||
if _, visited := ci.visit[p]; visited { | ||
if ci.noMerges && len(ci.value.Parents) > 1 { | ||
firstParent := ci.value.Parents[0] | ||
|
||
if _, visited := ci.visit[firstParent]; visited { | ||
continue | ||
} | ||
|
||
rec, err := ci.getCommitRecord(firstParent) | ||
if err != nil { | ||
ci.value = nil | ||
ci.err = err | ||
return false | ||
} | ||
ci.visit[rec.CommitID] = struct{}{} | ||
|
||
if ci.since != nil && rec.Commit.CreationDate.Before(*ci.since) { | ||
continue | ||
} | ||
heap.Push(&ci.queue, rec) | ||
continue | ||
} | ||
|
||
rec, err := ci.getCommitRecord(p) | ||
if err != nil { | ||
ci.value = nil | ||
ci.err = err | ||
return false | ||
parents := ci.value.Parents | ||
if ci.firstParent && len(parents) > 1 { | ||
parents = parents[:1] | ||
} | ||
ci.visit[rec.CommitID] = struct{}{} | ||
|
||
// skip commits that are older than since time | ||
if ci.since != nil && rec.Commit.CreationDate.Before(*ci.since) { | ||
continue | ||
for _, p := range parents { | ||
if _, visited := ci.visit[p]; visited { | ||
continue | ||
} | ||
|
||
rec, err := ci.getCommitRecord(p) | ||
if err != nil { | ||
ci.value = nil | ||
ci.err = err | ||
return false | ||
} | ||
ci.visit[rec.CommitID] = struct{}{} | ||
|
||
// skip commits that are older than since time | ||
if ci.since != nil && rec.Commit.CreationDate.Before(*ci.since) { | ||
continue | ||
} | ||
|
||
heap.Push(&ci.queue, rec) | ||
} | ||
|
||
heap.Push(&ci.queue, rec) | ||
return true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Impressive - One of the more delicate areas in Graveler and it seems really thought through.
For me, as a reviewer, it's near impossible to follow the logic that is applied here without comments walking me through the decisions taken in every branch of this code.
@ItamarYuran please add missing labels to PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocking about possible incorrectness.
Also this code needs tests, I assume they'll be in the next commit on this PR so I'm blocking but you probably already know why.
With that it is the way, I would not like this code to be in the server, and this is my primary reason to block. A client like lakectl can and should filter on its side. There is no efficiency gain from filtering on the server side, most of the slowness comes from the server needing to iterate over many commits. It is also very easy to perform on the high-level python SDK.
Just iterate over calls to log-commits in the client and filter out commits with >1 parent. It will be simpler code and allow filtering as desired.
api/swagger.yml
Outdated
@@ -3540,6 +3540,11 @@ paths: | |||
description: if set to true, follow only the first parent upon reaching a merge commit | |||
schema: | |||
type: boolean | |||
- in: query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a fan of parameters with negative names. They're mostly okay as commandline flag, less so in an api.
However here I have no gear alternative to suggest: the default has to remain with merges. If you do have a good alternative I'm very open to suggestions here.
Otherwise let's go with this one! If we keep on having to add flags, we'll anyway need to end up with an expression parser / GraphQL. So I'm willing to wait until we need another flag.
More importantly is why we need an API change if we just want to filter; that appears on the main comment for this review.
pkg/graveler/ref/commit_iterator.go
Outdated
if ci.noMerges && len(ci.value.Parents) > 1 { | ||
firstParent := ci.value.Parents[0] | ||
|
||
// case parent was visited, continuing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand.
- Here no-merges implies first-parent. Why? I would expect no-merges to do just one thing.
- How does this code work when there are 2 merges one after the other? firstParent Will also be a merge and the code should skip it.
- (Related) Just skip the current record of its a merge and keep going. This is a filter operation, let the next iteration do its thing
91be003
to
d2c2a5f
Compare
Thank You all for your reviews. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, I really like this code!
There is some nastiness (an actual tech debt, even a bug) in the original code, that became bad when we add filtering. If you specify --amount
(the case amount != 0
on l. 177) then the original code will never return >1000 results, even if amount == 2000
. The new code compounds this: if the first page does not contain even amount
results, then it will perform a single call. So it will display < amount
results. For instance, say I specify --amount=10
. Then the code asks for 10 log entries, and displays only those which are not merges. So it will often fail to display exactly 10 results.
I would suggest:
- Change the loop to reduce
amount
by the number of results actually output.- Exit the loop if we get to
amount <= 0
.
- Exit the loop if we get to
- When setting up initial pagination, ask for
amount
results if not filtering, or3*amount
results if we are filtering. This is a heuristic: it guesses that at most 2/3 commits will be merges, so it asks for more. But it always works even if it guessed wrong - it just calls lakeFS more times than it could have.
WDYT?
cmd/lakectl/cmd/log.go
Outdated
@@ -66,6 +66,19 @@ func (d *dotWriter) Write(commits []apigen.Commit) { | |||
} | |||
} | |||
|
|||
// filter merge commits, used for --no-merges flag | |||
func filterMergeCommits(commits []apigen.Commit) []apigen.Commit { | |||
var filteredCommits []apigen.Commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var filteredCommits []apigen.Commit | |
filteredCommits := make([]apigen.Commit, 0, len(commits)) |
This is fairly idiomatic Go for when you know roughly the expected length of the result. It says to reserve space for the same number of commits, but keep the slice itself at length 0.
@arielshaqed As a result of using the heuristic, while using both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Some very good progress here. Besides my comments, we're missing some tests coverage for the new flag
cmd/lakectl/cmd/common_helpers.go
Outdated
@@ -40,6 +40,7 @@ const ( | |||
const ( | |||
internalPageSize = 1000 // when retrieving all records, use this page size under the hood | |||
defaultAmountArgumentValue = 100 // when no amount is specified, use this value for the argument | |||
maxAmountNoMerges = 333 // when using --no-merges & amount, this const is the upper limit use huristic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rephrase, I don't understand what this means and why 333(?!)
cmd/lakectl/cmd/log.go
Outdated
@@ -100,6 +114,10 @@ var logCmd = &cobra.Command{ | |||
if amountForPagination <= 0 { | |||
amountForPagination = internalPageSize | |||
} | |||
// case --no-merges & --amount, ask for more results since some will filter out | |||
if noMerges && amountForPagination < maxAmountNoMerges { | |||
amountForPagination *= 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I get the 333
- it's directly linked to 3. I wonder how our linter doesn't complain about this magic number.
cmd/lakectl/cmd/log.go
Outdated
// case --no-merges, filter commits and subtract that amount from amount desired | ||
if noMerges { | ||
data.Commits = filterMergeCommits(data.Commits) | ||
amount = amount - (3 * len(data.Commits)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it tripled here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought was that if we use that heuristic, in this way we stay true to it. We asked for three times the amount we needed to get ~1/3 results of that. Updating the amount this way will keep that heuristic in every request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better if we have a single location to handle amount heuristics. Can you try to combine this block and the above one (lines 114-120) into one? Preferably where the loop starts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong here, works better with subtracting only the original size. subtracting X3 amount caused receiving less results than asked.
@ItamarYuran Can we refactor the PR title from: "7441 give an option to hide merge commits in a branch in lakectl" |
♻️ PR Preview 91df0b1 has been successfully destroyed since this PR has been closed. 🤖 By surge-preview |
@arielshaqed It works! at least it seems like. Failures occurring because of some other reason |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... Failures occurring because of some other reason
Unfortunately all Estis seem to fail at the same spot. If we pull this our CI will break. So it doesn't matter if the bad test code was added here or not, I cannot approve.
What can we do?
First, skip your added test (add t.Skip()
at the beginning) and get Esti to run on it again. Does it still fail?
If it stopped failing, we need to look at the added test - or perhaps t.Skip() another test to see if it's just one extra test which breaks everything. If it still fails, we will need to understand why (whether) the trunk fails, and open an issue to fix trunk.
Sorry.
cmd/lakectl/cmd/common_helpers.go
Outdated
@@ -41,6 +41,10 @@ const ( | |||
internalPageSize = 1000 // when retrieving all records, use this page size under the hood | |||
defaultAmountArgumentValue = 100 // when no amount is specified, use this value for the argument | |||
|
|||
// when using --no-merges & amount, this const is the upper limit to use huristic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
// when using --no-merges & amount, this const is the upper limit to use huristic. | |
// when using --no-merges & amount, this const is the upper limit to use heuristic. |
cmd/lakectl/cmd/common_helpers.go
Outdated
@@ -41,6 +41,10 @@ const ( | |||
internalPageSize = 1000 // when retrieving all records, use this page size under the hood | |||
defaultAmountArgumentValue = 100 // when no amount is specified, use this value for the argument | |||
|
|||
// when using --no-merges & amount, this const is the upper limit to use huristic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment also says "3". So that constant needs to appear here, too.
esti/lakectl_test.go
Outdated
@@ -338,6 +338,59 @@ func TestLakectlMergeAndStrategies(t *testing.T) { | |||
RunCmdAndVerifySuccessWithFile(t, Lakectl()+" fs ls lakefs://"+repoName+"/"+featureBranch+"/", false, "lakectl_fs_ls_1_file", vars) | |||
} | |||
|
|||
func TestLakectlLogNoMergesWithCommitsAndMerges(t *testing.T) { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: please remove this blank line, it does not aid readability.
Note that these runs on master all seem to succeed; I started another one right now, but I would expect it also to succeed. |
I also re-ran Esti on this branch. This will give us more confidence whether the problem is on trunk or only on this branch. |
I think this means that Esti actually discovered an issue with the code. It is getting a wrong result when logging after a merge. 🙀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last nit found... and we should be done!
} | ||
amount -= len(data.Commits) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this still prints an incorrect number of commits. Think of the case where you get more commits after filtering than you really needed. Something like data.Commits = data.Commits[:amount]
should do the job.
Please also add a test that would fail without this fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, great feature!
Closes #7441
Added an option to filter merge commits in the log command. The flag
--no-merges
will filter every commit that has more than one parent and thus also import commits. The implementation concerns only lakectl at the moment. Filtering is done at The server's side similarly to--first-parent
flag.Testing Details
Tested only manually, original test pass (after a slight modification)
Additional info
Contact Details
itamar.yuran@treeverse.io