Support for sdxl pipeline (testing) #152
Before this sort of change lands, let's think a bit about what we actually want coverage for. I'm skeptical about having benchmarks built into the same testing flow... though the suite that Stella set up has these:
Just because we care so much about sdxl perf, I think it would be great to have it included in this flow. I didn't look into adding a whole separate flow for it or making it very scalable, because I doubt we will be adding benchmarks for anything else. That's why I also just went with hardcoded commands (also, the flags have to live in conftest.py because some flag values are path names relative to other directories that we resolve there). I was thinking we can evaluate and iterate if this becomes a bigger utility, but for now I just went for the easiest/simplest way to add benchmarking. Here is an example log with everything running: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/8543035287/job/23405926540
The value of continuous benchmarks is clear, but I want to be careful about how we integrate them. For this PR, can you leave the benchmarks off and focus on just adding the sdxl models? A follow-up PR can then add benchmarking.
I'd at least like to take some time to work through the specific requirements before jumping straight to an implementation. For example:
I'm also wondering if we want to use pytest as the benchmark runner (either with the existing conftest.py or a forked one), or if we would want to use another runner (we could start with a pile of scripts, just reusing the same test suite source files).
It might be reasonable to start with `pytest -k benchmark` or `pytest iree_tests/benchmarks` that just runs `iree-benchmark-module` instead of `iree-run-module`, and then let developers dig through the GitHub Actions logs to see results, but I'm worried about going down the path of building an entirely new benchmark "framework" when we already have https://github.com/openxla/iree-comparative-benchmark and https://github.com/openxla/iree/tree/main/build_tools/benchmarks (building something new is likely going to make sense, at least in the short term, but this stuff gets complicated very quickly).
For example: https://github.com/openxla/community/blob/main/rfcs/20230505-benchmarking-strategy.md
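As a rough sketch of what that minimal `pytest -k benchmark` starting point could look like (the test name, module path, and flag values below are placeholders, not the suite's actual configuration):

```python
import subprocess

import pytest


@pytest.mark.benchmark  # mark would be registered in conftest.py/pytest.ini
def test_benchmark_sdxl_unet():
    # Same artifacts a correctness test would use, but run through
    # iree-benchmark-module instead of iree-run-module.
    result = subprocess.run(
        [
            "iree-benchmark-module",
            "--module=sdxl_unet.vmfb",    # placeholder artifact name
            "--device=local-task",        # placeholder device
            "--function=main",
            "--benchmark_repetitions=3",
        ],
        capture_output=True,
        text=True,
    )
    # Fail if benchmarking itself errored; the timings land in the CI logs.
    assert result.returncode == 0, result.stderr
```

Selectable via `pytest -k benchmark` (name match) or `pytest -m benchmark` (marker match).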
I believe that can be done with a matrix too: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#example-using-a-multi-dimension-matrix
Could then check other matrix parameters to choose which steps to run... maybe like this:
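(Hedged sketch below; the `suite`/`device` dimensions, runner, and commands are illustrative placeholders, not the actual workflow.)

```yaml
jobs:
  test:
    strategy:
      matrix:
        suite: [onnx, models]
        device: [cpu, gpu]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: pytest iree_tests/${{ matrix.suite }} -k ${{ matrix.device }}
      - name: Run benchmarks
        # Check another matrix parameter to decide whether a step runs at all.
        if: matrix.suite == 'models' && matrix.device == 'gpu'
        run: pytest iree_tests/benchmarks
```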
I'm also referencing https://github.com/openxla/iree/blob/573ff1ff02347266ed747dd316cefaeb4c710396/.github/workflows/ci.yml#L749-L784 (probably tons of other files to reference across GitHub, but that's what I know already...)
If we wanted to plug into the in-tree benchmark infrastructure that IREE has, we'd want a PR like iree-org/iree#16965. That would feed into https://perf.iree.dev/ and PR comments, but it doesn't also test correctness out of the box, can be tricky to update (multiple Python files, coupled with GitHub Actions), and puts input files/parameters behind a few levels of abstraction that make it harder to run locally.
Yeah, I also wonder how that would work, because we need to essentially compile multiple submodels and then use their vmfbs for the pipeline's vmfb. Not sure if it is set up for a pipeline structure.
Also, I've placed the onnx and model tests in different jobs. I think that's best for this suite. Because they don't depend on each other and run independently on different machines, I don't think we need the sequential steps. This way we get parallel execution, which can help with scalability in the future. Once we get more machines, splitting on models would also be great :)
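Roughly this shape, as a sketch (runner labels and test paths are placeholders):

```yaml
jobs:
  test_onnx:
    runs-on: ubuntu-latest            # placeholder runner
    steps:
      - uses: actions/checkout@v4
      - run: pytest iree_tests/onnx
  test_models:
    # Independent job: no `needs:`, so it runs in parallel on its own machine.
    runs-on: [self-hosted, gpu]       # placeholder labels
    steps:
      - uses: actions/checkout@v4
      - run: pytest iree_tests/pytorch/models
```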
Here is the PR for benchmarking that should be landed after this one: #155. Feel free to add notes there for future reference.
Seeing more of these files, I'm still thinking about how to keep them easy to update. Might refactor to separate JSON files like
`test_case_splats.json`:
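A guess at the shape such a file could take (every field name here is hypothetical, not the suite's actual schema):

```json
{
  "//": "Hedged sketch; all field names are hypothetical.",
  "name": "sdxl_unet_splats",
  "compile_flagfile": "splats_compile_flags.txt",
  "runtime_flagfile": "splats_run_flags.txt",
  "remote_files": []
}
```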
`test_case_real_weights.json`:
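And the real-weight variant, with the same hypothetical fields:

```json
{
  "//": "Hedged sketch; all field names are hypothetical.",
  "name": "sdxl_unet_real_weights",
  "compile_flagfile": "real_weights_compile_flags.txt",
  "runtime_flagfile": "real_weights_run_flags.txt",
  "remote_files": [
    "https://example.com/sdxl/real_weights.irpa"
  ]
}
```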
Hmm yeah, this is probably easier to decode/update.