Support for sdxl pipeline (testing) #152

Merged
saienduri merged 24 commits into main from sdxl_pipeline_sai_v2 on Apr 4, 2024

Conversation

Contributor

@saienduri saienduri commented Apr 3, 2024

This commit adds support for testing all sdxl submodels

iree_tests/configs/config_sdxl_cpu_llvm_task.json (outdated review thread, resolved)
Comment on lines 342 to 347
def test_benchmark(self):
    # Run the preconfigured benchmark command for this test case and capture its output.
    proc = subprocess.run(self.benchmark_args, capture_output=True, cwd=self.test_cwd)
    if proc.returncode != 0:
        raise IreeRunException(proc, self.test_cwd, self.compile_args)
    outs = proc.stdout.decode("utf-8")
    # Surface the benchmark output in the pytest / CI log.
    print(f"Stdout benchmark:\n{outs}\n")
Member

Before this sort of change lands, let's think a bit about what we actually want coverage for. I'm skeptical about having benchmarks built into the same testing flow... though the suite that Stella set up has these:

Contributor Author

@saienduri saienduri Apr 3, 2024

Just because we care so much about sdxl perf, I think it would be great to have it included in this flow. I didn't look into adding a whole separate flow for it or making it very scalable, because I doubt we will be adding benchmarks for anything else. That's also why I just went with hardcoded commands (the flags also have to live in conftest.py because some flag values are path names relative to other directories that we only figure out there). I was thinking we can evaluate and iterate if this becomes a bigger utility, but for now I went with the easiest, simplest way to add benchmarking. Here is an example log with everything running: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/8543035287/job/23405926540

Member

The value of continuous benchmarks is clear, but I want to be careful about how we integrate them. For this PR, can you leave the benchmarks off and focus on just adding the sdxl models? A follow-up PR can then add benchmarking.

I'd at least like to take some time to work through the specific requirements before jumping straight to an implementation. For example:

  • What metrics/artifacts do we want from benchmarking?
    • Each model in isolation? Full pipeline latency? Just dispatch time?
  • What do we want done with benchmark results / artifacts?
    • The in-tree benchmarks in IREE submit results to a dashboard (that should use a queryable database...), upload Tracy files to cloud storage, and comment on pending pull requests with results summaries
  • Where do we want benchmarks to run?
    • Right after tests, on presubmit to IREE?
    • In a separate job, on separate runners?

I'm also wondering if we want to use pytest as the benchmark runner (either with the existing conftest.py or a forked one), or if we would want to use another runner (could start with a pile of scripts, just using the same test suite source files)
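If we did stick with pytest, gating benchmarks behind an opt-in flag could stay pretty small. Rough sketch only (the --run-benchmarks option and the benchmark marker are hypothetical, nothing the suite defines today):

# conftest.py (sketch): make benchmark tests opt-in so the default test run is unchanged
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--run-benchmarks",
        action="store_true",
        default=False,
        help="also run tests marked with @pytest.mark.benchmark",
    )

def pytest_configure(config):
    config.addinivalue_line("markers", "benchmark: marks a test as a benchmark")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-benchmarks"):
        return
    skip_benchmarks = pytest.mark.skip(reason="needs --run-benchmarks to run")
    for item in items:
        if "benchmark" in item.keywords:
            item.add_marker(skip_benchmarks)

Something like pytest iree_tests --run-benchmarks -k benchmark would then select only the benchmark cases.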

Member

It might be reasonable to start with pytest -k benchmark or pytest iree_tests/benchmarks that just run iree-benchmark-module instead of iree-run-module and then let developers dig through the GitHub Actions logs to see results, but I'm worried about going down the path of building an entirely new benchmark "framework" when we already have https://github.com/openxla/iree-comparative-benchmark and https://github.com/openxla/iree/tree/main/build_tools/benchmarks (building something new is likely going to make sense, at least in the short term, but this stuff gets complicated very quickly)
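A bare-bones version of that could look something like this sketch (iree-benchmark-module is the real tool, but the directory layout, vmfb name, and test function are placeholders):

# iree_tests/benchmarks/test_sdxl_benchmark.py (sketch)
import subprocess
from pathlib import Path

def test_sdxl_prompt_encoder_benchmark():
    # Hypothetical layout: a previously compiled vmfb plus the real-weights flag file.
    test_dir = Path("iree_tests/pytorch/models/sdxl-prompt-encoder")
    flags = (test_dir / "real_weights_data_flags.txt").read_text().split()
    cmd = [
        "iree-benchmark-module",
        "--module=model_cpu.vmfb",  # hypothetical artifact name
        "--device=local-task",
    ] + flags
    proc = subprocess.run(cmd, capture_output=True, text=True, cwd=test_dir)
    assert proc.returncode == 0, proc.stderr
    # No framework on top: developers read the numbers from the GitHub Actions log.
    print(proc.stdout)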

Member

this stuff gets complicated very quickly

For example: https://github.com/openxla/community/blob/main/rfcs/20230505-benchmarking-strategy.md

Member

(think the only way to split by runner is having them on different jobs).

I believe that can be done with a matrix too: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#example-using-a-multi-dimension-matrix

Could then check other matrix parameters to choose which steps to run... maybe like this:

jobs:
  iree_tests:
    strategy:
      matrix:
        configuration:
          - runner: ubuntu-22.04
            test_command: "pytest iree_tests/simple --durations=0"
          - runner: ubuntu-22.04
            test_command: "pytest iree_tests/onnx/node/generated -n auto -rpfE --timeout=30 --retries 2 --retry-delay 5 --durations=10"
    runs-on: ${{ matrix.configuration.runner }}
    steps:
      ...

I'm also referencing https://github.com/openxla/iree/blob/573ff1ff02347266ed747dd316cefaeb4c710396/.github/workflows/ci.yml#L749-L784 (probably tons of other files to reference across GitHub, but that's what I know already...)

Member

If we wanted to plug in to the in-tree benchmark infrastructure that IREE has, we'd want a PR like iree-org/iree#16965. That would feed into https://perf.iree.dev/ and PR comments, but it doesn't also test correctness out of the box, can be tricky to update (multiple Python files, coupled with GitHub Actions), and puts input files / parameters behind a few levels of abstraction that make it harder to run locally.

Contributor Author

@saienduri saienduri Apr 4, 2024

Yeah, I also wonder how that would work, because we need to essentially compile multiple submodels and then use their vmfbs for the pipeline's vmfb. Not sure if it is set up for a pipeline structure.
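For reference, the shape of what we need is roughly the sketch below: compile each submodel, then run the pipeline with all of the resulting vmfbs loaded together. The file names, submodel list, and entry point are placeholders, and I'm not sure the in-tree infra models this multi-module step.

# Sketch: compile sdxl submodels, then run the pipeline against their vmfbs.
import subprocess

SUBMODELS = ["prompt_encoder", "scheduled_unet", "vae_decode"]  # illustrative names

def compile_submodel(name: str) -> str:
    vmfb = f"{name}_cpu.vmfb"
    subprocess.run(
        ["iree-compile", f"{name}.mlir", "--iree-hal-target-backends=llvm-cpu", "-o", vmfb],
        check=True,
    )
    return vmfb

submodel_vmfbs = [compile_submodel(name) for name in SUBMODELS]

# The pipeline module is loaded last, alongside the submodel modules it calls into.
cmd = ["iree-run-module", "--device=local-task", "--function=main"]
cmd += [f"--module={vmfb}" for vmfb in submodel_vmfbs]
cmd += ["--module=sdxl_pipeline_cpu.vmfb"]  # hypothetical pipeline vmfb
subprocess.run(cmd, check=True)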

Contributor Author

Also, I've placed the onnx and model tests in different jobs. I think that's best for this suite. Because they don't depend on each other and run independently on different machines, I don't think we need sequential steps. This way we get parallel execution, which helps with scalability in the future. Once we get more machines, splitting by model would also be great :)

Contributor Author

Here is the PR for benchmarking, which should land after this one: #155. Feel free to add notes there for future reference.

@saienduri saienduri changed the title from "Support for sdxl pipeline (testing + benchmarking)" to "Support for sdxl pipeline (testing)" on Apr 4, 2024
Comment on lines 105 to +109
# TODO(scotttodd): add a local cache for these large files to a persistent runner
- name: "Downloading remote files for real weight model tests"
run: |
source ${VENV_DIR}/bin/activate
python3 iree_tests/download_remote_files.py
python3 iree_tests/download_remote_files.py --root-dir pytorch/models
Member

Oof this is taking 12 minutes to download: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/8546974154/job/23418304591?pr=152#step:7:14

We'll really want that to be cached for presubmit.
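One low-effort option, purely a sketch (download_remote_files.py doesn't do this today, and the cache path and helper name are made up), is a persistent on-runner cache keyed by URL so presubmit runs only pay the download cost once:

# Sketch: reuse large remote files from a persistent cache on the runner.
import hashlib
import shutil
import urllib.request
from pathlib import Path

CACHE_DIR = Path("/data/iree-test-file-cache")  # hypothetical persistent path

def fetch_cached(url: str, dest: Path) -> None:
    key = hashlib.sha256(url.encode()).hexdigest()
    cached = CACHE_DIR / key
    if not cached.exists():
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        # Only hit the network when the file isn't in the cache yet.
        with urllib.request.urlopen(url) as resp, open(cached, "wb") as out:
            shutil.copyfileobj(resp, out)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(cached, dest)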

Outdated review threads (resolved): .github/workflows/test_iree.yml (3), iree_tests/configs/config_sdxl_cpu_llvm_task.json (2)
Comment on lines +24 to +29
]
}
]
}
]
}
Member

Seeing more of these files, I'm still thinking about how to keep them easy to update. Might refactor to separate JSON files like:

test_case_splats.json:

{
  "name": "splats",
  "runtime_flagfile": "splat_data_flags.txt",
  "remote_files": []
}

test_case_real_weights.json:

{
  "name": "real_weights",
  "runtime_flagfile": "real_weights_data_flags.txt",
  "remote_files": [
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.0.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.1.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.2.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.3.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_output.0.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_output.1.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/real_weights.irpa"
  ]
}

Contributor Author

Hmm yeah, this is probably easier to decode/update
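The loader side could stay tiny too. Sketch only (the test_case_*.json glob follows the naming proposed above; the function itself is hypothetical, not current conftest.py code):

# Sketch: discover per-test-case JSON files next to each model's test directory.
import json
from pathlib import Path

def load_test_cases(test_dir: Path) -> list[dict]:
    cases = []
    for case_file in sorted(test_dir.glob("test_case_*.json")):
        with open(case_file) as f:
            # Each file carries its own name, runtime flag file, and remote files to fetch.
            cases.append(json.load(f))
    return cases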

Comment on lines 78 to 115
pytest iree_tests -n auto -k real_weights -rpfE --timeout=600 --retries 2 --retry-delay 5 --durations=0
pytest iree_tests/pytorch/models -s -n auto -k real_weights -rpfE --timeout=1200 --retries 2 --retry-delay 5 --durations=0
Member

We probably don't want -n auto. That's using 64 workers: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/8546974154/job/23418304591?pr=152#step:8:23

It miiiight work out, but multiple large program compiles will compete for the CPU and multiple large program runs will compete for the GPU. I guess ideally the compiler could be using the CPU while the runtime is using the GPU. We also need to be very careful with that for benchmarks - we don't want one benchmark influencing another.

One thing we can do is have multi-stage builds where a CPU machine runs the compiler and a GPU machine runs the tests/benchmarks. IREE's in-tree benchmarks follow that model. As long as the large model weights are externalized into parameter files and the .vmfb files are small, we can limit where we upload/download large files to just where they are needed. If we want to get really fancy, we could spin up GPU machines at the same time as CPU machines and have the GPU machines start to fetch weights (orrr just have them cached already) while the CPU machines work on compiling.

Contributor Author

Yeah, I removed -n auto on the benchmarking part for that reason. But sure, we can remove it here too and figure out how we want to proceed in the future. A pipeline on the runner side of things would be cool: the CPU machines compile and pass the vmfbs to the GPU runners to run, but that might also just complicate things. Thinking in terms of how long the CI run will take: instead of 64, do you think lowering it to 4 or so would work? (Or is that also too much competition?)

Member

Yeah I've been using 4.

@saienduri
Contributor Author

@ScottTodd is this good to merge now?

@saienduri saienduri requested a review from ScottTodd April 4, 2024 19:55
@saienduri saienduri merged commit 5596006 into main Apr 4, 2024
2 checks passed
@saienduri saienduri deleted the sdxl_pipeline_sai_v2 branch April 4, 2024 20:24
saienduri added a commit that referenced this pull request Apr 8, 2024
This commit adds the V0 support for testing all sdxl submodels. This PR is not a long-term solution for benchmarking, as Scott and I discussed here: #152. It is the result of a request from our team to get sdxl benchmarking in ASAP. Because it is high priority to add this to the sdxl testing while our team lands patches in IREE, this change simply gets the implementation in and working. Scott and I discussed some more intensive and better-structured ways to add benchmarking, which either of us may implement in the future.

Also, this PR depends on #152 landing first (hence the CI failure).

Notes for future if we decide that we need a stronger implementation: 

1. Maybe something like iree-org/iree#16965, which will feed into https://perf.iree.dev/.
2. This is the benchmarking framework we already have: https://github.com/openxla/iree-comparative-benchmark and https://github.com/openxla/iree/tree/main/build_tools/benchmarks
3. Some questions Scott raised to keep in mind for a future implementation:
   * What metrics/artifacts do we want from benchmarking?
     * Each model in isolation? Full pipeline latency? Just dispatch time?
   * What do we want done with benchmark results / artifacts?
     * The in-tree benchmarks in IREE submit results to a dashboard (that should use a queryable database...), upload Tracy files to cloud storage, and comment on pending pull requests with results summaries
   * Where do we want benchmarks to run?
     * Right after tests, on presubmit to IREE?
     * In a separate job, on separate runners?

If we decide benchmarking needs changes, we will address all of these and come up with a more structured, methodical implementation that either creates a new benchmarking flow here or plugs into the IREE benchmarking setup.
renxida pushed a commit that referenced this pull request Jul 18, 2024
This commit adds support for testing all sdxl submodels
renxida pushed a commit that referenced this pull request Jul 18, 2024
This commit adds the V0 support for testing all sdxl submodels.