Pad 134 #254

Open
wants to merge 4 commits into
base: main
4 changes: 4 additions & 0 deletions Dockerfile-default-rocm
@@ -195,6 +195,10 @@ ENV WITH_RCCL=$WITH_RCCL
ARG WITH_NFS_WORKAROUND=1
ENV WITH_NFS_WORKAROUND=$WITH_NFS_WORKAROUND

#MIOPEN_DEBUG_SAVE_TEMP_DIR is required to prevent
# PAD-134
ENV MIOPEN_DEBUG_SAVE_TEMP_DIR=1

ENTRYPOINT ["/container/bin/scrape_libs.sh"]
CMD ["/bin/bash"]
USER root
2 changes: 1 addition & 1 deletion dockerfile_scripts/additional-requirements-rocm.txt
@@ -10,7 +10,7 @@ Pillow>=8.3.2,<=9.5.0
analytics-python
nvidia-ml-py
protobuf<=3.20.3
-tensorboard==2.10.1
+tensorboard==1.15
Contributor

Hold up, this can't be right: this is a TensorBoard version for TensorFlow v1, from 2019.
Can we use a more modern version?

Contributor Author

@will-HPE, Apr 9, 2024

We can test a different version, but 2.10.1 fails end-to-end testing, and the failure was immortalized in Jira as PAD-91:
https://hpe-aiatscale.atlassian.net/browse/PAD-91

103_run_mlde_validation_suite_against_rocm_on_grenoble/determined-...
tests/nightly/test_pytorch2.py::test_pytorch2_hf_language_modeling_distributed FAILED [100%]

The reason for the failure is:

ImportError: TensorBoard logging requires TensorBoard version 1.15 or above
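
One way to confirm which TensorBoard the image actually ships before changing the pin is to query the installed distribution metadata. This is only a minimal sketch: `installed_version` is a hypothetical helper, not part of the test suite, and it assumes the package is installed under the distribution name `tensorboard`.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version string, or None when the package is not installed.
    print("tensorboard:", installed_version("tensorboard"))
```

Running this inside the container where the nightly test fails would show whether the version in the environment matches the pin in the requirements file.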

Contributor

I feel strongly that we cannot pin 1.15 here: it's 5 years old, likely to conflict with other dependencies, and likely to have CVEs.
2.10.1 is also technically above 1.15...
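
To illustrate the point that 2.10.1 already satisfies a "1.15 or above" floor, here is a minimal sketch of a numeric version comparison. It assumes plain dotted numeric versions, and `parse_version` and `meets_minimum` are hypothetical helpers, not the check that raises the ImportError quoted earlier in this thread.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '2.10.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """True when installed is at or above minimum, compared component-wise."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so '1.15' and '1.15.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    print(meets_minimum("2.10.1", "1.15"))
    print(meets_minimum("1.14.0", "1.15"))
```

If the comparison holds for 2.10.1, the failure may stem from something other than the pinned number itself, which would be worth ruling out before downgrading.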

pynvml
tokenizers==0.13.0
huggingface-hub==0.16.4