Pad 134 #254

Open
wants to merge 4 commits into
base: main

Conversation

Contributor

@will-HPE will-HPE commented Apr 8, 2024

TensorBoard and GPU kernel build fixes (PAD-91 and PAD-134)

  • fix: bug fix
  1. The latest TensorBoard version that gets pulled in has issues, so we pin the required version to one known to work (PAD-91); and
  2. During the GPU kernel build, the cleanup step fails with an error (PAD-134), so we set an environment variable to prevent the cleanup of directories that are not empty.

Checklist

  • [✓] Bump VERSION to make sure the pushed images are tagged with the right version.
  • Licenses should be included for new code which was copied and/or modified from any external code.
  • Test the images by running the test bumpenvs procedure in the determined repo. See README.

@cla-bot cla-bot bot added the cla-signed label Apr 8, 2024
@will-HPE will-HPE requested review from ioga and karlonw April 8, 2024 20:58
Contributor

@karlonw karlonw left a comment

looks good to me

@@ -10,7 +10,7 @@ Pillow>=8.3.2,<=9.5.0
 analytics-python
 nvidia-ml-py
 protobuf<=3.20.3
-tensorboard==2.10.1
+tensorboard==1.15
Contributor

hold up, this can't be right, this is a tensorboard version for tensorflow v1 from 2019.
can we have a version that is more modern?

Contributor Author

@will-HPE will-HPE Apr 9, 2024

We can test a different version, but 2.10.1 fails in end-to-end testing and was immortalized in Jira PAD-91:
https://hpe-aiatscale.atlassian.net/browse/PAD-91

103_run_mlde_validation_suite_against_rocm_on_grenoble/determined-...
tests/nightly/test_pytorch2.py::test_pytorch2_hf_language_modeling_distributed FAILED [100%]

The reason for the failure is:

ImportError: TensorBoard logging requires TensorBoard version 1.15 or above

Contributor

I have a strong opinion we cannot pin 1.15 here because it's 5 years old and likely to conflict with other dependencies and have CVEs.
2.10.1 is also technically above 1.15...
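
To illustrate the point that 2.10.1 already satisfies a "1.15 or above" requirement, here is a minimal sketch. It assumes the check is a plain numeric comparison against the module's `__version__`; the helper names and the stand-in objects are hypothetical and are not taken from this PR or from any library. It also shows one way such a guard could fail regardless of the pinned version, namely when the imported module exposes no `__version__` at all. Whether that is what happened in PAD-91 is not established here.

```python
import re
from types import SimpleNamespace


def parse_version(text):
    """Turn '2.10.1' into (2, 10, 1); stop at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(module, minimum="1.15"):
    """Hypothetical guard: True only if module exposes __version__ >= minimum."""
    version = getattr(module, "__version__", None)
    if version is None:
        # A module without __version__ fails the guard no matter what is pinned.
        return False
    return parse_version(version) >= parse_version(minimum)


# Stand-ins for an imported module (assumptions, not the real package).
healthy = SimpleNamespace(__version__="2.10.1")
broken = SimpleNamespace()  # e.g. an import that resolved to something else

print(meets_minimum(healthy))  # True
print(meets_minimum(broken))   # False
```

If the second case matches what the failing environment reports, the fix would lie in the install rather than in pinning an older release.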

VERSION Outdated
@@ -1 +1 @@
-0.30.1
+0.31.1
Contributor

we've just released 0.30.0; if it'll land (and get into bumpenvs) by EOD today, you can keep it at 0.30.1. otherwise it'll probably be 0.30.2
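
The numbering in the comment above is a patch-level bump from the just-released 0.30.0. A minimal sketch, assuming plain MAJOR.MINOR.PATCH strings; `bump_patch` is a hypothetical helper for illustration and is not part of this repo.

```python
def bump_patch(version):
    """Return the next patch release for a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


released = "0.30.0"
next_version = bump_patch(released)   # used if the PR lands before the next bump
fallback = bump_patch(next_version)   # used if another change takes that number first

print(next_version, fallback)  # 0.30.1 0.30.2
```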

4 participants