Hi, everybody!
I am getting the following error while training a model on a single GPU on Ubuntu 22.04. I am new to Linux and to training on a local GPU. I start the run in Docker with the following command: ./train.sh
This is the error I receive:
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_qgm2o66y/none_xv9bcqsf/attempt_3/0/error.json
After these lines are printed, the run hangs for a couple of minutes, then the epoch starts and fails with another error (the input tensor is reported as empty, although it is not empty when I print it), which I guess is caused by the first error. Can anybody help me with this?
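Both log lines above are tagged INFO, so they may be status messages rather than the failure itself. The second one names a reply file (error.json) where a failing worker can leave its error details. A minimal sketch for collecting whatever has been written there, assuming the files are JSON and live under the temp directory of the environment where training runs (inside the container when using Docker). The helper name `collect_worker_errors` is hypothetical and not part of PyTorch:

```python
import glob
import json
import os
import tempfile


def collect_worker_errors(tmp_root=None):
    """Return {path: parsed JSON} for each non-empty error.json under torchelastic_* dirs.

    Hypothetical helper: the directory layout is inferred from the log line
    in this post, not from a documented contract.
    """
    tmp_root = tmp_root or tempfile.gettempdir()
    pattern = os.path.join(tmp_root, "torchelastic_*", "**", "error.json")
    reports = {}
    for path in sorted(glob.glob(pattern, recursive=True)):
        try:
            if os.path.getsize(path) == 0:
                continue  # the worker has not written anything to this file
            with open(path, encoding="utf-8") as fh:
                reports[path] = json.load(fh)
        except (OSError, json.JSONDecodeError) as exc:
            # Keep going, but record why this file could not be read.
            reports[path] = {"unreadable": str(exc)}
    return reports


if __name__ == "__main__":
    for path, report in collect_worker_errors().items():
        print(path)
        print(json.dumps(report, indent=2))
```

If nothing is printed, no worker has written an error report yet, and the traceback of the later "input tensor is empty" failure is probably the more useful thing to inspect.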