Error in training #218

linc4ekk · 2022-06-27T17:22:19Z

Hi, everybody!
I am getting the following error while training a model on a single GPU on Ubuntu 22.04. I am new to Linux and to training on a local GPU. I start the run in Docker with the following command: ./train.sh

This is the error I receive:
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_qgm2o66y/none_xv9bcqsf/attempt_3/0/error.json

After this is printed, the run hangs for a couple of minutes. Then the epoch starts and fails with another error: the input tensor is reported as empty, although it is not empty when I print it. I guess the second error is caused by the first one. Can anybody help me with this?
