Single-machine multi-GPU training error #6311

Closed
122550888 opened this issue Dec 11, 2024 · 3 comments
Labels
solved This problem has been already solved

Comments


122550888 commented Dec 11, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-5.15.0-125-generic-x86_64-with-glibc2.35
  • Python version: 3.11.7
  • PyTorch version: 2.5.1+cu124 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: Tesla M10
  • Bitsandbytes version: 0.45.0

Reproduction

FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /home/localadmin/.cache/modelscope/hub/LLM-Research/Llama-3.2-3B \
    --preprocessing_num_workers 2 \
    --finetuning_type lora \
    --template llama3 \
    --flash_attn auto \
    --dataset_dir /home/localadmin/LLaMA-Factory/data \
    --dataset train \
    --cutoff_len 2048 \
    --learning_rate 0.0001 \
    --num_train_epochs 20.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 1000 \
    --warmup_steps 0 \
    --packing False \
    --report_to none \
    --output_dir /home/localadmin/LLaMA-Factory/saves/Llama-3.2-3B/train_2024-12-06-09-40-15 \
    --bf16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --optim adamw_torch \
    --quantization_bit 4 \
    --quantization_method bitsandbytes \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all

[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,980 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,980 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,981 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,981 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,981 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2024-12-11 09:36:15,535 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2024-12-11 09:36:15] llamafactory.data.template:157 >> Replace eos token: <|eot_id|>
[INFO|2024-12-11 09:36:15] llamafactory.data.template:157 >> Add pad token: <|eot_id|>
[INFO|2024-12-11 09:36:15] llamafactory.data.loader:157 >> Loading dataset train.json...
[rank1]:[W1211 09:36:15.439571225 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W1211 09:36:16.042658959 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W1211 09:36:16.766000 4052 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 4072 closing signal SIGTERM
E1211 09:36:16.798000 4052 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 4071) of binary: /home/localadmin/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/localadmin/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/localadmin/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-11_09:36:16
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 4071)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 4071
============================================================
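As an aside, the NCCL warning above can be avoided by binding each rank to its GPU when the process group is created, as the message itself suggests. The following is a minimal standalone sketch (assuming torchrun exports LOCAL_RANK; it is independent of LLaMA-Factory's own launcher and only silences the warning, which is not necessarily related to the SIGSEGV):

# Minimal sketch: bind each rank to its GPU so NCCL barriers do not have to
# guess the rank-to-GPU mapping. Assumes torchrun exports LOCAL_RANK.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),  # supported in recent PyTorch releases
)
dist.barrier()  # no device guessing, so the warning is not emitted
dist.destroy_process_group()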

Expected behavior

What is the cause of this error?

Others

No response

github-actions bot added the pending (This problem is yet to be addressed) label Dec 11, 2024
@laujianmin

+1 same issue.


Mrxyy commented Dec 15, 2024

+1

@laujianmin
I found the solution in a related issue that points to DeepSpeed: using deepspeed==0.15.4 solves the problem.

I think this is a DeepSpeed issue rather than a LLaMA-Factory one.
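For anyone hitting the same segfault, pinning the version mentioned above is a one-line change in the training environment, e.g. pip install deepspeed==0.15.4 (assuming pip manages that environment); whether newer DeepSpeed releases also fix it is not confirmed in this thread.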

hiyouga added the solved label and removed the pending label Dec 17, 2024
hiyouga closed this as completed Dec 17, 2024