Single-machine multi-GPU training error #6311

Closed
122550888 opened this issue Dec 11, 2024 · 3 comments
Labels
solved This problem has been already solved

Comments


122550888 commented Dec 11, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-5.15.0-125-generic-x86_64-with-glibc2.35
  • Python version: 3.11.7
  • PyTorch version: 2.5.1+cu124 (GPU)
  • Transformers version: 4.46.1
  • Datasets version: 3.1.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: Tesla M10
  • Bitsandbytes version: 0.45.0

Reproduction

FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /home/localadmin/.cache/modelscope/hub/LLM-Research/Llama-3.2-3B \
    --preprocessing_num_workers 2 \
    --finetuning_type lora \
    --template llama3 \
    --flash_attn auto \
    --dataset_dir /home/localadmin/LLaMA-Factory/data \
    --dataset train \
    --cutoff_len 2048 \
    --learning_rate 0.0001 \
    --num_train_epochs 20.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 1000 \
    --warmup_steps 0 \
    --packing False \
    --report_to none \
    --output_dir /home/localadmin/LLaMA-Factory/saves/Llama-3.2-3B/train_2024-12-06-09-40-15 \
    --bf16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --optim adamw_torch \
    --quantization_bit 4 \
    --quantization_method bitsandbytes \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all

[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,980 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,980 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,981 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,981 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,981 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2024-12-11 09:36:15,535 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2024-12-11 09:36:15] llamafactory.data.template:157 >> Replace eos token: <|eot_id|>
[INFO|2024-12-11 09:36:15] llamafactory.data.template:157 >> Add pad token: <|eot_id|>
[INFO|2024-12-11 09:36:15] llamafactory.data.loader:157 >> Loading dataset train.json...
[rank1]:[W1211 09:36:15.439571225 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W1211 09:36:16.042658959 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W1211 09:36:16.766000 4052 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 4072 closing signal SIGTERM
E1211 09:36:16.798000 4052 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 4071) of binary: /home/localadmin/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/localadmin/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/localadmin/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-11_09:36:16
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 4071)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 4071
============================================================
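As an aside, the NCCL warning above can be avoided by binding each rank to its GPU when the process group is created, as the message itself suggests. The following is a minimal standalone sketch (assuming torchrun exports LOCAL_RANK; it is independent of LLaMA-Factory's own launcher and only silences the warning, which is not necessarily related to the SIGSEGV):

# Minimal sketch: bind each rank to its GPU so NCCL barriers do not have to
# guess the rank-to-GPU mapping. Assumes torchrun exports LOCAL_RANK.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),  # supported in recent PyTorch releases
)
dist.barrier()  # no device guessing, so the warning is not emitted
dist.destroy_process_group()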

Expected behavior

What is the cause of this error?

Others

No response

github-actions bot added the pending (This problem is yet to be addressed) label Dec 11, 2024
@laujianmin

+1 same issue.


Mrxyy commented Dec 15, 2024

+1

@laujianmin
I found the solution in a related issue that points to DeepSpeed: using deepspeed==0.15.4 solves the problem.

I think this is a DeepSpeed issue rather than a LLaMA-Factory one.
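For anyone hitting the same segfault, pinning the version mentioned above is a one-line change in the training environment, e.g. pip install deepspeed==0.15.4 (assuming pip manages that environment); whether newer DeepSpeed releases also fix it is not confirmed in this thread.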

hiyouga added the solved label and removed the pending label Dec 17, 2024
hiyouga closed this as completed Dec 17, 2024