[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,980 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,980 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,981 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,981 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-12-11 09:36:14,981 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2024-12-11 09:36:15,535 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2024-12-11 09:36:15] llamafactory.data.template:157 >> Replace eos token: <|eot_id|>
[INFO|2024-12-11 09:36:15] llamafactory.data.template:157 >> Add pad token: <|eot_id|>
[INFO|2024-12-11 09:36:15] llamafactory.data.loader:157 >> Loading dataset train.json...
[rank1]:[W1211 09:36:15.439571225 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W1211 09:36:16.042658959 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
W1211 09:36:16.766000 4052 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 4072 closing signal SIGTERM
E1211 09:36:16.798000 4052 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 4071) of binary: /home/localadmin/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/localadmin/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/localadmin/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/localadmin/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-11_09:36:16
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 4071)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 4071
============================================================
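As a point of reference for the ProcessGroupNCCL barrier warning earlier in this log, the standalone sketch below shows how a rank can bind itself to its GPU at init time so barrier() does not have to guess the device. This is not LLaMA-Factory's actual startup code, and the device_id argument assumes a recent PyTorch release; it addresses only the warning, not the SIGSEGV itself.

import os

import torch
import torch.distributed as dist

# Minimal sketch, intended to be launched with torchrun (which sets LOCAL_RANK).
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Binding the process group to this rank's GPU removes the
# "devices used by this process are currently unknown" ambiguity
# that the warning above refers to.
dist.init_process_group(backend="nccl", device_id=torch.device("cuda", local_rank))

dist.barrier()  # now runs on the explicitly bound device
dist.destroy_process_group()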
Reminder
System Info
llamafactory version: 0.9.1.dev0
Reproduction
FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0,1 llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /home/localadmin/.cache/modelscope/hub/LLM-Research/Llama-3.2-3B \
    --preprocessing_num_workers 2 \
    --finetuning_type lora \
    --template llama3 \
    --flash_attn auto \
    --dataset_dir /home/localadmin/LLaMA-Factory/data \
    --dataset train \
    --cutoff_len 2048 \
    --learning_rate 0.0001 \
    --num_train_epochs 20.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 1000 \
    --warmup_steps 0 \
    --packing False \
    --report_to none \
    --output_dir /home/localadmin/LLaMA-Factory/saves/Llama-3.2-3B/train_2024-12-06-09-40-15 \
    --bf16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --optim adamw_torch \
    --quantization_bit 4 \
    --quantization_method bitsandbytes \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all
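For readers unfamiliar with the flags, the sketch below shows roughly what the quantization and LoRA options above correspond to in plain transformers/peft terms. The mapping is an assumption for illustration only (LLaMA-Factory builds its own configuration internally); the model path is the one from the command above.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Rough counterpart of --quantization_bit 4 --quantization_method bitsandbytes --bf16 True.
# (Assumed mapping; not LLaMA-Factory's internal code.)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "/home/localadmin/.cache/modelscope/hub/LLM-Research/Llama-3.2-3B",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# Rough counterpart of --lora_rank 8 --lora_alpha 16 --lora_dropout 0 --lora_target all.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()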
Expected behavior
What is the cause of this error?
Others
No response