Hi, I got the error RuntimeError: Expected all tensors to be on the same device when running this command: python trainer_rm.py --configs defaults_rm oasst-rm-1-pythia-1.4b. Could you help me fix it? Thanks.
I'm using:
4x RTX 3090
AMD Ryzen Threadripper PRO 3955WX 16-Cores
pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel
Full console output:
root@C.6468167:/workspace/OA/model/model_training$ python trainer_rm.py --configs defaults_rm oasst-rm-1-pythia-1.4b
[2023-06-28 01:43:44,710] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
warn(msg)
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//127.0.0.1'), PosixPath('8080/jm/4/31574'), PosixPath('http')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/opt/conda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
trainig_conf = Namespace(rng_seed=2703368087, is_reward_model=True, pooling='last', learning_rate='8e-6', gradient_checkpointing=False, gradient_accumulation_steps=4, per_device_train_batch_size=1, per_device_eval_batch_size=5, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon='1e-12', weight_decay=0.0, warmup_steps=50, eval_steps=500, save_steps=1000, save_strategy='steps', max_length=2048, num_train_epochs=2, logging_steps=10, max_grad_norm=2.0, save_total_limit=4, dtype='float32', eval_accumulation_steps=None, freeze_layer=None, cache_dir='.cache', loss_fn='RMLoss', score_l2_reg=0.001, eval_size=None, log_dir='base', quantization=False, seq2seqmodel=False, fuse_gelu=True, log_wandb=True, verbose=False, output_dir='.saved_models_rm', use_custom_sampler=True, residual_dropout=0.01, use_flash_attention=True, sort_by_length=False, per_digit_tokens=False, datasets_extra=[], metrics=['accuracy', 'kendalltau'], deepspeed_config='configs/zero_config.json', max_replies=5, residual_dropout_lima=False, datasets=[{'webgpt': {'val_split': 0.05, 'max_val_set': 1000}}], model_name='andreaskoepf/pythia-1.4b-gpt4all-pretrain', use_system_tag=False, system_property_dropout=0.5, system_add_length=False, wandb_entity='open-assistant', local_rank=-1, deepspeed=False, resume_from_checkpoint=False, show_dataset_stats=False, world_size=1)
RNG seed: 2703368087
You are using a model of type gpt_neox to instantiate a model of type gpt_neox_reward_model. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at andreaskoepf/pythia-1.4b-gpt4all-pretrain were not used when initializing GPTNeoXRewardModel: ['embed_out.weight']
- This IS expected if you are initializing GPTNeoXRewardModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPTNeoXRewardModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPTNeoXRewardModel were not initialized from the model checkpoint at andreaskoepf/pythia-1.4b-gpt4all-pretrain and are newly initialized: ['out_proj.bias', 'out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Number of trainable parameters: 1311M
Found cached dataset webgpt_comparisons (/root/.cache/huggingface/datasets/openai___webgpt_comparisons/default/0.0.0/8b5d5879cdc98c4c0099af6053dffe8d504588d43d3b11f1b1ec223ab1e8db0a)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 309.25it/s]
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.4
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
WARNING:root:Custom sampler found!
/opt/conda/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
0%| | 0/2114 [00:00<?, ?it/s]You're using a GPTNeoXTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2371: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
warnings.warn(
Traceback (most recent call last):
File "/workspace/OA/model/model_training/trainer_rm.py", line 334, in <module>
main()
File "/workspace/OA/model/model_training/trainer_rm.py", line 328, in main
trainer.train(resume_from_checkpoint=training_conf.resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1639, in train
return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1906, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2652, in training_step
loss = self.compute_loss(model, inputs)
File "/workspace/OA/model/model_training/trainer_rm.py", line 50, in compute_loss
logits = model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/opt/conda/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/OA/model/model_training/models/reward_model.py", line 63, in forward
outputs = self.gpt_neox(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 553, in forward
outputs = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 320, in forward
attention_layer_outputs = self.attention(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/OA/model/model_training/models/patching.py", line 36, in _patched_attn_forward
out = module.old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 116, in forward
qkv = self.query_key_value(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workspace/OA/model/model_training/trainer_rm.py:334 in <module> │
│ │
│ 331 │
│ 332 │
│ 333 if __name__ == "__main__": │
│ ❱ 334 │ main() │
│ 335 │
│ │
│ /workspace/OA/model/model_training/trainer_rm.py:328 in main │
│ │
│ 325 │ │ tokenizer=tokenizer, │
│ 326 │ │ compute_metrics=compute_metrics, │
│ 327 │ ) │
│ ❱ 328 │ trainer.train(resume_from_checkpoint=training_conf.resume_from_checkpoint) │
│ 329 │ trainer.save_model() │
│ 330 │ tokenizer.save_pretrained(output_dir) │
│ 331 │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1639 in train │
│ │
│ 1636 │ │ inner_training_loop = find_executable_batch_size( │
│ 1637 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1638 │ │ ) │
│ ❱ 1639 │ │ return inner_training_loop( │
│ 1640 │ │ │ args=args, │
│ 1641 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1642 │ │ │ trial=trial, │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1906 in _inner_training_loop │
│ │
│ 1903 │ │ │ │ │ with model.no_sync(): │
│ 1904 │ │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1905 │ │ │ │ else: │
│ ❱ 1906 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1907 │ │ │ │ │
│ 1908 │ │ │ │ if ( │
│ 1909 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2652 in training_step │
│ │
│ 2649 │ │ │ return loss_mb.reduce_mean().detach().to(self.args.device) │
│ 2650 │ │ │
│ 2651 │ │ with self.compute_loss_context_manager(): │
│ ❱ 2652 │ │ │ loss = self.compute_loss(model, inputs) │
│ 2653 │ │ │
│ 2654 │ │ if self.args.n_gpu > 1: │
│ 2655 │ │ │ loss = loss.mean() # mean() to average on multi-gpu parallel training │
│ │
│ /workspace/OA/model/model_training/trainer_rm.py:50 in compute_loss │
│ │
│ 47 │ def compute_loss(self, model, inputs, return_logits=False): │
│ 48 │ │ batch, cu_lens = inputs │
│ 49 │ │ │
│ ❱ 50 │ │ logits = model( │
│ 51 │ │ │ input_ids=batch["input_ids"], │
│ 52 │ │ │ attention_mask=batch["attention_mask"], │
│ 53 │ │ ).logits │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:171 in forward │
│ │
│ 168 │ │ │ if len(self.device_ids) == 1: │
│ 169 │ │ │ │ return self.module(*inputs[0], **kwargs[0]) │
│ 170 │ │ │ replicas = self.replicate(self.module, self.device_ids[:len(inputs)]) │
│ ❱ 171 │ │ │ outputs = self.parallel_apply(replicas, inputs, kwargs) │
│ 172 │ │ │ return self.gather(outputs, self.output_device) │
│ 173 │ │
│ 174 │ def replicate(self, module, device_ids): │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py:181 in parallel_apply │
│ │
│ 178 │ │ return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim) │
│ 179 │ │
│ 180 │ def parallel_apply(self, replicas, inputs, kwargs): │
│ ❱ 181 │ │ return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)]) │
│ 182 │ │
│ 183 │ def gather(self, outputs, output_device): │
│ 184 │ │ return gather(outputs, output_device, dim=self.dim) │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py:89 in parallel_apply │
│ │
│ 86 │ for i in range(len(inputs)): │
│ 87 │ │ output = results[i] │
│ 88 │ │ if isinstance(output, ExceptionWrapper): │
│ ❱ 89 │ │ │ output.reraise() │
│ 90 │ │ outputs.append(output) │
│ 91 │ return outputs │
│ 92 │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/_utils.py:644 in reraise │
│ │
│ 641 │ │ │ # If the exception takes multiple arguments, don't try to │
│ 642 │ │ │ # instantiate since we don't know how to │
│ 643 │ │ │ raise RuntimeError(msg) from None │
│ ❱ 644 │ │ raise exception │
│ 645 │
│ 646 │
│ 647 def _get_available_device_type(): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/OA/model/model_training/models/reward_model.py", line 63, in forward
outputs = self.gpt_neox(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 553, in forward
outputs = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 320, in forward
attention_layer_outputs = self.attention(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/OA/model/model_training/models/patching.py", line 36, in _patched_attn_forward
out = module.old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/transformers/models/gpt_neox/modeling_gpt_neox.py", line 116, in forward
qkv = self.query_key_value(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /workspace/OA/model/model_training/wandb/offline-run-20230628_014420-b8s8ohdb
wandb: Find logs at: ./wandb/offline-run-20230628_014420-b8s8ohdb/logs
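In case it helps narrow things down: the traceback goes through torch/nn/parallel/data_parallel.py, so it looks like the Hugging Face Trainer is falling back to nn.DataParallel because all four GPUs are visible but no distributed launcher is used. Below is a sketch of what I could try next; I'm assuming the CUDA_VISIBLE_DEVICES workaround behaves as usual and that the --deepspeed flag works together with the configs/zero_config.json referenced in the config dump above (it currently shows deepspeed=False).

# 1) Sanity check on a single GPU, so DataParallel is never involved
CUDA_VISIBLE_DEVICES=0 python trainer_rm.py --configs defaults_rm oasst-rm-1-pythia-1.4b

# 2) Multi-GPU via the DeepSpeed launcher (one process per GPU instead of DataParallel)
deepspeed trainer_rm.py --configs defaults_rm oasst-rm-1-pythia-1.4b --deepspeed

If option 1 trains without the device-mismatch error, that would point at the DataParallel path rather than the model or data code.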