Index out of range when training on custom dataset #483

Open
TayTT opened this issue May 18, 2024 · 1 comment

TayTT commented May 18, 2024

The dataset consists of numbers and periods, representing tokens. The issue occurs with both the quickstart version and the regular version, after calling
python train.py config/custom_file.py
where custom_file.py is nearly identical to train_shakespeare_char.py.

The only modification to the prepare file was the addition of a line stripping a comma from the beginning and end of the input batch, since that too had caused issues.
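Roughly, my prepare step looks like the sketch below (following the shakespeare_char prepare.py; the file names and the 90/10 split are placeholders, not my exact code):

```python
import pickle
import numpy as np

# read the raw token file and strip a stray comma from each end,
# since leading/trailing commas were causing problems
with open('input.txt', 'r') as f:
    data = f.read().strip().strip(',')

# build a character-level vocabulary, as in shakespeare_char
chars = sorted(list(set(data)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

# 90/10 train/val split, encode, and write the .bin files
n = len(data)
train_ids = np.array(encode(data[:int(n * 0.9)]), dtype=np.uint16)
val_ids = np.array(encode(data[int(n * 0.9):]), dtype=np.uint16)
train_ids.tofile('train.bin')
val_ids.tofile('val.bin')

# save the meta so train.py picks up the right vocab_size
with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': vocab_size, 'itos': itos, 'stoi': stoi}, f)
```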

The error trace:
```
Traceback (most recent call last):
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\train.py", line 264, in <module>
    losses = estimate_loss()
             ^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\train.py", line 224, in estimate_loss
    logits, loss = model(X, Y)
                   ^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\model.py", line 177, in forward
    tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\modules\sparse.py", line 163, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index out of range in self
```

I'd be grateful for any assistance or tips; I can't seem to figure it out on my own.

@AIVERSON33

For anyone having this issue: you are appending an end-of-text token that is greater than the size of the vocabulary. For GPT-2 it is automatically 50256; for most tokenizers it is vocab_size - 1. So you have to alter prepare.py in openwebtext to fit your encoding.
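A quick way to confirm whether this is the problem is to compare the largest token id in the prepared .bin file with the model's vocab_size (a sketch, assuming the raw uint16 layout that nanoGPT's prepare scripts write; the path and the vocab_size value are placeholders):

```python
import numpy as np

# nanoGPT's prepare scripts write token ids as raw uint16
ids = np.memmap('data/openwebtext/train.bin', dtype=np.uint16, mode='r')
print('max token id:', ids.max())

# the embedding table only has rows 0 .. vocab_size - 1, so any id
# >= vocab_size (e.g. an eot token of 50256 against a smaller custom
# vocabulary) raises "IndexError: index out of range in self"
vocab_size = 50304  # placeholder: use whatever your model/config actually has
assert int(ids.max()) < vocab_size, 'a token id exceeds the embedding table'
```

If the assertion fails, either enlarge vocab_size in your config to cover the highest id, or change the id appended in prepare.py so it falls inside your vocabulary (for most custom tokenizers, vocab_size - 1, as noted above).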
