Index out of range when training on custom dataset #483

Open
TayTT opened this issue May 18, 2024 · 1 comment

TayTT commented May 18, 2024

The dataset consists of numbers and periods, representing tokens. The issue occurs with both the quickstart version and the regular version, after calling
python train.py config/custom_file.py
where custom_file.py is nearly identical to train_shakespeare_char.py.

The only modification to the prepare file was the addition of a line stripping a comma from the beginning and end of the input batch, since that too had caused issues.
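Roughly, my prepare step looks like the sketch below (following the shakespeare_char prepare.py; the file names and the 90/10 split are placeholders, not my exact code):

```python
import pickle
import numpy as np

# read the raw token file and strip a stray comma from each end,
# since leading/trailing commas were causing problems
with open('input.txt', 'r') as f:
    data = f.read().strip().strip(',')

# build a character-level vocabulary, as in shakespeare_char
chars = sorted(list(set(data)))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

# 90/10 train/val split, encode, and write the .bin files
n = len(data)
train_ids = np.array(encode(data[:int(n * 0.9)]), dtype=np.uint16)
val_ids = np.array(encode(data[int(n * 0.9):]), dtype=np.uint16)
train_ids.tofile('train.bin')
val_ids.tofile('val.bin')

# save the meta so train.py picks up the right vocab_size
with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': vocab_size, 'itos': itos, 'stoi': stoi}, f)
```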

The error trace:
```
Traceback (most recent call last):
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\train.py", line 264, in <module>
    losses = estimate_loss()
             ^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\train.py", line 224, in estimate_loss
    logits, loss = model(X, Y)
                   ^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\model.py", line 177, in forward
    tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\modules\module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\modules\module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\modules\sparse.py", line 163, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "C:\Users\Tay\PycharmProjects\nanogpt_test\nanoGPT\venv\Lib\site-packages\torch\nn\functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index out of range in self
```

I'd be grateful for any assistance or tips; I can't seem to figure it out on my own.

@AIVERSON33

For anyone having this issue: you are appending an end-of-text token that is greater than the size of the vocabulary. For GPT-2 it is automatically 50256; for most tokenizers it is vocab_size - 1. So you have to alter prepare.py in openwebtext to fit your encoding.
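A quick way to confirm whether this is the problem is to compare the largest token id in the prepared .bin file with the model's vocab_size (a sketch, assuming the raw uint16 layout that nanoGPT's prepare scripts write; the path and the vocab_size value are placeholders):

```python
import numpy as np

# nanoGPT's prepare scripts write token ids as raw uint16
ids = np.memmap('data/openwebtext/train.bin', dtype=np.uint16, mode='r')
print('max token id:', ids.max())

# the embedding table only has rows 0 .. vocab_size - 1, so any id
# >= vocab_size (e.g. an eot token of 50256 against a smaller custom
# vocabulary) raises "IndexError: index out of range in self"
vocab_size = 50304  # placeholder: use whatever your model/config actually has
assert int(ids.max()) < vocab_size, 'a token id exceeds the embedding table'
```

If the assertion fails, either enlarge vocab_size in your config to cover the highest id, or change the id appended in prepare.py so it falls inside your vocabulary (for most custom tokenizers, vocab_size - 1, as noted above).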
