As the Transformer architecture scaled well in Natural Language Processing, the same architecture was applied to images by splitting each image into small patches and treating them as tokens. The result was the Vision Transformer (ViT).
This project leverages the Hugging Face `datasets` library to download and process a Pokemon image classification dataset, then uses it to fine-tune a pre-trained ViT.
Vision Transformer (ViT) Architecture
Read more about the Vision Transformer (ViT) model in this paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
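To make the patch-as-token idea concrete, here is a minimal NumPy sketch (illustrative only, not code from this repo) that splits a 224x224 image into non-overlapping 16x16 patches, producing the 196 tokens a ViT-Base model consumes:

```python
import numpy as np

# A dummy RGB image: height x width x channels
image = np.random.rand(224, 224, 3)
patch_size = 16

# Split the image into non-overlapping 16x16 patches and
# flatten each patch into a single vector ("token").
h_patches = image.shape[0] // patch_size  # 14
w_patches = image.shape[1] // patch_size  # 14
patches = image.reshape(h_patches, patch_size, w_patches, patch_size, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * 3)

print(patches.shape)  # (196, 768): 196 tokens, each a flattened 16x16x3 patch
```

In the actual model, each flattened patch is projected to the hidden dimension by a learned linear embedding before being fed to the Transformer encoder.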
- Github: Dungfx15018 and EveTLynn
- Email: dungtrandinh513@gmail.com and linhtong1201@gmail.com
- Github: bangoc123
- Email: protonxai@gmail.com
- Step 1: Create a Python virtual environment
python -m venv {your_venv_name}
- Step 2: Activate the virtual environment
.\{your_venv_name}\Scripts\activate
- Step 3: Install dependencies
pip install -r requirements.txt
- This project makes use of the pokemon-classification dataset from Hugging Face, which includes 6,991 images. The Pokemon labels are annotated in folder format.
- The dataset can be easily loaded with the `datasets.load_dataset` function, as sketched below.
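A minimal loading sketch (the repo ID and column names here are assumptions; substitute the exact dataset ID this project uses):

```python
from datasets import load_dataset

# "<hf-user>/pokemon-classification" is a hypothetical repo ID --
# replace it with the dataset ID this project actually downloads.
dataset = load_dataset("<hf-user>/pokemon-classification")

print(dataset)  # splits and their sizes
print(dataset["train"].features["label"].names)  # class names (assumes a "label" column)
```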
Training script:
python train.py --batch-size ${batch-size} --learning-rate ${learning-rate} --num-train-epochs ${num-train-epochs}
Example:
python train.py --batch-size 16 --learning-rate 2e-5 --num-train-epochs 10
There are some important arguments you should consider when running the script:
- `batch-size`: the size of each batch of data during training.
- `learning-rate`: the learning rate for the training process, controlling the step size during optimization.
- `num-train-epochs`: the number of complete passes through the entire training dataset during training.
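As a rough sketch of how these arguments typically map onto a Hugging Face `Trainer` fine-tuning loop (the checkpoint, dataset ID, and column names below are assumptions, not necessarily the exact code in `train.py`):

```python
import torch
from datasets import load_dataset
from transformers import (Trainer, TrainingArguments,
                          ViTForImageClassification, ViTImageProcessor)

# Checkpoint and dataset ID are assumptions, not necessarily what train.py uses.
checkpoint = "google/vit-base-patch16-224-in21k"
ds = load_dataset("<hf-user>/pokemon-classification")  # hypothetical repo ID
labels = ds["train"].features["label"].names           # assumes a "label" column

processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),  # fresh classification head for the Pokemon classes
)

def transform(batch):
    # Resize and normalize PIL images into the pixel_values the model expects
    inputs = processor([img for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

def collate_fn(examples):
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "labels": torch.tensor([ex["labels"] for ex in examples]),
    }

args = TrainingArguments(
    output_dir="vit-pokemon",
    per_device_train_batch_size=16,  # --batch-size
    learning_rate=2e-5,              # --learning-rate
    num_train_epochs=10,             # --num-train-epochs
    remove_unused_columns=False,     # keep the "image" column for the transform
)

trainer = Trainer(model=model, args=args, data_collator=collate_fn,
                  train_dataset=ds["train"].with_transform(transform))
trainer.train()
```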
Prediction script:
python predict.py --test-data ${link_to_test_data}
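For a quick single-image sanity check, a fine-tuned checkpoint can also be queried through the `transformers` pipeline API (the checkpoint path below is an assumption about where `train.py` saves its output):

```python
from transformers import pipeline

# "vit-pokemon" is the assumed output directory written during training
classifier = pipeline("image-classification", model="vit-pokemon")

# Returns the top classes with scores, e.g. [{"label": ..., "score": ...}, ...]
print(classifier("path/to/test_image.png"))
```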
***** train metrics *****
epoch = 22.8571
total_flos = 9237299936GF
train_loss = 3.4517
train_runtime = 1:20:27.19
train_samples_per_second = 26.516
train_steps_per_second = 0.414
***** eval metrics *****
epoch = 22.8571
eval_accuracy = 0.8499
eval_loss = 2.7948
eval_runtime = 0:00:18.49
eval_samples_per_second = 75.625
eval_steps_per_second = 4.757
The evaluation accuracy peaked at 87.78% around step 1750 and then started to decline, while the loss for both training and evaluation continued to decrease, suggesting that the model's performance may still improve with additional training steps. Note that the training loss is higher than the evaluation loss, which is the opposite of the usual overfitting signature; the gap more likely reflects regularization such as dropout being active only during training.
Training loss, evaluation accuracy, and evaluation loss plots (viewable in TensorBoard).
Train the model and view the evaluation metrics with TensorBoard on Colab.