protonx-tf-08/Pokemon-Images-Classification-with-Vision-Transformer-ViT

Pokemon Images Classification with Vision Transformer (ViT)

Project Overview

Because the Transformer architecture scaled well in natural language processing, the same architecture was applied to images by splitting each image into small patches and treating the patches as tokens. The result is the Vision Transformer (ViT).
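To make the "patches as tokens" idea concrete, here is a minimal sketch of the patch-splitting step only. It assumes a 224x224 RGB input and 16x16 patches (the patch size in the paper's title); the function name `image_to_patches` is hypothetical and is not taken from this repository. A real ViT additionally applies a learned linear projection and position embeddings, which are omitted here.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    grid_h, grid_w = height // patch_size, width // patch_size
    # (grid_h, patch, grid_w, patch, C) -> (grid_h, grid_w, patch, patch, C)
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch: each row is one "token" before projection.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * channels)


# Assumed input size for illustration: 224x224 RGB.
image = np.random.rand(224, 224, 3).astype(np.float32)
tokens = image_to_patches(image)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```

Each row of `tokens` plays the role that a word embedding plays in a text Transformer.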

This project uses the datasets library to download and process a Pokemon image classification dataset, then fine-tunes a pre-trained ViT on it.

Figure: Vision Transformer (ViT) Architecture

For more on the Vision Transformer (ViT) model, see the paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Authors:

Advisors:

I. Set up environment

  • Step 1: Create a Python virtual environment
python -m venv {your_venv_name}
  • Step 2: Activate the virtual environment (Windows command shown; on Linux or macOS, run source {your_venv_name}/bin/activate instead)
.\{your_venv_name}\Scripts\activate
  • Step 3: Install dependencies
pip install -r requirements.txt

II. Set up your dataset

  • This project uses the pokemon-classification dataset from Hugging Face, which includes 6991 images. The Pokemon are annotated in folder format.
  • The dataset can be loaded with the datasets.load_dataset function.
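A minimal sketch of the loading step is shown below. `datasets.load_dataset` is the real Hugging Face function, but the wrapper name `load_pokemon_dataset` is hypothetical, and this README does not give the full Hub identifier or configuration name of the dataset, so both are left as arguments for you to supply.

```python
from typing import Optional


def load_pokemon_dataset(repo_id: str, config: Optional[str] = None):
    """Download a dataset from the Hugging Face Hub and return it.

    repo_id and config are assumptions left to the caller: the README
    names the dataset "pokemon-classification" but not its full Hub ID.
    """
    # Imported lazily so this sketch can be read without datasets installed.
    from datasets import load_dataset

    return load_dataset(repo_id, name=config)
```

Calling `load_pokemon_dataset("<hub-id-of-the-dataset>")` needs network access on first use; later calls reuse the local cache.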

III. Training Process

Training script:

python train.py --batch-size ${batch-size} --learning-rate ${learning-rate} --num-train-epochs ${num-train-epochs}

Example:

python train.py --batch-size 16 --learning-rate 2e-5 --num-train-epochs 10

The most important arguments to consider when running the script are:

  • batch-size: Specifies the size of each batch of data during training.
  • learning-rate: Sets the learning rate for the training process, controlling the step size during optimization.
  • num-train-epochs: Determines the number of complete passes through the entire training dataset during training.
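The three arguments above could be parsed as in the following sketch. It is an illustration with `argparse`, not the actual contents of train.py; the default values are assumptions copied from the example command, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three training arguments described above."""
    parser = argparse.ArgumentParser(description="Fine-tune a pre-trained ViT")
    # Defaults are assumptions taken from the example command, not from train.py.
    parser.add_argument("--batch-size", type=int, default=16,
                        help="size of each batch of data during training")
    parser.add_argument("--learning-rate", type=float, default=2e-5,
                        help="step size used by the optimizer")
    parser.add_argument("--num-train-epochs", type=int, default=10,
                        help="number of full passes over the training set")
    return parser


# Parse the example command's arguments; argparse maps dashes to underscores.
args = build_parser().parse_args(
    ["--batch-size", "16", "--learning-rate", "2e-5", "--num-train-epochs", "10"]
)
print(args.batch_size, args.learning_rate, args.num_train_epochs)
```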

IV. Predict Process

python predict.py --test-data ${link_to_test_data}

V. Result and Comparison

***** train metrics *****
  epoch                    =      22.8571
  total_flos               = 9237299936GF
  train_loss               =       3.4517
  train_runtime            =   1:20:27.19
  train_samples_per_second =       26.516
  train_steps_per_second   =        0.414

***** eval metrics *****
  epoch                   =    22.8571
  eval_accuracy           =     0.8499
  eval_loss               =     2.7948
  eval_runtime            = 0:00:18.49
  eval_samples_per_second =     75.625
  eval_steps_per_second   =      4.757

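The evaluation accuracy peaked at 87.78% at step 1750 and then started to decline; the final evaluation accuracy is 84.99%. The training and evaluation losses both continued to decrease, which suggests that the model may still improve with additional training steps. The reported training loss (3.4517) is higher than the evaluation loss (2.7948). That is not the usual signature of overfitting, where the evaluation loss rises above the training loss; it may simply reflect a train_loss that is averaged over the whole run, as in Hugging Face Trainer output. The drop in evaluation accuracy after step 1750 is still worth monitoring as a possible early sign of overfitting.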
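The train metrics above can be cross-checked with a little arithmetic. The sketch below uses only the reported values and derives the approximate number of optimizer steps, the number of samples consumed per step, and the epoch at which step 1750 falls. The results are approximate because the logged throughput figures are rounded.

```python
# Values copied from the train metrics reported above.
train_runtime = "1:20:27.19"
train_samples_per_second = 26.516
train_steps_per_second = 0.414
epochs = 22.8571
peak_step = 1750  # step at which evaluation accuracy peaked

# Convert H:MM:SS.ss to seconds.
hours, minutes, seconds = train_runtime.split(":")
runtime_seconds = int(hours) * 3600 + int(minutes) * 60 + float(seconds)

total_steps = runtime_seconds * train_steps_per_second
samples_per_step = train_samples_per_second / train_steps_per_second
peak_epoch = peak_step / total_steps * epochs

print(f"total steps      ~ {total_steps:.0f}")
print(f"samples per step ~ {samples_per_step:.0f}")
print(f"peak at epoch    ~ {peak_epoch:.1f}")
```

This gives about 2000 steps and about 64 samples per step, and places the accuracy peak near epoch 20. The README does not state how the samples-per-step figure relates to the --batch-size argument.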

Figures: Training Loss, Evaluation Accuracy, Evaluation Loss

Train and view the evaluation metrics with TensorBoard on Colab.