As the Transformer architecture scaled well in Natural Language Processing, the same architecture was applied to images by splitting each image into small patches and treating them as tokens. The result was the Vision Transformer (ViT).
This project leverages the Hugging Face `datasets` library to download and process a Pokemon image classification dataset, then uses it to fine-tune a pre-trained ViT.
Vision Transformer (ViT) Architecture
Read more about the Vision Transformer (ViT) model in this paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
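To make the patch-as-token idea concrete, here is a minimal NumPy sketch (illustrative only, not code from this repo) that splits a 224x224 image into non-overlapping 16x16 patches, producing the 196 tokens a ViT-Base model consumes:

```python
import numpy as np

# A dummy RGB image: height x width x channels
image = np.random.rand(224, 224, 3)
patch_size = 16

# Split the image into non-overlapping 16x16 patches and
# flatten each patch into a single vector ("token").
h_patches = image.shape[0] // patch_size  # 14
w_patches = image.shape[1] // patch_size  # 14
patches = image.reshape(h_patches, patch_size, w_patches, patch_size, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * 3)

print(patches.shape)  # (196, 768): 196 tokens, each a flattened 16x16x3 patch
```

In the actual model, each flattened patch is projected to the hidden dimension by a learned linear embedding before being fed to the Transformer encoder.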
- Github: Dungfx15018 and EveTLynn
- Email: dungtrandinh513@gmail.com and linhtong1201@gmail.com
- Github: bangoc123
- Email: protonxai@gmail.com
- Step 1: Create a Python virtual environment
python -m venv {your_venv_name}
- Step 2: Activate the virtual environment
.\{your_venv_name}\Scripts\activate
- Step 3: Install dependencies
pip install -r requirements.txt
- This project makes use of the pokemon-classification dataset from Hugging Face, which includes 6,991 images. The Pokemon labels are annotated in folder format.
- The dataset can be easily loaded with the `datasets.load_dataset` function, as sketched below.
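A minimal loading sketch (the repo ID and column names here are assumptions; substitute the exact dataset ID this project uses):

```python
from datasets import load_dataset

# "<hf-user>/pokemon-classification" is a hypothetical repo ID --
# replace it with the dataset ID this project actually downloads.
dataset = load_dataset("<hf-user>/pokemon-classification")

print(dataset)  # splits and their sizes
print(dataset["train"].features["label"].names)  # class names (assumes a "label" column)
```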
Training script:
python train.py --batch-size ${batch-size} --learning-rate ${learning-rate} --num-train-epochs ${num-train-epochs}
Example:
python train.py --batch-size 16 --learning-rate 2e-5 --num-train-epochs 10
There are some important arguments you should consider when running the script:
- `batch-size`: the size of each batch of data during training.
- `learning-rate`: the learning rate for the training process, controlling the step size during optimization.
- `num-train-epochs`: the number of complete passes through the entire training dataset during training.
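As a rough sketch of how these arguments typically map onto a Hugging Face `Trainer` fine-tuning loop (the checkpoint, dataset ID, and column names below are assumptions, not necessarily the exact code in `train.py`):

```python
import torch
from datasets import load_dataset
from transformers import (Trainer, TrainingArguments,
                          ViTForImageClassification, ViTImageProcessor)

# Checkpoint and dataset ID are assumptions, not necessarily what train.py uses.
checkpoint = "google/vit-base-patch16-224-in21k"
ds = load_dataset("<hf-user>/pokemon-classification")  # hypothetical repo ID
labels = ds["train"].features["label"].names           # assumes a "label" column

processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),  # fresh classification head for the Pokemon classes
)

def transform(batch):
    # Resize and normalize PIL images into the pixel_values the model expects
    inputs = processor([img for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

def collate_fn(examples):
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "labels": torch.tensor([ex["labels"] for ex in examples]),
    }

args = TrainingArguments(
    output_dir="vit-pokemon",
    per_device_train_batch_size=16,  # --batch-size
    learning_rate=2e-5,              # --learning-rate
    num_train_epochs=10,             # --num-train-epochs
    remove_unused_columns=False,     # keep the "image" column for the transform
)

trainer = Trainer(model=model, args=args, data_collator=collate_fn,
                  train_dataset=ds["train"].with_transform(transform))
trainer.train()
```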
Prediction script:
python predict.py --test-data ${link_to_test_data}
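For a quick single-image sanity check, a fine-tuned checkpoint can also be queried through the `transformers` pipeline API (the checkpoint path below is an assumption about where `train.py` saves its output):

```python
from transformers import pipeline

# "vit-pokemon" is the assumed output directory written during training
classifier = pipeline("image-classification", model="vit-pokemon")

# Returns the top classes with scores, e.g. [{"label": ..., "score": ...}, ...]
print(classifier("path/to/test_image.png"))
```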
***** train metrics *****
epoch = 22.8571
total_flos = 9237299936GF
train_loss = 3.4517
train_runtime = 1:20:27.19
train_samples_per_second = 26.516
train_steps_per_second = 0.414
***** eval metrics *****
epoch = 22.8571
eval_accuracy = 0.8499
eval_loss = 2.7948
eval_runtime = 0:00:18.49
eval_samples_per_second = 75.625
eval_steps_per_second = 4.757
The evaluation accuracy peaked at 87.78% around step 1750 and then started to decline, while the loss for both training and evaluation continued to decrease, suggesting that the model's performance may still improve with additional training steps. Note that the training loss is higher than the evaluation loss, which is the opposite of the usual overfitting signature; the gap more likely reflects regularization such as dropout being active only during training.
Training loss, evaluation accuracy, and evaluation loss plots (viewable in TensorBoard).
Train the model and view the evaluation metrics with TensorBoard on Colab.