Image Feature Extraction Solution (note: the .env and .idea folders have been removed to keep the repository size under 500 MB)
- Initial Approach
- Exploratory Data Analysis (EDA)
- OCR Experiments
- Machine Learning Approach
- Final Solution: Qwen2-VL-for-OCR-VQA
- Conclusion
Our team tackled the problem of feature extraction from images using a multi-step process. We began with exploratory data analysis to understand the dataset, followed by experiments with various OCR technologies and machine learning models. Ultimately, we found success using a pre-trained vision-language model.
Important Note: Throughout our entire process, we did not use any external APIs or gateways. All processing was done locally using publicly available tools and models.
We started by analyzing the distribution of entity types in our dataset. We found 8 unique entity classes, distributed as follows:

| entity_name | count |
| --- | --- |
| depth | 45,127 |
| height | 43,597 |
| item_volume | 7,682 |
| item_weight | 102,786 |
| maximum_weight_recommendation | 3,263 |
| voltage | 9,466 |
| wattage | 7,755 |
| width | 44,183 |
To create a balanced dataset for our initial experiments, we selected 2,000 samples from each entity type.
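A minimal sketch of this sampling step, assuming the training labels live in a CSV with an `entity_name` column (the file name is illustrative):

```python
import pandas as pd

# Load the training labels; "train.csv" is an illustrative file name.
train_df = pd.read_csv("train.csv")

# Draw 2,000 rows per entity type (every class above has at least that many);
# random_state keeps the draw reproducible.
balanced_df = (
    train_df.groupby("entity_name", group_keys=False)
    .apply(lambda group: group.sample(n=2000, random_state=42))
    .reset_index(drop=True)
)

print(balanced_df["entity_name"].value_counts())  # 2,000 per class
```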
We conducted experiments with various OCR (Optical Character Recognition) technologies:
- Tesseract
- EasyOCR
- KerasOCR
After testing on a small sample of 100 files, we found that EasyOCR performed best. We then ran OCR on all files in the test set.
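For reference, a minimal sketch of the kind of EasyOCR call used in these experiments (the image path is illustrative):

```python
import easyocr

# The reader downloads its detection/recognition models on first use;
# set gpu=False if no CUDA device is available.
reader = easyocr.Reader(["en"], gpu=True)

# readtext returns a list of (bounding_box, text, confidence) tuples.
for bbox, text, confidence in reader.readtext("sample_image.jpg"):
    print(f"{confidence:.2f}  {text}")
```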
Simultaneously with our OCR efforts, we explored a machine learning approach:
- We used VGG16 for feature extraction from images.
- We split the entity values into two parts: units and numeric values.
- We attempted to build a regression and classification ensemble using PyCaret.
However, this approach took longer than expected to build and train.
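For reference, a condensed sketch of the two building blocks above, assuming Keras's bundled VGG16 and a simple `<number> <unit>` format for entity values (the helper names are illustrative):

```python
import re

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.utils import img_to_array, load_img

# VGG16 without its classification head maps each image to a 512-dim vector.
feature_model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_path):
    img = load_img(image_path, target_size=(224, 224))
    batch = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    return feature_model.predict(batch, verbose=0).flatten()  # shape: (512,)

def split_entity_value(value):
    """Split e.g. '34.5 gram' into (34.5, 'gram'); the format is assumed."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z ]+)", value)
    return (float(match.group(1)), match.group(2).strip()) if match else (None, None)

# The image features and the split targets can then feed PyCaret models:
# regression for the numeric part, classification for the unit.
```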
As our machine learning approach was time-consuming, we experimented with pre-trained models available in public libraries. We tried several models, including:
- LLaVA
- Eagle-V5-13B
- Qwen2-VL-for-OCR-VQA
The Qwen2-VL-for-OCR-VQA model proved to be the fastest and most effective for our task. This model combines OCR capabilities with visual question answering, allowing us to extract entity values from images efficiently.
Key Point: Qwen2-VL-for-OCR-VQA is a publicly available model distributed through the Hugging Face Transformers library. We did not use any proprietary or closed-source solutions.
- We downloaded all test images overnight to ensure we had local access to the data.
- We used the Qwen2-VL-for-OCR-VQA model, loaded via the Hugging Face Transformers library, to process each image and extract the required entity values.
- We applied regex post-processing to ensure our output matched the required format and passed the sanity check (a sketch of both steps follows this list).
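A minimal sketch of the inference and post-processing steps using the Hugging Face Transformers API; the checkpoint name, prompt wording, and output format are assumptions, not necessarily the exact configuration used:

```python
import re

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Illustrative checkpoint; the write-up refers to the model as Qwen2-VL-for-OCR-VQA.
MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

def extract_entity(image_path, entity_name):
    image = Image.open(image_path).convert("RGB")
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text",
             "text": f"What is the {entity_name} of the product in this image? "
                     "Answer with a number and a unit."},
        ],
    }]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=32)
    answer = processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
    # Regex post-processing: keep only "<number> <unit>" so the prediction
    # matches the required format and passes the sanity check.
    match = re.search(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)", answer)
    return f"{float(match.group(1))} {match.group(2).lower()}" if match else ""
```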
Our journey to solve this problem involved multiple approaches and technologies, all of which were publicly available and processed locally. While our initial machine learning approach showed promise, the pre-trained Qwen2-VL-for-OCR-VQA model, loaded through the Hugging Face Transformers library, provided a more efficient and accurate solution.
This experience highlights the importance of exploring various methods and being open to leveraging existing open-source technologies in solving complex problems. It also demonstrates that powerful solutions can be implemented without relying on external APIs or gateways, using only publicly available tools and models.