Manuscript Matcher

A Multi-label Text Classification Project to match manuscripts or research articles with suitable journals based on the title and abstract of the manuscript.

Table of Contents

  • About The Project
  • Data Collection
  • Model Training
  • App Deployment
  • Built With
  • Steps to Reproduce
  • Links
  • License
  • Contact

About The Project

Manuscript Matcher is a multi-label text classification project that classifies a manuscript or research article by subject category and, from those categories, finds the best-matching journals for the manuscript. It uses DistilRoBERTa, a state-of-the-art transformer language model from Hugging Face. A large dataset was scraped, collected, and cleaned before the model was trained, and the final model was deployed as a web app on Render. The project has three main parts: Data Collection, Model Training, and Deployment.

Data Collection

The primary data collection was done by scraping the Directory of Open Access Journals (DOAJ), an index of journal and article metadata. From this website, the journal name and each article's title and abstract were extracted using the Selenium framework. A secondary dataset, taken from my previous Journal Ranking project, provided each journal's subject category information. The final training data was obtained by cleaning, processing, and merging the two datasets; this pre-processing is performed in the data processing notebook on Google Colab.
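
The merge step can be sketched as follows. The field names and the journal-name matching key here are illustrative assumptions, not the project's actual schema:

```python
# Sketch of the merge step: join scraped article records (journal, title,
# abstract) with subject categories from the secondary Journal Ranking
# dataset, matching on a normalized journal name.

def merge_datasets(articles, journal_categories):
    """Attach each article's subject categories by matching on journal name."""
    lookup = {j["journal"].strip().lower(): j["categories"]
              for j in journal_categories}
    merged = []
    for art in articles:
        key = art["journal"].strip().lower()
        if key in lookup:  # drop articles whose journal has no category info
            merged.append({**art, "categories": lookup[key]})
    return merged

articles = [
    {"journal": "PLOS ONE", "title": "A study", "abstract": "..."},
    {"journal": "Unknown Journal", "title": "B study", "abstract": "..."},
]
journal_categories = [{"journal": "plos one ", "categories": ["Science"]}]
merged = merge_datasets(articles, journal_categories)
print(merged)  # one row: the article whose journal has category info
```

Articles whose journal is missing from the category dataset are dropped, which is one reason the final dataset is smaller than the raw scrape.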

Dataset Description

The final dataset had the following fields:

  • Journal Name
  • Title
  • Abstract
  • Subject Category

The dataset had 35k+ rows, and the subject category field, which is the multi-label target, contained 61 distinct labels.
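
A multi-label target like this is typically multi-hot encoded before training: each of the 61 categories becomes one position in a 0/1 vector, and a row can have several 1s at once. A minimal sketch, using placeholder category names:

```python
# Multi-hot encoding for a multi-label target. The real data has 61 labels;
# these four names are placeholders for illustration.
LABELS = ["Agriculture", "Biology", "Chemistry", "Medicine"]

def multi_hot(categories, labels=LABELS):
    index = {name: i for i, name in enumerate(labels)}
    vec = [0] * len(labels)
    for c in categories:
        vec[index[c]] = 1
    return vec

print(multi_hot(["Biology", "Medicine"]))  # [0, 1, 0, 1]
```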


DOAJ Website

Model Training

The model training is done with the DistilRoBERTa model from the Hugging Face Transformers library, a distilled pre-trained model that follows the BERT architecture. It is integrated into the project via transfer learning: the input text is processed and tokenized, then fed into the model. Training and fine-tuning are performed in the model training notebook. The model is then quantized and compressed with ONNX Runtime in the model inference notebook and exported as the final .onnx file, which can be found in the Hugging Face Spaces repo. Hugging Face Transformers, PyTorch, fastai, and ONNX Runtime were used from training through export.
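
Conceptually, tokenization turns each title-plus-abstract string into fixed-length integer IDs plus an attention mask marking real tokens versus padding. DistilRoBERTa actually uses a byte-level BPE tokenizer; this toy word-level sketch with a made-up vocabulary only illustrates the idea:

```python
# Toy word-level tokenizer: map text to integer IDs, truncate/pad to a fixed
# length, and build the attention mask. The vocabulary is invented; the real
# model uses a byte-level BPE vocabulary of ~50k subword tokens.
VOCAB = {"<pad>": 0, "<unk>": 1, "deep": 2, "learning": 3, "for": 4, "nlp": 5}

def encode(text, max_len=8):
    ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.lower().split()]
    ids = ids[:max_len]
    attention_mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids += [VOCAB["<pad>"]] * (max_len - len(ids))
    return ids, attention_mask

ids, mask = encode("Deep learning for NLP")
print(ids)   # [2, 3, 4, 5, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

In the real pipeline these ID tensors are what the DistilRoBERTa encoder consumes.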

Model Description

The final model had the following architecture:

  • Input Layer (Tokenizer)
  • DistilRoBERTa Model (Similar to Transformer architecture)
  • Linear Layer
  • Sigmoid Activation Layer
  • Output Layer

After training, the model accuracy was ~97%, the train loss was ~0.03, and the validation loss was ~0.06. The model was trained for a total of 10 epochs on a Google Colab GPU.
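
The sigmoid activation layer is what makes the head multi-label: each of the 61 logits is squashed independently, and every category whose probability clears a threshold is predicted, whereas softmax would force exactly one category. A minimal sketch with made-up logits and labels:

```python
import math

# Independent sigmoid per label: several categories can fire at once.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(logits, labels, threshold=0.5):
    """Return every label whose sigmoid probability clears the threshold."""
    return [name for name, z in zip(labels, logits) if sigmoid(z) >= threshold]

labels = ["Medicine", "Biology", "Physics"]
print(predict([2.1, 0.3, -1.7], labels))  # ['Medicine', 'Biology']
```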

After ONNX Runtime quantization and compression, the model size was reduced from ~315 MB to ~79 MB. The final F1 scores were: F1 Macro = 0.68 and F1 Micro = 0.58.
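
The gap between the two F1 scores reflects how the averages treat rare labels: macro F1 weights all 61 labels equally, while micro F1 pools true/false positive counts across labels, so frequent labels dominate. A small two-label illustration with invented confusion counts:

```python
# Macro vs micro F1. Counts below are invented to show the effect of one
# frequent, well-predicted label and one rare, poorly-predicted label.
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# per-label confusion counts: (true positives, false positives, false negatives)
counts = {"Medicine": (90, 10, 10), "Philately": (1, 4, 5)}

macro = sum(f1(*c) for c in counts.values()) / len(counts)  # equal label weight
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro = f1(tp, fp, fn)  # pooled counts, dominated by the frequent label

print(round(macro, 2), round(micro, 2))  # 0.54 0.86
```

Note the reverse pattern here versus the project's scores (macro 0.68 > micro 0.58); which average is higher depends on whether the model does relatively better on rare or on frequent labels.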


Transformer Model Architecture (arXiv:1706.03762)

App Deployment

The final model is hosted on Hugging Face Spaces. A web app with a minimal UI is built around the hosted model using Flask, and the app is deployed on Render web services.
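
The deployment split can be sketched as below: the Flask app holds no model weights and simply forwards the title and abstract to the model hosted on Spaces, then returns the predicted categories. The route, payload shape, and the stubbed model call are hypothetical placeholders, not the project's actual API:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def query_hosted_model(title, abstract):
    # In the real app this would call the Hugging Face Spaces API over HTTP;
    # stubbed here so the sketch runs without network access.
    return ["Medicine"] if "clinical" in abstract.lower() else ["Science"]

@app.route("/classify", methods=["POST"])
def classify():
    data = request.get_json()
    categories = query_hosted_model(data["title"], data["abstract"])
    return jsonify({"categories": categories})

# Exercise the route locally with Flask's built-in test client.
client = app.test_client()
resp = client.post("/classify",
                   json={"title": "Trial", "abstract": "A clinical study"})
print(resp.get_json())  # {'categories': ['Medicine']}
```

Keeping inference on Spaces means the Render service stays lightweight; only the thin Flask layer and its templates need to be deployed there.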


Hugging Face Spaces App


Deployed Web App

The deployed app can be accessed here: https://manuscript-matcher-beta.onrender.com

Note: The app is still in beta. It does not yet suggest the best-matching journals; it only classifies the manuscript into subject categories. The journal-matching feature will be added in the future.

Built With

  • Selenium
  • Hugging Face Transformers
  • PyTorch
  • fastai
  • ONNX Runtime
  • Flask
  • Render

Steps to Reproduce

To reproduce the results, follow the steps below:

  1. Clone the GitHub project repo.
git clone https://github.com/abir0/Manuscript-Matcher.git
  2. Install the requirements.
pip install -r requirements.txt
  3. Run the scraper to collect the data from DOAJ. The scraped data will be saved in the data folder.
cd src
python scraper.py
  4. Download all the data files from the data folder and place them in a Google Drive folder named Manuscript-Matcher.
  5. Go to the data processing notebook and run all the cells to clean and preprocess the data.
  6. Go to the model training notebook and run all the cells to train the model.
  7. Go to the model inference notebook and run all the cells to evaluate, quantize, and export the final model.
  8. Clone the Hugging Face Spaces repository, download the final .onnx model into the repo, and set up a Hugging Face Space to host the model.
git clone https://huggingface.co/spaces/abir0/Manuscript-Matcher
  9. Clone the Flask web app repository and configure the Hugging Face Spaces API.
git clone https://github.com/abir0/Manuscript-Matcher-Beta.git
  10. Set up a Render Web Service and deploy the Flask web app on the Render platform.

Links

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Abir Hassan

Gmail | GitHub | LinkedIn
