A Multi-label Text Classification Project to match manuscripts or research articles with suitable journals based on the title and abstract of the manuscript.
Match Manuscript is a multi-label text classification project that classifies a manuscript or research article by subject category and, based on that classification, finds the best-matching journals for it. It uses DistilRoBERTa, a state-of-the-art transformer language model from Hugging Face. A large dataset was scraped, collected, and cleaned before the model was trained, and the final model is deployed as a web app on Render. The project can be divided into three main parts: data collection, model training, and deployment.
The primary data collection was done by scraping the Directory of Open Access Journals (DOAJ), an index of journal and article metadata. From this website, the journal name, article title, and abstract fields were extracted using the Selenium framework. A secondary dataset, which contains the journals' subject category information, was collected from my previous project on Journal Ranking. The final training data was obtained by cleaning, processing, and merging the two datasets. This data pre-processing is performed in the data processing notebook on Google Colab.
The final dataset had the following fields:
- Journal Name
- Title
- Abstract
- Subject Category
The dataset had a total of 35k+ rows, and the subject category field, which is the multi-labelled target, contained 61 distinct labels.
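For a multi-labelled target like this, each row's categories are typically encoded as a multi-hot vector. A toy sketch (the label names are illustrative; the real dataset has 61 categories):

```python
# Build an index over the label vocabulary (illustrative subset).
labels = ["Medicine", "Biology", "Engineering"]
label_index = {name: i for i, name in enumerate(labels)}

def multi_hot(categories):
    """Encode a list of category names as a 0/1 vector over all labels."""
    vec = [0] * len(labels)
    for c in categories:
        vec[label_index[c]] = 1
    return vec

y = multi_hot(["Medicine", "Engineering"])  # -> [1, 0, 1]
```

Unlike one-hot encoding, several positions can be 1 at once, which is what lets a manuscript belong to multiple subject categories simultaneously.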
The model training is done using the DistilRoBERTa model from the Hugging Face Transformers library. This is a pre-trained, distilled model that follows the BERT architecture, and it is integrated into the project via transfer learning. After processing and tokenization, the input text data is fed into the model. Training and fine-tuning are performed in the model training notebook. The model is then quantized and compressed using ONNX Runtime in the model inference notebook and exported as the final .onnx file, which can be found in the Hugging Face Spaces repo. Hugging Face Transformers, PyTorch, fastai, and ONNX Runtime were used from training through export.
The final model had the following architecture:
- Input Layer (Tokenizer)
- DistilRoBERTa Model (Similar to Transformer architecture)
- Linear Layer
- Sigmoid Activation Layer
- Output Layer
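The sigmoid layer is what makes the head multi-label: each output unit produces an independent probability for one subject category, thresholded (commonly at 0.5) to pick the predicted labels. A small illustration with made-up logits:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy logits from the linear layer, one per subject category.
logits = [2.0, -1.5, 0.1]

# Sigmoid maps each logit to an independent probability in (0, 1);
# any label whose probability clears the threshold is predicted.
probs = [sigmoid(z) for z in logits]
predicted = [i for i, p in enumerate(probs) if p >= 0.5]  # -> [0, 2]
```

By contrast, a softmax output would force the probabilities to sum to 1 and allow only one winning category per manuscript.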
After training, the model accuracy was ~97%, the train loss was ~0.03, and the validation loss was ~0.06. The model was trained for a total of 10 epochs on a Google Colab GPU.
After ONNX Runtime quantization and compression, the model size was reduced from ~315 MB to ~79 MB. The final F1 scores were: F1 Macro = 0.68 and F1 Micro = 0.58.
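Macro F1 averages the per-label F1 scores equally (so rare categories count as much as common ones), while micro F1 pools all true/false positives and negatives across labels. A minimal sketch of both for multi-hot label matrices, written from the standard definitions rather than the project's evaluation code:

```python
def f1_scores(y_true, y_pred):
    """Macro- and micro-averaged F1 over multi-hot label matrices."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    for t_row, p_row in zip(y_true, y_pred):
        for j in range(n_labels):
            if p_row[j] and t_row[j]:
                tp[j] += 1
            elif p_row[j] and not t_row[j]:
                fp[j] += 1
            elif t_row[j] and not p_row[j]:
                fn[j] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels
    micro = f1(sum(tp), sum(fp), sum(fn))
    return macro, micro
```

Macro F1 exceeding micro F1, as reported above, suggests the model does comparatively well on many of the smaller categories.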
Figure: Transformer model architecture (arXiv:1706.03762)
The final model is hosted on Hugging Face Spaces. A web app with a minimal UI is built around the hosted model using Flask, and the app is deployed on Render web services.
The deployed app can be accessed HERE: https://manuscript-matcher-beta.onrender.com
Note: The app is still in beta. It does not suggest the best-matching journals yet; it only classifies the manuscript into subject categories. The journal-matching feature will be added in the future.
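The Flask layer is essentially a thin route that accepts the manuscript text and returns predicted categories. A minimal sketch, assuming a hypothetical `classify` helper in place of the real call to the hosted model (the route name and JSON shape are also assumptions, not the deployed app's actual API):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify(title, abstract):
    # Hypothetical stand-in for the real inference call; the deployed
    # app forwards the text to the model hosted on Hugging Face Spaces.
    text = f"{title} {abstract}".lower()
    return ["Medicine"] if "clinical" in text else ["General"]

@app.route("/classify", methods=["POST"])
def classify_route():
    # Read title/abstract from the request body and return the labels.
    data = request.get_json(force=True)
    labels = classify(data.get("title", ""), data.get("abstract", ""))
    return jsonify({"subject_categories": labels})
```

Keeping the model behind a hosted inference endpoint keeps this web tier lightweight, which suits Render's free-tier web services.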
- Python
- Selenium
- Pandas
- Hugging Face Transformers
- Hugging Face Spaces
- PyTorch
- fastai
- ONNX Runtime
- Flask
- Bootstrap
- Render
To reproduce the results, follow the steps below:
- Clone the GitHub project repo.
git clone https://github.com/abir0/Manuscript-Matcher.git
- Install the requirements.
pip install -r requirements.txt
- Run the scraper to collect the data from DOAJ. The scraped data will be saved in the data folder.
cd src
python scraper.py
- Download all the data files from the data folder and place them in the Google Drive folder named Manuscript-Matcher.
- Go to the data processing notebook and run all the cells to clean and preprocess the data.
- Go to the model training notebook and run all the cells to train the model.
- Go to the model inference notebook and run all the cells to evaluate, quantize, and export the final model.
- Clone the Hugging Face Spaces repository, download the final .onnx model into the repo, and set up a Hugging Face Space to host the model.
git clone https://huggingface.co/spaces/abir0/Manuscript-Matcher
- Clone the Flask web app repository and configure the Hugging Face Spaces API.
git clone https://github.com/abir0/Manuscript-Matcher-Beta.git
- Set up a Render Web Service and deploy the Flask web app on the Render platform.
- Manuscript Matcher Project: https://github.com/abir0/Manuscript-Matcher
- DOAJ Articles Search Endpoint: https://doaj.org/search/articles
- Journal Ranking Project: https://github.com/abir0/SJR-Journal-Ranking
- Hugging Face Spaces: https://huggingface.co/spaces/abir0/Manuscript-Matcher
- Hugging Face Spaces Direct Link: https://abir0-manuscript-matcher.hf.space
- Flask Web App Repository: https://github.com/abir0/Manuscript-Matcher-Beta
- Deployed Web App on Render: https://manuscript-matcher-beta.onrender.com
Distributed under the MIT License. See LICENSE for more information.