A Multi-label Text Classification Project to match manuscripts or research articles with suitable journals based on the title and abstract of the manuscript.
Match Manuscript is a multi-label text classification project that classifies a manuscript or research article by subject category and, based on that classification, finds the best-matching journals for it. It uses DistilRoBERTa, a state-of-the-art transformer language model from Hugging Face. A large dataset was scraped, collected, and cleaned before the model was trained, and the final model is deployed as a web app on Render. The project can be divided into three main parts: data collection, model training, and deployment.
The primary data collection was done by scraping the Directory of Open Access Journals (DOAJ), an index of journal and article metadata. From this website, the journal name, article title, and abstract fields were extracted using the Selenium framework. A secondary dataset, which contains the journals' subject category information, was collected from my previous project on Journal Ranking. The final training data was obtained by cleaning, processing, and merging the two datasets. This data pre-processing is performed in the data processing notebook on Google Colab.
The final dataset had the following fields:
- Journal Name
- Title
- Abstract
- Subject Category
The dataset had a total of 35k+ rows, and the subject category field, which is the multi-labelled target, contained 61 distinct labels.
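For a multi-labelled target like this, each row's categories are typically encoded as a multi-hot vector. A toy sketch (the label names are illustrative; the real dataset has 61 categories):

```python
# Build an index over the label vocabulary (illustrative subset).
labels = ["Medicine", "Biology", "Engineering"]
label_index = {name: i for i, name in enumerate(labels)}

def multi_hot(categories):
    """Encode a list of category names as a 0/1 vector over all labels."""
    vec = [0] * len(labels)
    for c in categories:
        vec[label_index[c]] = 1
    return vec

y = multi_hot(["Medicine", "Engineering"])  # -> [1, 0, 1]
```

Unlike one-hot encoding, several positions can be 1 at once, which is what lets a manuscript belong to multiple subject categories simultaneously.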
The model training is done using the DistilRoBERTa model from the Hugging Face Transformers library. This is a pre-trained, distilled model that follows the BERT architecture, and it is integrated into the project via transfer learning. After processing and tokenization, the input text data is fed into the model. Training and fine-tuning are performed in the model training notebook. The model is then quantized and compressed using ONNX Runtime in the model inference notebook and exported as the final .onnx file, which can be found in the Hugging Face Spaces repo. Hugging Face Transformers, PyTorch, fastai, and ONNX Runtime were used from training through export.
The final model had the following architecture:
- Input Layer (Tokenizer)
- DistilRoBERTa Model (Similar to Transformer architecture)
- Linear Layer
- Sigmoid Activation Layer
- Output Layer
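The sigmoid layer is what makes the head multi-label: each output unit produces an independent probability for one subject category, thresholded (commonly at 0.5) to pick the predicted labels. A small illustration with made-up logits:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy logits from the linear layer, one per subject category.
logits = [2.0, -1.5, 0.1]

# Sigmoid maps each logit to an independent probability in (0, 1);
# any label whose probability clears the threshold is predicted.
probs = [sigmoid(z) for z in logits]
predicted = [i for i, p in enumerate(probs) if p >= 0.5]  # -> [0, 2]
```

By contrast, a softmax output would force the probabilities to sum to 1 and allow only one winning category per manuscript.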
After training, the model accuracy was ~97%, the train loss was ~0.03, and the validation loss was ~0.06. The model was trained for a total of 10 epochs on a Google Colab GPU.
After ONNX Runtime quantization and compression, the model size was reduced from ~315 MB to ~79 MB. The final F1 scores were: F1 Macro = 0.68 and F1 Micro = 0.58.
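Macro F1 averages the per-label F1 scores equally (so rare categories count as much as common ones), while micro F1 pools all true/false positives and negatives across labels. A minimal sketch of both for multi-hot label matrices, written from the standard definitions rather than the project's evaluation code:

```python
def f1_scores(y_true, y_pred):
    """Macro- and micro-averaged F1 over multi-hot label matrices."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    for t_row, p_row in zip(y_true, y_pred):
        for j in range(n_labels):
            if p_row[j] and t_row[j]:
                tp[j] += 1
            elif p_row[j] and not t_row[j]:
                fp[j] += 1
            elif t_row[j] and not p_row[j]:
                fn[j] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    macro = sum(f1(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels
    micro = f1(sum(tp), sum(fp), sum(fn))
    return macro, micro
```

Macro F1 exceeding micro F1, as reported above, suggests the model does comparatively well on many of the smaller categories.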
Figure: Transformer model architecture (arXiv:1706.03762)
The final model is hosted on Hugging Face Spaces. A web app with a minimal UI is built around the hosted model using Flask, and the app is deployed on Render web services.
The deployed app can be accessed HERE: https://manuscript-matcher-beta.onrender.com
Note: The app is still in beta. It does not suggest the best-matching journals yet; it only classifies the manuscript into subject categories. The journal-matching feature will be added in the future.
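The Flask layer is essentially a thin route that accepts the manuscript text and returns predicted categories. A minimal sketch, assuming a hypothetical `classify` helper in place of the real call to the hosted model (the route name and JSON shape are also assumptions, not the deployed app's actual API):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify(title, abstract):
    # Hypothetical stand-in for the real inference call; the deployed
    # app forwards the text to the model hosted on Hugging Face Spaces.
    text = f"{title} {abstract}".lower()
    return ["Medicine"] if "clinical" in text else ["General"]

@app.route("/classify", methods=["POST"])
def classify_route():
    # Read title/abstract from the request body and return the labels.
    data = request.get_json(force=True)
    labels = classify(data.get("title", ""), data.get("abstract", ""))
    return jsonify({"subject_categories": labels})
```

Keeping the model behind a hosted inference endpoint keeps this web tier lightweight, which suits Render's free-tier web services.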
- Python
- Selenium
- Pandas
- Hugging Face Transformers
- Hugging Face Spaces
- PyTorch
- fastai
- ONNX Runtime
- Flask
- Bootstrap
- Render
To reproduce the results, follow the steps below:
- Clone the GitHub project repo.
git clone https://github.com/abir0/Manuscript-Matcher.git
- Install the requirements.
pip install -r requirements.txt
- Run the scraper to collect the data from DOAJ. The scraped data will be saved in the data folder.
cd src
python scraper.py
- Download all the data files from the data folder and place them in the Google Drive folder named Manuscript-Matcher.
- Go to the data processing notebook and run all the cells to clean and preprocess the data.
- Go to the model training notebook and run all the cells to train the model.
- Go to the model inference notebook and run all the cells to evaluate, quantize, and export the final model.
- Clone the Hugging Face Spaces repository, download the final .onnx model into the repo, and set up a Hugging Face Space to host the model.
git clone https://huggingface.co/spaces/abir0/Manuscript-Matcher
- Clone the Flask web app repository and configure the Hugging Face Spaces API.
git clone https://github.com/abir0/Manuscript-Matcher-Beta.git
- Set up a Render Web Service and deploy the Flask web app on the Render platform.
- Manuscript Matcher Project: https://github.com/abir0/Manuscript-Matcher
- DOAJ Articles Search Endpoint: https://doaj.org/search/articles
- Journal Ranking Project: https://github.com/abir0/SJR-Journal-Ranking
- Hugging Face Spaces: https://huggingface.co/spaces/abir0/Manuscript-Matcher
- Hugging Face Spaces Direct Link: https://abir0-manuscript-matcher.hf.space
- Flask Web App Repository: https://github.com/abir0/Manuscript-Matcher-Beta
- Deployed Web App on Render: https://manuscript-matcher-beta.onrender.com
Distributed under the MIT License. See LICENSE for more information.