license | title | sdk | App link | Fine tuned Model |
---|---|---|---|---|
mit | Plagiarism-detector-using-Fine-tuned-smolLM | streamlit | | jatinmehra/smolLM-fine-tuned-for-plagiarism-detection |
This repository contains a Streamlit-based web application that uses a fine-tuned LLM model for detecting plagiarism between two documents. The application processes two uploaded PDF files, extracts their content, and classifies them as either plagiarized or non-plagiarized based on a fine-tuned language model.
The app leverages a custom fine-tuned version of SmolLM (135M parameters) trained on the MIT Plagiarism Detection Dataset for improved performance in identifying textual similarity. The model produces a binary classification output indicating whether two given documents are plagiarized or original.
- Upload PDF Files: Upload two PDF files that the app will analyze for similarity.
- Text Extraction: Extracts raw text from the uploaded PDFs using PyMuPDF.
- Model-Based Detection: Compares the content of the PDFs and classifies them as plagiarized or non-plagiarized using the fine-tuned language model.
- User-Friendly Interface: Built with Streamlit for an intuitive and interactive experience.
- Base Model: `HuggingFaceTB/SmolLM2-135M-Instruct`
- Fine-tuned Model Name: `jatinmehra/smolLM-fine-tuned-for-plagiarism-detection`
- Language: English
- Task: Text Classification (Binary)
- Performance Metrics: Accuracy, F1 Score, Recall
- License: MIT
The fine-tuning dataset, the MIT Plagiarism Detection Dataset, provides labeled sentence pairs where each pair is marked as plagiarized or non-plagiarized. This label is used for binary classification, making it well-suited for detecting sentence-level similarity.
- Architecture: The model was modified for sequence classification with two labels.
- Optimizer: AdamW with a learning rate of 2e-5.
- Loss Function: Cross-Entropy Loss.
- Batch Size: 16
- Epochs: 3
- Padding: Custom padding token to align with SmolLM requirements.
The model achieved 99.66% accuracy on the training dataset, highlighting its effectiveness in identifying plagiarized content.
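To make the training recipe above concrete, here is a minimal fine-tuning sketch that follows the listed settings (sequence-classification head with two labels, AdamW at 2e-5, EOS reused as the padding token). It is an illustration, not the author's exact training script; the dataset handling, column packing, and maximum length are assumptions.

```python
# Minimal fine-tuning sketch based on the settings above; not the original
# training script. Dataset handling and max_length are assumptions.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# SmolLM2 has no dedicated padding token, so reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

optimizer = AdamW(model.parameters(), lr=2e-5)

def encode_pair(sentence1: str, sentence2: str):
    # Each labelled sentence pair is packed into a single input sequence.
    return tokenizer(sentence1, sentence2, truncation=True,
                     padding="max_length", max_length=256, return_tensors="pt")

# The training loop (3 epochs, batch size 16) iterates over the labelled
# pairs, takes the cross-entropy loss from
# model(**batch, labels=labels).loss, and steps the optimizer.
```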
- Load and Initialize: The application loads the fine-tuned model and tokenizer locally.
- PDF Upload: Users upload two PDF documents they want to compare.
- Text Extraction: Text is extracted from each PDF using the PyMuPDF library.
- Preprocessing: The extracted text is tokenized and preprocessed for model compatibility.
- Classification: The model processes the inputs and returns a prediction of `1` (plagiarized) or `0` (non-plagiarized); see the sketch after this list.
- Output: The result is displayed on the Streamlit interface.
- Streamlit for running the web application interface.
- Transformers from Hugging Face for handling model and tokenizer.
- PyMuPDF (`fitz`) for PDF text extraction (a minimal extraction helper is sketched below).
- Torch for model inference on CPU or GPU.
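For reference, a minimal PyMuPDF extraction helper might look like the following; the function name is an illustrative assumption. In the Streamlit app the uploaded file arrives as an in-memory stream, so `fitz.open(stream=uploaded_file.read(), filetype="pdf")` would replace the path-based call.

```python
# Minimal PyMuPDF text-extraction helper; the function name is illustrative.
import fitz  # PyMuPDF

def extract_text(pdf_path: str) -> str:
    """Concatenate the plain text of every page in the PDF."""
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)
```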
- Clone the repository:

  ```bash
  git clone https://github.com/jatinmehra119/Plagiarism-detector-using-smolLM-.git
  cd Plagiarism-detector-using-smolLM-
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the fine-tuned model files and place them in the `model/` directory.

Run the Streamlit app from the terminal:

```bash
streamlit run app.py
```

- Open the application in your browser (default at `http://localhost:8501`).
- Upload two PDF files you wish to compare for plagiarism.
- View the text from each document and the resulting plagiarism detection output.
The model was evaluated on both training and test data, showing robust results:
- Training Set Accuracy: 99.66%
- Test Set Accuracy: 100%
- F1 Score: 1.0
- Recall: 1.0
These metrics indicate the model's high effectiveness in detecting plagiarism.
The model and tokenizer are saved locally, but they can also be loaded directly from Hugging Face. This setup allows easy loading for custom applications or further fine-tuning.
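For example, loading the published checkpoint straight from the Hub could look like this (the repository ID is the fine-tuned model named above):

```python
# Load the fine-tuned model and tokenizer directly from the Hugging Face Hub.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo_id = "jatinmehra/smolLM-fine-tuned-for-plagiarism-detection"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
```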
This project is licensed under the MIT License, making it free for both personal and commercial use.
I appreciate your interest!
GitHub | Email: jatinmehra@outlook.in | LinkedIn | Portfolio