Welcome to the Academic Text Simplification Model project. This repository contains the implementation and analysis of a machine learning model designed to simplify complex academic texts. The goal is to make academic content more accessible by transforming it into a simpler form while retaining the original meaning.
Academic writing can be challenging for many readers due to its complex structure and specialized vocabulary. This project aims to simplify such texts using advanced natural language processing (NLP) techniques, making them more accessible to a broader audience including students, non-native English speakers, and individuals with reading difficulties.
- Data Preprocessing: Efficient methods for cleaning and preparing academic texts.
- Model Training: Utilizes state-of-the-art NLP models for text simplification.
- Evaluation Metrics: Comprehensive evaluation using metrics such as BLEU, ROUGE, and SARI.
- User Interface: A simple interface for users to input text and receive simplified versions.
To get started with this project, follow the steps below:
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/academic-text-simplification.git
   cd academic-text-simplification
   ```

2. Create a virtual environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```
To use the academic text simplification model, follow these steps:
1. Prepare your dataset: ensure your data is in the correct format as specified in the `data/` directory.

2. Run the Jupyter notebook:

   ```bash
   jupyter notebook Academic_Text_Simplification_Model.ipynb
   ```

   This notebook includes code cells for preprocessing data, training the model, and evaluating its performance.

3. Input text for simplification: use the provided user interface to input academic text and receive simplified output.
The dataset used in this project includes a collection of academic texts along with their simplified versions. Detailed information about the dataset can be found in the `data/` directory.
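For illustration, a parallel corpus of complex/simplified pairs could be loaded as shown below. The file name and column names here are hypothetical; the actual schema is documented in the `data/` directory.

```python
# Hypothetical sketch of loading aligned complex/simplified sentence pairs.
# File name and column names are assumptions, not the repository's actual schema.
import pandas as pd

df = pd.read_csv("data/academic_pairs.csv")   # hypothetical file
complex_texts = df["complex"].tolist()        # original academic sentences (assumed column)
simple_texts = df["simple"].tolist()          # human-simplified references (assumed column)
print(f"Loaded {len(df)} aligned pairs")
```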
The model leverages advanced NLP techniques, including transformer-based architectures such as BERT and GPT, to effectively simplify text. Key components include:
- Tokenization: Breaking down text into manageable pieces.
- Embedding: Representing text in numerical format.
- Attention Mechanisms: Capturing contextual relationships within the text.
- Decoder: Generating simplified text.
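As a minimal sketch of this pipeline, the snippet below runs a generic sequence-to-sequence transformer from the Hugging Face `transformers` library. The checkpoint name and prompt are placeholders, not the project's fine-tuned model, which is configured in the notebook.

```python
# Minimal sketch: tokenize -> encode with attention -> decode simplified text.
# "t5-small" is a placeholder checkpoint, not the model trained in this project.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "simplify: The proliferation of heterogeneous methodologies complicates replication."
inputs = tokenizer(text, return_tensors="pt", truncation=True)       # tokenization + embedding lookup
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)   # attention-based encoding + decoding
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```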
The model's performance is evaluated using several metrics:
- BLEU: Measures the overlap between the generated text and the reference text.
- ROUGE: Measures n-gram and longest-common-subsequence overlap between the generated text and reference texts.
- SARI: Designed specifically for text simplification; it compares the output against both the source and the reference simplifications, rewarding words that are correctly added, kept, and deleted.
The results demonstrate the model's ability to produce simplified text that retains the original meaning while being easier to understand.
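These metrics can be computed, for example, with the Hugging Face `evaluate` library, as sketched below; the notebook may compute them differently, and the example sentences are purely illustrative.

```python
# Sketch of scoring a simplified sentence with BLEU, ROUGE, and SARI.
import evaluate  # pip install evaluate (plus sacrebleu, rouge_score, nltk backends)

sources = ["The committee's deliberations were unnecessarily protracted."]
predictions = ["The committee's discussions took too long."]
references = [["The committee talked for longer than needed."]]

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
sari = evaluate.load("sari").compute(sources=sources, predictions=predictions, references=references)

print(bleu, rouge, sari)
```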
For any inquiries or further information, please contact:
- Name: Daniyal Ahmed
- Email: daniyalahmedmj@gmail.com
- GitHub: yourusername