A project by Mohan Krishna G R, AI/ML Intern @ Infosys Springboard, Summer 2024.
- Problem Statement
- Project Statement
- Approach to Solution
- Background Research
- Solution
- Workflow
- Data Collection
- Abstractive Text Summarization
- Extractive Text Summarization
- Testing
- Deployment
- Containerization
- CI/CD Pipeline
- Developing an automated text summarization system that can accurately and efficiently condense large bodies of text into concise summaries is essential for enhancing business operations.
- This project aims to deploy NLP techniques to create a robust text summarization tool capable of handling various types of documents across different domains.
- The system should deliver high-quality summaries that retain the core information and contextual meaning of the original text.
- Text Summarization focuses on converting large bodies of text into a few sentences summing up the gist of the larger text.
- There is a wide variety of applications for text summarization including News Summary, Customer Reviews, Research Papers, etc.
- This project aims to understand the importance of text summarization and apply different techniques to fulfill the purpose.
- Figure: Intended Plan
- Literature Review
- Selected Deep Learning Architecture
- Workflow for Abstractive Text Summarizer:
- Workflow for Extractive Text Summarizer:
- Data Preprocessing: implemented in
src/data_preprocessing
- Data collection from different sources:
- CNN/Daily Mail: News
- BillSum: Legal
- ArXiv: Scientific
- DialogSum: Conversations
- Data integration ensures robust, multi-domain data, including news articles, legal documents (acts and judgments), scientific papers, and conversations.
- Validated the data through Data Statistics and Exploratory Data Analysis (EDA) using Frequency Plotting for every data source.
- Data cleansing optimized for NLP tasks: null-record removal, lowercasing, punctuation removal, stop-word removal, and lemmatization.
- Data splitting with scikit-learn into training, validation, and test sets, saved in CSV format (a minimal sketch of these steps follows).
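A minimal sketch of the cleansing and splitting steps described above; the file paths and column names are illustrative assumptions, not the exact notebook code:

```python
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(text: str) -> str:
    text = text.lower()                          # lowercasing
    text = re.sub(r"[^\w\s]", " ", text)         # punctuation removal
    tokens = [lemmatizer.lemmatize(t) for t in text.split()
              if t not in stop_words]            # stop-word removal + lemmatization
    return " ".join(tokens)

df = pd.read_csv("data/combined.csv").dropna()   # remove null records
df["text"] = df["text"].map(clean)

# train / validation / test split, saved as CSV
train, temp = train_test_split(df, test_size=0.2, random_state=42)
val, test = train_test_split(temp, test_size=0.5, random_state=42)
for name, split in [("train", train), ("val", val), ("test", test)]:
    split.to_csv(f"data/{name}.csv", index=False)
```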
- Training:
- Selected transformer architecture for ABSTRACTIVE SUMMARIZATION: fine-tuning a pre-trained model.
- Chose Facebook's BART Large model for its performance and efficient trainable parameter count.
- 406,291,456 trainable parameters.
- Methods:
- Native PyTorch Implementation
- Trainer API Implementation
- Trained the model with a manual training loop and evaluation loop in PyTorch (a minimal sketch follows this method's notes below). Implemented in:
src/model.ipynb
- Model Evaluation: Source code:
src/evaluation.ipynb
- Obtained inconsistent results during inference.
- ROUGE-1 (F-measure) = 0.018
- A suspected tensor error during training with method 1 may explain the inconsistency of the model's output.
- Rejected for further deployment.
- An alternative approach was therefore needed.
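For reference, a minimal self-contained sketch of a manual PyTorch training and evaluation loop for BART; the toy data and hyperparameters are illustrative assumptions, not the notebook's actual code:

```python
# Sketch only: toy data stands in for the real CSV splits used in src/model.ipynb.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large").to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

docs = ["The quick brown fox jumps over the lazy dog near the river bank.",
        "BART is a sequence-to-sequence model pre-trained as a denoising autoencoder."]
refs = ["A fox jumps over a dog.", "BART is a pre-trained seq2seq model."]
batch = tokenizer(docs, text_target=refs, padding=True, truncation=True,
                  return_tensors="pt").to(device)
batch["labels"][batch["labels"] == tokenizer.pad_token_id] = -100  # ignore padding in the loss

model.train()
for epoch in range(2):                        # manual training loop
    loss = model(**batch).loss                # labels present, so the model returns a loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()                                  # manual evaluation loop
with torch.no_grad():
    generated = model.generate(batch["input_ids"], max_length=32)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```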
- Utilized the Trainer API from Hugging Face for optimized transformer model training (a minimal fine-tuning sketch follows below). Implemented in:
src/bart.ipynb
- The model was trained on the whole dataset for 10 epochs, taking 26:24:22 (HH:MM:SS) over 125,420 steps.
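A minimal sketch of fine-tuning facebook/bart-large with the Trainer API; the hyperparameters, file paths, and text/summary column names are illustrative assumptions, not the exact contents of src/bart.ipynb:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

model_name = "facebook/bart-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Assumed CSV splits with "text" and "summary" columns
data = load_dataset("csv", data_files={"train": "data/train.csv",
                                       "validation": "data/val.csv"})

def tokenize(batch):
    enc = tokenizer(batch["text"], max_length=1024, truncation=True)
    enc["labels"] = tokenizer(text_target=batch["summary"],
                              max_length=128, truncation=True)["input_ids"]
    return enc

tokenized = data.map(tokenize, batched=True,
                     remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bart-summarizer",        # checkpoints land here
    num_train_epochs=10,
    per_device_train_batch_size=4,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("bart-summarizer/final")
```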
- Evaluation: performance measured with ROUGE scores (a minimal scoring sketch appears after the results below). Source code:
src/rouge.ipynb
- Method 2's results outperformed those of method 1.
- ROUGE-1 (F-measure) = 61.32, a benchmark-grade score.
- Significantly higher than typical scores reported for state-of-the-art models on common datasets.
- For comparison, GPT-4's reported ROUGE-1 (F-measure) for text summarization is 63.22.
- Selected for further deployment.
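A minimal sketch of how ROUGE scores can be computed with the Hugging Face evaluate package; the checkpoint path, test file, and column names are assumptions, not the exact code in src/rouge.ipynb:

```python
import evaluate
import pandas as pd
from transformers import pipeline

rouge = evaluate.load("rouge")                              # needs the rouge_score package
summarizer = pipeline("summarization", model="path/to/fine-tuned-bart")  # assumed checkpoint dir

test = pd.read_csv("data/test.csv").head(100)               # assumed test split with text/summary columns
predictions = [out["summary_text"]
               for out in summarizer(test["text"].tolist(), truncation=True)]

scores = rouge.compute(predictions=predictions, references=test["summary"].tolist())
print(scores)                                               # rouge1 / rouge2 / rougeL F-measures
```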
- Comparative analysis showed significant improvement in performance after fine-tuning. Source code:
src/compare.ipynb
- Rather than another computationally intensive deep-learning model, a rule-based approach gives an efficient solution for extractive summarization: combining the TF-IDF matrix with KMeans clustering.
- This extends topic modeling to the multiple lower-level groups (clusters of sentences) embedded in a single document, operating at the individual document and cluster level.
- The sentence closest to the centroid (based on Euclidean distance) is selected as the representative sentence for that cluster.
- Implementation: preprocess the text, extract features with TF-IDF, and summarize by selecting representative sentences (a minimal sketch follows the score below).
- Source code for implementation & evaluation:
src/Extractive_Summarization.ipynb
- ROUGE-1 (F-measure) = 24.71
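A minimal sketch of the TF-IDF + KMeans extractive approach described above (not the exact notebook code):

```python
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

nltk.download("punkt")
nltk.download("punkt_tab")      # required by newer NLTK releases

def extractive_summary(text: str, n_clusters: int = 3) -> str:
    sentences = sent_tokenize(text)
    if len(sentences) <= n_clusters:
        return text
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(tfidf)
    chosen = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # Euclidean distance from each member sentence to the cluster centroid
        dists = np.linalg.norm(tfidf[members].toarray() - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])
    # Keep the selected sentences in their original order
    return " ".join(sentences[i] for i in sorted(chosen))
```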
- Implemented a web-based interface with the Gradio library to test the model's inference (a minimal sketch follows).
- Source Code:
src/interface.ipynb
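A minimal sketch of such a Gradio test harness; the model path is an assumed placeholder for the fine-tuned checkpoint:

```python
import gradio as gr
from transformers import pipeline

summarizer = pipeline("summarization", model="path/to/fine-tuned-bart")  # assumed checkpoint path

def summarize(text: str) -> str:
    return summarizer(text, truncation=True)[0]["summary_text"]

demo = gr.Interface(fn=summarize,
                    inputs=gr.Textbox(lines=15, label="Input text"),
                    outputs=gr.Textbox(label="Summary"),
                    title="Text Summarization")
demo.launch()
```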
- File Structure:
summarizer/
- Developed with the FastAPI framework to handle URLs, files, and direct text input (a minimal sketch follows the endpoint list below).
- Source Code:
summarizer/app.py
- Endpoints:
- Root Endpoint
- Summarize URL
- Summarize File
- Summarize Text
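A minimal sketch of the service structure; the route paths, checkpoint, and helper comments are illustrative assumptions rather than the exact contents of summarizer/app.py:

```python
from fastapi import FastAPI, Form
from transformers import pipeline

app = FastAPI()
summarizer = pipeline("summarization", model="path/to/fine-tuned-bart")  # assumed checkpoint path

@app.get("/")                                # Root endpoint
def root():
    return {"message": "Text summarization API is running"}

@app.post("/summarize/text")                 # Summarize Text
def summarize_text(text: str = Form(...)):
    summary = summarizer(text, truncation=True)[0]["summary_text"]
    return {"summary": summary}

# The URL and file endpoints follow the same pattern: extract raw text from the
# URL or uploaded document first (see the extractor sketch below), then pass it
# to the same summarization pipeline.
```

Run it locally with, for example, `uvicorn app:app --host 0.0.0.0 --port 8000` from the summarizer/ directory.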
- Extract text from various sources (URLs, PDF, DOCX) using BeautifulSoup and fitz (a minimal extraction sketch follows).
- Source Code:
summarizer/extractors.py
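A minimal sketch of such extractors; the function names are illustrative, and DOCX handling via python-docx is an assumption on top of the BeautifulSoup and fitz usage described above:

```python
import requests
import fitz                              # PyMuPDF
from bs4 import BeautifulSoup
from docx import Document                # python-docx (assumed for DOCX files)

def extract_from_url(url: str) -> str:
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

def extract_from_pdf(path: str) -> str:
    with fitz.open(path) as doc:
        return " ".join(page.get_text() for page in doc)

def extract_from_docx(path: str) -> str:
    return "\n".join(p.text for p in Document(path).paragraphs)
```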
- Implemented the extractive summarizer module, following the approach in: src/Extractive_Summarization.ipynb
- Source Code:
summarizer/extractive_summary.py
- Developed a user-friendly interface using HTML, CSS, and JavaScript.
- Source Code:
summarizer/templates/index.html
- Developed a Dockerfile to build a Docker image for the FastAPI application.
- Source Code:
summarizer/Dockerfile
- Image: Docker Image
- Developed a CI/CD pipeline using Docker, Azure, and GitHub Actions.
- Utilized Azure Container Instances (ACI) for deploying the image; the pipeline triggers on every push to the main branch.
- Source Code:
.github/workflows/main.yml (AWS)
.github/workflows/azure.yml (Azure)
- To use the Docker image, run:
docker pull mohankrishnagr/infosys_text-summarization:final
docker run -p 8000:8000 mohankrishnagr/infosys_text-summarization:final
Then check it out at:
http://localhost:8000/
Public IPv4 (AWS):
http://54.168.82.95/
Public IPv4 (Azure):
http://20.219.203.134:8000/
FQDN (Azure):
http://mohankrishnagr.centralindia.azurecontainer.io:8000/
- Screenshots:
Thank you for your interest in our project! We welcome any feedback. Feel free to reach out to us.