
This repository contains the implementation of a Transformer-based model for abstractive text summarization and a rule-based approach for extractive text summarization.


Infosys_Text-Summarization

A project by Mohan Krishna G R, AI/ML Intern @ Infosys Springboard, Summer 2024.

Contents

Problem Statement

  • Developing an automated text summarization system that can accurately and efficiently condense large bodies of text into concise summaries is essential for enhancing business operations.
  • This project aims to deploy NLP techniques to create a robust text summarization tool capable of handling various types of documents across different domains.
  • The system should deliver high-quality summaries that retain the core information and contextual meaning of the original text.

Project Statement

  • Text Summarization focuses on converting large bodies of text into a few sentences summing up the gist of the larger text.
  • There is a wide variety of applications for text summarization including News Summary, Customer Reviews, Research Papers, etc.
  • This project aims to understand the importance of text summarization and apply different techniques to fulfill the purpose.

Approach to Solution

  • Figure: Intended Plan

Background Research

  • Literature Review

Solution

  • Selected Deep Learning Architecture

Workflow

  • Workflow for Abstractive Text Summarizer:

  • Workflow for Extractive Text Summarizer:

Data Collection

  • Data preprocessing implemented in src/data_preprocessing.
  • Data collected from different sources:
    • CNN/Daily Mail: News
    • BillSum: Legal
    • ArXiv: Scientific
    • DialogSum: Conversations
  • Data integration ensures robust, multi-domain data covering news articles, legal documents (acts and judgements), scientific papers, and conversations.
  • Validated the data through data statistics and exploratory data analysis (EDA), with frequency plots for every data source.
  • Data cleansing optimized for NLP tasks: removed null records, lowercased text, stripped punctuation, removed stop words, and lemmatized tokens.
  • Data split with scikit-learn into training, validation, and test sets, saved in CSV format (a sketch follows this list).
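
A minimal sketch of this cleansing-and-splitting step (the file name, column names, and split ratios are illustrative assumptions, not the repository's exact code):

# Illustrative sketch of the cleansing and splitting pipeline (names are hypothetical).
import string

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split

nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, drop stop words, and lemmatize."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(LEMMATIZER.lemmatize(t) for t in text.split() if t not in STOP_WORDS)

# "combined_dataset.csv" and its column names are assumptions for illustration.
df = pd.read_csv("combined_dataset.csv").dropna(subset=["article", "summary"])
df["article"] = df["article"].map(clean_text)

# 80/10/10 train/validation/test split, saved as CSV.
train, rest = train_test_split(df, test_size=0.2, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
train.to_csv("train.csv", index=False)
val.to_csv("val.csv", index=False)
test.to_csv("test.csv", index=False)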

Abstractive Text Summarization

Model Training & Evaluation

  • Training:
    • Selected a transformer architecture for abstractive summarization: fine-tuning a pre-trained model.
    • Chose Facebook's BART-large model for its performance and manageable number of trainable parameters.
      • 406,291,456 trainable parameters (see the sketch after this list).

  • Methods:
    • Native PyTorch Implementation
    • Trainer API Implementation
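
The parameter count above can be reproduced in a few lines of transformers code (a sketch; the facebook/bart-large checkpoint name is an assumption based on the model choice above):

# Sketch: load BART-large and count its trainable parameters.
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_trainable:,} trainable parameters")  # ~406M for BART-large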

Method 1 - Native PyTorch

  • Trained the model using a manual training loop and evaluation loop in PyTorch (a sketch follows this list). Implemented in: src/model.ipynb
  • Model Evaluation: Source code: src/evaluation.ipynb
    • Obtained inconsistent results during inference.
    • ROUGE-1 (F-measure) = 0.018
    • A suspected tensor error during training may explain the inconsistency of the model's output.
    • Rejected for further deployment.
    • An alternative approach was clearly needed.
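
For reference, a manual loop of the kind described above looks roughly like the following (a minimal, self-contained sketch with a toy batch, not the notebook's exact code):

# Sketch of a native PyTorch fine-tuning step; a real run iterates a DataLoader.
import torch
from torch.optim import AdamW
from transformers import BartForConditionalGeneration, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Toy batch to keep the sketch self-contained.
batch = tokenizer(["a long article ..."], text_target=["a short summary"],
                  padding=True, truncation=True, return_tensors="pt")
batch = {k: v.to(device) for k, v in batch.items()}

model.train()
for epoch in range(10):            # the project trained for 10 epochs
    outputs = model(**batch)       # loss is computed internally when labels are passed
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()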

Method 2 – Trainer Class Implementation

  • Utilized the Trainer API from Hugging Face for optimized transformer model training (a condensed sketch follows this list). Implemented in: src/bart.ipynb

    • The model was trained on the whole dataset for 10 epochs (125,420 steps), taking 26:24:22 (HH:MM:SS).
  • Evaluation: Performance measured with ROUGE scores. Source code: src/rouge.ipynb

    • Method 2 results outperformed those of Method 1.
    • ROUGE-1 (F-measure) = 61.32, a benchmark-grade score.
      • Significantly higher than scores typically reported for state-of-the-art models on common datasets.
    • For reference, GPT-4's ROUGE-1 (F-measure) for text summarization is 63.22.
    • Selected for further deployment.
  • Comparative analysis showed significant improvement in performance after fine-tuning. Source code: src/compare.ipynb
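
A condensed sketch of a Trainer-based setup of this kind (argument values and the dataset variables are illustrative; the actual configuration lives in src/bart.ipynb):

# Condensed sketch of Trainer-based fine-tuning (dataset preparation omitted).
from transformers import (BartForConditionalGeneration, BartTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

args = Seq2SeqTrainingArguments(
    output_dir="bart-summarizer",      # illustrative output directory
    num_train_epochs=10,               # matches the 10-epoch run above
    per_device_train_batch_size=4,     # illustrative value
    logging_steps=500,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,       # pre-tokenized datasets, assumed to be
    eval_dataset=val_dataset,          # prepared as in the preprocessing step
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()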


Extractive Text Summarization

  • Rather than a computationally intensive deep-learning model, a rule-based approach yields an efficient solution here. The method combines the sentence matrix obtained from TF-IDF with K-Means clustering.
  • This extends topic modeling to multiple lower-level specialized entities (i.e., groups of sentences) embedded in a single document, operating at both the individual-document and cluster level.
  • The sentence closest to each centroid (by Euclidean distance) is selected as the representative sentence for that cluster.
  • Implementation: preprocess the text, extract features using TF-IDF, and summarize by selecting representative sentences (a sketch follows this list).
    • Source code for implementation & evaluation: src/Extractive_Summarization.ipynb
    • ROUGE-1 (F-measure) = 24.71
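
A minimal sketch of the TF-IDF + K-Means selection described above (the cluster count is an illustrative choice, not the notebook's setting):

# Sketch: cluster sentences with TF-IDF + K-Means, pick the sentence nearest each centroid.
import nltk
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt")

def extractive_summary(text: str, n_clusters: int = 3) -> str:
    sentences = nltk.sent_tokenize(text)
    if len(sentences) <= n_clusters:
        return text
    tfidf = TfidfVectorizer().fit_transform(sentences)   # sentence-term matrix
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(tfidf)
    picks = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        # Euclidean distance from each member sentence to its cluster centroid.
        dists = np.linalg.norm(tfidf[idx].toarray() - km.cluster_centers_[c], axis=1)
        picks.append(idx[dists.argmin()])
    return " ".join(sentences[i] for i in sorted(picks))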

Testing

  • Implemented a text summarization application using the Gradio library, providing a web-based interface for testing the model's inference (a sketch follows this list).
  • Source Code: src/interface.ipynb
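
A sketch of how such a Gradio front end is typically wired (the checkpoint directory is a placeholder, not the repository's actual path):

# Sketch: Gradio front end over the fine-tuned summarizer.
import gradio as gr
from transformers import pipeline

# "./bart-summarizer" is a placeholder for the fine-tuned checkpoint directory.
summarizer = pipeline("summarization", model="./bart-summarizer")

def summarize(text: str) -> str:
    return summarizer(text, max_length=150, min_length=30)[0]["summary_text"]

demo = gr.Interface(fn=summarize,
                    inputs=gr.Textbox(lines=10, label="Input text"),
                    outputs=gr.Textbox(label="Summary"))
demo.launch()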

Deployment


Application

  • File Structure: summarizer/

API Endpoints

  • Developed using the FastAPI framework, handling URLs, files, and direct text input (a trimmed sketch follows this list).
    • Source Code: summarizer/app.py
  • Endpoints:
    • Root Endpoint
    • Summarize URL
    • Summarize File
    • Summarize Text
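
A trimmed sketch of how these endpoints might be wired in FastAPI (endpoint paths and helper names are illustrative assumptions; the real routes live in summarizer/app.py):

# Sketch of the FastAPI routes (summarization logic stubbed out).
from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel

app = FastAPI()

def summarize(text: str) -> str:
    """Placeholder: the real app runs the fine-tuned BART model here."""
    return text[:200]

class TextIn(BaseModel):
    text: str

@app.get("/")
def root():
    return {"message": "Text Summarization API"}

@app.post("/summarize/text")
def summarize_text(payload: TextIn):
    return {"summary": summarize(payload.text)}

@app.post("/summarize/url")
def summarize_url(url: str):
    from summarizer.extractors import extract_from_url  # assumed helper (next section)
    return {"summary": summarize(extract_from_url(url))}

@app.post("/summarize/file")
async def summarize_file(file: UploadFile = File(...)):
    data = await file.read()
    return {"summary": summarize(data.decode(errors="ignore"))}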

Extractor Modules

  • Extract text from various sources (URLs, PDF, DOCX) using BeautifulSoup and fitz (a sketch follows this list).
  • Source Code: summarizer/extractors.py
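
A minimal sketch of such extractors (function names are illustrative; DOCX handling, e.g. via python-docx, is omitted for brevity; the repository's implementations live in summarizer/extractors.py):

# Sketch: text extraction from a URL and a PDF (illustrative helpers).
import fitz  # PyMuPDF
import requests
from bs4 import BeautifulSoup

def extract_from_url(url: str) -> str:
    """Fetch a page and keep only its paragraph text."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

def extract_from_pdf(path: str) -> str:
    """Concatenate the text of every page in a PDF."""
    with fitz.open(path) as doc:
        return " ".join(page.get_text() for page in doc)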

Extractive Summary Script

  • Implemented the extractive summarizer module, following the approach in src/Extractive_Summarization.ipynb.
  • Source Code: summarizer/extractive_summary.py

User Interface

  • Developed a user-friendly interface using HTML, CSS, and JavaScript.
  • Source Code: summarizer/templates/index.html

Containerization

  • Developed a Dockerfile to build a Docker image for the FastAPI application.
  • Source Code: summarizer/Dockerfile
  • Image: mohankrishnagr/infosys_text-summarization (Docker Hub)

CI/CD Pipeline

  • Developed a CI/CD pipeline using Docker, Azure, and GitHub Actions.
  • Utilized Azure Container Instances (ACI) for deploying the image; the pipeline triggers on every push to the main branch.
  • Source Code:
    • .github/workflows/main.yml (AWS)
    • .github/workflows/azure.yml (Azure)
  • To use the Docker image, run:
docker pull mohankrishnagr/infosys_text-summarization:final
docker run -p 8000:8000 mohankrishnagr/infosys_text-summarization:final

Then check it out at:

http://localhost:8000/

Deployed on AWS EC2 (not recommended under the free trial)

Public IPv4:

http://54.168.82.95/

Deployed on Azure Container Instances (recommended)

Public IPv4:

http://20.219.203.134:8000/

FQDN

http://mohankrishnagr.centralindia.azurecontainer.io:8000/
  • Screenshots:





End Note

Thank you for your interest in our project! We welcome any feedback. Feel free to reach out to us.
