AWS Textract Project

New V2 Demo

textract-demo-v2.mp4

This project aims to create a robust application that utilizes AWS Textract to extract information from structured and semi-structured PDF documents, and OpenAI's ChatGPT to interact with the application and analyze the results. These documents may contain both handwritten and printed text.

About AWS Textract

AWS Textract is a service that automatically extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

About ChatGPT

ChatGPT is a language model developed by OpenAI. In this application, we use it to interact with users and analyze the results of AWS Textract.

Getting Started

To use this application, follow the steps below:

Prerequisites

You will need an AWS account, OpenAI account, the AWS CLI (Command Line Interface) installed on your local machine, and setup your OpenAI and AWS credentials to authenticate with their services. Here is a guide to set up AWS. For OpenAI setup, please refer to the OpenAI API documentation.

Installing

Install Anaconda on your machine. Visit the Anaconda website for installation instructions.
Create a new conda environment and install the required dependencies by running the following commands:

$ conda env create -f environment.yml
$ conda activate aws-text

Running the Application

Make sure you have activated the aws-text conda environment.
Run the application.py file to start the application. This file sets up a local server that accepts PDF uploads, runs AWS Textract on the uploaded PDFs, then uses ChatGPT to interact with the application and analyze the results.

$ python application.py

Access the application by opening your web browser and navigating to http://localhost:5000.
Upload your PDF files through the web interface. The results of AWS Textract will be displayed on the page, and you can also interact with ChatGPT to analyze the results. You can download the results in JSON, PDF (with bounding boxes), and CSV formats.

View results

The outputs are located in app/results/textract_results, app/results/bounding_box_results, and app/results/table_results.

Built With

AWS Textract - Text extraction service
ChatGPT - A interactive LLM
Flask - Web framework

Author

Brady Mitchelmore - Initial work - bradymitchelmore

Name		Name	Last commit message	Last commit date
Latest commit History 170 Commits
app		app
documents		documents
tests		tests
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AWS Textract Project

New V2 Demo

About AWS Textract

About ChatGPT

Getting Started

Prerequisites

Installing

Running the Application

View results

Built With

Author

About

Releases

Packages

Languages

Bmitch44/textract-demo

Folders and files

Latest commit

History

Repository files navigation

AWS Textract Project

New V2 Demo

About AWS Textract

About ChatGPT

Getting Started

Prerequisites

Installing

Running the Application

View results

Built With

Author

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages