This architecture uses click-to-deploy to create a pipeline that extracts data from documents with Document AI and stores the extracted data in BigQuery. It leverages AI summarization to provide concise overviews of the extracted insights, improving accessibility and efficiency.
It combines the Google Document AI form processor with the power of a scalable data warehouse, BigQuery, enabling organizations to automate the extraction of structured data from various types of documents, such as forms, invoices, and receipts.
In this architecture, documents are uploaded to Google Cloud Storage. An event trigger detects new document uploads and invokes a Cloud Function, which uses the Google Document AI form processor, a machine learning-based service, to analyze the documents and extract structured data from them.
The Document AI form processor applies machine learning models to automatically identify form fields, extract their values, and map them to appropriate data types. It leverages advanced techniques such as optical character recognition (OCR), natural language processing, and entity extraction. Along with structured data extraction, the processor can generate concise AI-powered summaries of the document's key points. The extracted form data is then saved to BigQuery, where organizations can use BigQuery's powerful querying, data visualization, and machine learning capabilities to gain insights from it.
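To make the flow concrete, here is a minimal sketch of such a storage-triggered Cloud Function in Python. It is an illustration, not the code this deployment ships: the project ID, processor ID, BigQuery table, and the `layout_text` helper are all assumed names you would replace with the values your own deployment uses.

```python
# Minimal sketch, assuming a background (1st-gen) Cloud Function triggered
# by a finalized object in the input bucket. All IDs below are placeholders.
from google.cloud import bigquery
from google.cloud import documentai_v1 as documentai
from google.cloud import storage

PROJECT_ID = "your-project-id"      # assumption: replace with your project
LOCATION = "us"                     # assumption: Document AI processor region
PROCESSOR_ID = "your-processor-id"  # assumption: a Form Parser processor
BQ_TABLE = "your-project-id.doc_ai.form_fields"  # assumption: target table


def parse_form(event, context):
    """Parses an uploaded PDF with Document AI and loads fields into BigQuery."""
    bucket_name = event["bucket"]
    blob_name = event["name"]

    # Download the uploaded PDF from Cloud Storage.
    pdf_bytes = (
        storage.Client().bucket(bucket_name).blob(blob_name).download_as_bytes()
    )

    # Send the document to the Document AI Form Parser.
    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)
    result = client.process_document(
        request=documentai.ProcessRequest(
            name=name,
            raw_document=documentai.RawDocument(
                content=pdf_bytes, mime_type="application/pdf"
            ),
        )
    )

    # Collect key=value pairs from the detected form fields.
    document = result.document
    rows = []
    for page in document.pages:
        for field in page.form_fields:
            rows.append(
                {
                    "document": blob_name,
                    "key": layout_text(field.field_name, document),
                    "value": layout_text(field.field_value, document),
                }
            )

    # Stream the extracted fields into BigQuery.
    if rows:
        errors = bigquery.Client().insert_rows_json(BQ_TABLE, rows)
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")


def layout_text(layout, document):
    """Resolves a layout's text anchor back to the raw document text."""
    return "".join(
        document.text[int(seg.start_index):int(seg.end_index)]
        for seg in layout.text_anchor.text_segments
    )
```

The Form Parser returns each detected field as text anchors (offsets) into the raw document text, which is why the helper resolves segments rather than reading values directly.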
In summary, this architecture allows for seamless integration between the Document AI form processor, AI summarization capabilities, and BigQuery, enabling organizations to automate the extraction of structured data from documents, obtain quick summaries of essential content, and leverage the combined information for various analytics and decision-making purposes.
These are some examples of the use cases you can build on top of this architecture:
- Invoice Processing Automation: This pipeline can automate the extraction of key data from invoices, such as vendor details, invoice numbers, line-item information, and total amounts. Organizations can streamline their accounts payable processes, reduce manual data entry, and improve accuracy in invoice processing.
- Contract Management and Analysis: Organizations can use the pipeline to efficiently process and analyze contracts. Document AI can extract critical information from contracts, such as parties involved, key terms and conditions, effective dates, and obligations.
- Document Classification and Sorting: The pipeline can classify and sort large volumes of documents automatically. By leveraging Document AI's capabilities, it can analyze the content of documents, identify patterns, and classify them into specific categories.
The main components we will set up are as follows (to learn more about these products, click on the hyperlinks):
- Cloud Storage (GCS) bucket: For storing extracted data that must undergo some kind of transformation.
- BigQuery: Serverless and cost-effective enterprise data warehouse that works across clouds and scales with your data.
- Document AI: Extract structured data from documents and analyze, search, and store this data.
- Cloud Function: Run your code in the cloud with no servers or containers to manage with our scalable, pay-as-you-go functions as a service (FaaS) product.
Pricing Estimates - We have created a sample estimate based on the usage we see from new startups looking to scale. This estimate gives you an idea of how much this deployment would cost per month at that scale; you can extend it to the scale you prefer. Here's the link.
🕐 Estimated deployment time: 10 min
- Click on the Open in Google Cloud Shell button below.
- Run the prerequisites script to enable APIs and set Cloud Build permissions.
sh prereq.sh
Please note: new organizations have the 'Enforce Domain Restricted Sharing' policy enforced by default. You may have to edit the policy to allow public access to your Cloud Run instance. Please refer to this page for more information.
- Run the Cloud Build job:
gcloud builds submit . --config build/cloudbuild.yaml
If you face a problem with the Eventarc API during the deployment, please check out the known issues section.
Once you have deployed the solution successfully, upload the form.pdf to the input bucket using either the Cloud Console or gsutil:
gsutil cp assets/form.pdf gs://<YOUR PROJECT NAME>-doc-ai-form-input
Then, check the parsed results in the output bucket, in text (OCR) and JSON (key=value) formats.
Finally, check the JSON results in BigQuery.
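As a quick sanity check, a sketch along these lines lists the output objects and queries the table. The output bucket suffix and table name are assumptions based on the input bucket's naming convention; substitute whatever names your deployment actually created.

```python
# Hedged verification sketch: bucket suffix and table name are assumptions.
from google.cloud import bigquery, storage

PROJECT_ID = "your-project-id"  # assumption: replace with your project

# List the OCR text and key=value JSON files in the (assumed) output bucket.
for blob in storage.Client().list_blobs(f"{PROJECT_ID}-doc-ai-form-output"):
    print(blob.name)

# Inspect the extracted form fields loaded into the (assumed) BigQuery table.
query = f"SELECT key, value FROM `{PROJECT_ID}.doc_ai.form_fields` LIMIT 10"
for row in bigquery.Client().query(query).result():
    print(row.key, row.value)
```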
Execute the command below in Cloud Shell to delete the resources.
gcloud builds submit . --config build/cloudbuild_destroy.yaml
You might face the error below while running it for the first time.
Step #2 - "tf apply": │ Error: Error creating function: googleapi: Error 400: Cannot create trigger projects/doc-ai-test4/locations/us-central1/triggers/form-parser-868560: Invalid resource state for "": Permission denied while using the Eventarc Service Agent.
If you recently started to use Eventarc, it may take a few minutes before all necessary permissions are propagated to the Service Agent. Otherwise, verify that it has Eventarc Service Agent role.
This happens because the Eventarc permissions take some time to propagate. First, make sure you ran the prereq.sh script. Then, wait a few minutes and trigger the deploy job again. Please see the known issues for Eventarc.