Building a Document AI Pipeline with Google Cloud




  • Create a service account and key
  • Create a storage bucket
  • Run the prediction pipeline
  • Review the final predictions in BigQuery


Duration: 1 hour 30 minutes · 7 credits


Google Cloud Self-Paced Labs

Document AI is the practice of using AI and machine learning to extract data and insights from text and paper sources such as emails, PDFs, scanned documents, and more. In the past, capturing this unstructured or "dark data" has been an expensive, time-consuming, and error-prone process requiring manual data entry. Today, AI and machine learning have made great advances towards automating this process, enabling businesses to derive insights from, and take advantage of, this data that had been previously untapped.

In a nutshell, document AI allows you to:

  • Organize documents
  • Extract knowledge from documents
  • Increase processing speed

With Google Cloud, you can use individual Document AI products to build the pipeline of your dreams.


This lab is based on this blog:


To automate an entire document AI process, multiple machine learning models must be trained and then deployed, together with processing steps, in an end-to-end pipeline. This can be a daunting task, so for this lab you have been provided with sample code for a complete document AI system that resembles a data-entry workflow, capturing structured data from documents.

This example end-to-end document AI pipeline consists of two components:

  1. A training pipeline which formats the training data and uses AutoML to build Image Classification, Entity Extraction, Text Classification, and Object Detection models.
  2. A prediction pipeline which takes PDF documents from a specified Cloud Storage bucket, uses the AutoML models to extract the relevant data from the documents, and stores the extracted data in BigQuery for further analysis.
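The final step of the prediction pipeline — loading the extracted data into BigQuery — can be sketched as follows. The project, dataset, and table names and the shape of the extracted fields are placeholders for illustration, not the lab's actual schema:

```python
"""Sketch of the prediction pipeline's last step: streaming extracted
fields into BigQuery. Table names and field layout are illustrative
assumptions, not the lab's real schema."""


def to_bigquery_row(gcs_uri, extracted):
    """Flatten one document's extracted fields into a BigQuery row dict,
    keeping a pointer back to the source PDF."""
    row = {"source_pdf": gcs_uri}
    row.update(extracted)
    return row


def main():
    # Requires google-cloud-bigquery and application credentials.
    from google.cloud import bigquery  # assumed installed

    client = bigquery.Client()
    rows = [
        to_bigquery_row(
            "gs://my-bucket/patents/doc1.pdf",  # placeholder URI
            {"applicant": "ACME Corp", "publication_date": "2001-03-14"},
        )
    ]
    # insert_rows_json streams JSON rows into an existing table.
    errors = client.insert_rows_json("my-project.patent_demo.predictions", rows)
    if errors:
        raise RuntimeError(errors)


if __name__ == "__main__":
    main()
```

The pure `to_bigquery_row` helper keeps the cloud call separate, so the row-shaping logic can be exercised without credentials.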

Training Data

The training data for this example pipeline is from a public dataset containing PDFs of U.S. and European patent title pages with a corresponding BigQuery table of manually entered data from the title pages. The dataset is hosted by the Google Public Datasets Project.
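As a sketch, pulling rows from the training table might look like this; the table name below is a placeholder, since the lab provisions the actual dataset:

```python
"""Sketch of pulling the patent training table from BigQuery.
The table name is a placeholder -- the lab supplies the real one."""


def build_training_query(table, limit=100):
    """Build a simple SELECT over the training table."""
    return f"SELECT * FROM `{table}` LIMIT {limit}"


def main():
    # Requires google-cloud-bigquery and application credentials.
    from google.cloud import bigquery  # assumed installed

    client = bigquery.Client()
    query = build_training_query("my-project.patent_demo.training_data", limit=10)
    for row in client.query(query).result():
        print(dict(row))


if __name__ == "__main__":
    main()
```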

Part 1: The Training Pipeline

The training pipeline consists of the following steps:

  • Training data is pulled from the BigQuery public dataset. The training BigQuery table includes links to PDF files in Cloud Storage of patents from the United States and European Union.

  • The PDF files are converted to PNG files and uploaded to a new Cloud Storage bucket in your own project. The PNG files will be used to train the AutoML Vision models.

  • The PNG files are run through the Cloud Vision API to create TXT files containing the raw text from the converted PDFs. These TXT files are used to train the AutoML Natural Language models.

  • The links to the PNG or TXT files are combined with the labels and features from the BigQuery table into a CSV file in the training data format required by AutoML. This CSV is then uploaded to a Cloud Storage bucket. Note: This format is different for each type of AutoML model.

  • This CSV is used to create an AutoML dataset and model, both named in the format patent_demo_data_%m%d%Y_%H%M%S. Note that AutoML models can take several hours to train.
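The PDF-to-PNG conversion and OCR steps above can be sketched as follows. The file-name helpers are generic; the `main()` body assumes `pdf2image` (which needs poppler) and `google-cloud-vision` are installed and authenticated, and all bucket and file names are placeholders rather than the pipeline's real ones:

```python
"""Sketch of the PDF -> PNG -> OCR text steps of the training pipeline.
Bucket and file names are placeholders."""
import posixpath


def png_name(pdf_name, page=0):
    """Derive a PNG object name for one converted page of a PDF."""
    stem = posixpath.splitext(posixpath.basename(pdf_name))[0]
    return f"{stem}-page{page}.png"


def txt_name(png_uri):
    """Derive the TXT object name holding the OCR text for a PNG."""
    return posixpath.splitext(posixpath.basename(png_uri))[0] + ".txt"


def main():
    from pdf2image import convert_from_path  # assumed installed (needs poppler)
    from google.cloud import vision          # assumed installed/authenticated

    # 1. Convert each PDF page to a PNG (the lab uploads these to GCS).
    for i, page in enumerate(convert_from_path("patent.pdf", dpi=300)):
        page.save(png_name("patent.pdf", i))

    # 2. Run Cloud Vision OCR over an uploaded PNG and keep the raw text.
    client = vision.ImageAnnotatorClient()
    image = vision.Image(
        source=vision.ImageSource(image_uri="gs://my-bucket/patent-page0.png")
    )
    response = client.document_text_detection(image=image)
    with open(txt_name("patent-page0.png"), "w") as f:
        f.write(response.full_text_annotation.text)


if __name__ == "__main__":
    main()
```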
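The CSV step can also be sketched. AutoML Vision's image-classification import format is one row per example, `SET,gs://path,label`, with an optional split column (TRAIN/VALIDATION/TEST, or UNASSIGNED to let AutoML split automatically); the paths and labels below are placeholders:

```python
"""Sketch of building the AutoML image-classification import CSV:
one row per example in the form SET,gs://path,label. Paths and
labels are placeholders."""


def automl_image_csv_rows(items, split="UNASSIGNED"):
    """items: iterable of (gcs_uri, label) pairs; returns CSV lines."""
    return [f"{split},{uri},{label}" for uri, label in items]


if __name__ == "__main__":
    rows = automl_image_csv_rows([
        ("gs://my-bucket/us_0001-page0.png", "us_patent"),
        ("gs://my-bucket/eu_0002-page0.png", "eu_patent"),
    ])
    # The lab uploads this file to a Cloud Storage bucket for AutoML.
    with open("patent_demo_data.csv", "w") as f:
        f.write("\n".join(rows))
```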
