Building a Document AI Pipeline with Google Cloud





1 hour 30 minutes · 7 Credits


Google Cloud Self-Paced Labs

Document AI is the practice of using AI and machine learning to extract data and insights from text and paper sources such as emails, PDFs, scanned documents, and more. In the past, capturing this unstructured or "dark data" has been an expensive, time-consuming, and error-prone process requiring manual data entry. Today, AI and machine learning have made great advances towards automating this process, enabling businesses to derive insights from, and take advantage of, this data that had been previously untapped.

In a nutshell, document AI allows you to:

  • Organize documents
  • Extract knowledge from documents
  • Increase processing speed

With Google Cloud, you can use individual Document AI products to build the pipeline of your dreams.


This lab is based on a Google Cloud blog post.


To automate an entire document AI process, multiple machine learning models must be trained and then deployed, together with processing steps, into an end-to-end pipeline. This can be daunting, so in this lab you are provided with sample code for a complete document AI system, similar to a data-entry workflow that captures structured data from documents.

This example end-to-end document AI pipeline consists of two components:

  1. A training pipeline which formats the training data and uses AutoML to build Image Classification, Entity Extraction, Text Classification, and Object Detection models.
  2. A prediction pipeline which takes PDF documents from a specified Cloud Storage bucket, uses the AutoML models to extract the relevant data from the documents, and stores the extracted data in BigQuery for further analysis.
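The shape of the prediction pipeline in step 2 can be sketched as a small orchestration function. Note that `run_models` and `download` are hypothetical stand-ins for the AutoML prediction calls and the Cloud Storage download, and the column names are illustrative, not the lab's actual schema:

```python
def build_rows(pdf_uris, run_models, download):
    """Assemble BigQuery rows for each PDF in a bucket listing.

    pdf_uris: gs:// URIs from a Cloud Storage bucket listing.
    run_models: callable taking PDF bytes and returning a dict of
        extracted fields (stands in for the AutoML prediction calls).
    download: callable fetching the object bytes for a URI.
    """
    rows = []
    for uri in pdf_uris:
        # The pipeline only processes PDF documents.
        if not uri.endswith(".pdf"):
            continue
        fields = dict(run_models(download(uri)))
        # Keep a pointer back to the source document for analysis.
        fields["source_pdf"] = uri
        rows.append(fields)
    return rows
```

In the lab itself, the URIs would come from listing the specified Cloud Storage bucket and the resulting rows would be streamed into a BigQuery table; injecting the two callables just keeps this sketch testable without credentials.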

Training Data

The training data for this example pipeline is from a public dataset containing PDFs of U.S. and European patent title pages with a corresponding BigQuery table of manually entered data from the title pages. The dataset is hosted by the Google Public Datasets Project.
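Because the training table references its PDFs by Cloud Storage link, the pipeline has to split each gs:// link into a bucket name and an object path before it can download anything. A minimal helper for that (our own sketch, not part of the lab code) might look like:

```python
def split_gcs_uri(uri):
    """Split a gs:// URI into (bucket, object path)."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a GCS URI: {uri}")
    bucket, _, blob = uri[len("gs://"):].partition("/")
    return bucket, blob
```

The bucket/path pair can then be handed to a Cloud Storage client to fetch the patent title-page PDF.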

Part 1: The Training Pipeline

The training pipeline consists of the following steps:

  • Training data is pulled from the BigQuery public dataset. The training BigQuery table includes links to PDF files in Cloud Storage of patents from the United States and European Union.

  • The PDF files are converted to PNG files and uploaded to a new Cloud Storage bucket in your own project. The PNG files will be used to train the AutoML Vision models.

  • The PNG files are run through the Cloud Vision API to create TXT files containing the raw text from the converted PDFs. These TXT files are used to train the AutoML Natural Language models.

  • The links to the PNG or TXT files are combined with the labels and features from the BigQuery table into a CSV file in the training data format required by AutoML. This CSV is then uploaded to a Cloud Storage bucket. Note: This format is different for each type of AutoML model.

  • This CSV is used to create an AutoML dataset and model. Both are named in the format patent_demo_data_%m%d%Y_%H%M%S. Some AutoML models can take hours to train.
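The last two steps above can be sketched in code. The timestamped naming format comes straight from the lab; the CSV builder assumes the AutoML Vision image-classification layout of (split, image URI, label), and, as noted above, each AutoML model type expects its own column layout, so this helper covers only one case:

```python
import csv
import datetime
import io


def automl_display_name(prefix="patent_demo_data"):
    """Timestamped name used for both the AutoML dataset and model."""
    return datetime.datetime.now().strftime(prefix + "_%m%d%Y_%H%M%S")


def classification_csv(rows):
    """Build AutoML Vision image-classification training CSV text.

    rows: iterable of (split, png_gcs_uri, label), where split is
    TRAIN, VALIDATION, or TEST. Other AutoML model types (entity
    extraction, text classification, object detection) require a
    different column layout.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()
```

The resulting CSV text would then be uploaded to the Cloud Storage bucket and passed to AutoML when creating the dataset.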
