arrow_back

Explore flat and nested data types in BigQuery

Sign in Join
Test and share your knowledge with our community!
done
Get access to over 700 hands-on labs, skill badges, and courses

Explore flat and nested data types in BigQuery

Lab 1 hour 30 minutes universal_currency_alt 2 Credits show_chart Introductory
Test and share your knowledge with our community!
done
Get access to over 700 hands-on labs, skill badges, and courses
important icon IMPORTANT:

desktop/labtop icon Make sure to complete this hands-on lab on a desktop/laptop only.

check icon There are only 5 attempts permitted per lab.

quiz target icon As a reminder – it is common to not get every question correct on your first try, and even to need to redo a task; this is part of the learning process.

timer icon Once a lab is started, the timer cannot be paused. After 1 hour and 30 minutes, the lab will end and you’ll need to start again.

tip icon For more information review the Lab technical tips reading.

Activity overview

Understanding data types and structures is essential for working effectively with data in BigQuery. Flat data types store a single value, such as a number or text string, while nested data types can store multiple values or complex structures with various data types. Knowing the difference between flat and nested data types helps you organize and query data in BigQuery efficiently.

In this lab, you'll explore flat and nested data types in BigQuery and learn how to use these data types effectively in queries.

Scenario

You work as a cloud data analyst for TheLook eCommerce, a global clothing company that specializes in innovative fashion and ethical and sustainable sourcing. The head merchandiser, Meredith, has asked for your help with two data-related problems. First, Meredith is having trouble running a query to generate a summary of sales by country and year. Second, they are concerned that ineffective marketing has caused some products to underperform in specific countries, but they’re unsure how to use existing data to gain the insights she needs. To help Meredith, you'll need to apply your knowledge of flat and nested data types.

Here’s how you’ll do this task: First, you’ll explore data types in BigQuery. Next, you’ll troubleshoot a query to learn more about how data types work. Then, you’ll explore nested data types and how they are different from flat data types. Finally, you’ll work with nested data types to track down why one product is not selling in Singapore and deliver valuable insight that can help Meredith get the answers they need.

Setup

Before you click Start Lab

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This practical lab lets you do the activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
Note: Use an Incognito or private browser window to run this lab. This prevents any conflicts between your personal account and the Student account, which may cause extra charges incurred to your personal account.
  • Time to complete the lab---remember, once you start, you cannot pause a lab.
Note: If you already have your own personal Google Cloud account or project, do not use it for this lab to avoid extra charges to your account.

How to start your lab and sign in to the Google Cloud console

  1. Click the Start Lab button. On the left is the Lab Details panel with the following:

    • Time remaining
    • The Open Google Cloud console button
    • The temporary credentials that you must use for this lab
    • Other information, if needed, to step through this lab
    Note: If you need to pay for the lab, a pop-up opens for you to select your payment method.
  2. Click Open Google Cloud console (or right-click and select Open Link in Incognito Window) if you are running the Chrome browser. The Sign in page opens in a new browser tab.

    Tip: You can arrange the tabs in separate, side-by-side windows to easily switch between them.

    Note: If the Choose an account dialog displays, click Use Another Account.
  3. If necessary, copy the Google Cloud username below and paste it into the Sign in dialog. Click Next.

{{{user_0.username | "Google Cloud username"}}}

You can also find the Google Cloud username in the Lab Details panel.

  1. Copy the Google Cloud password below and paste it into the Welcome dialog. Click Next.
{{{user_0.password | "Google Cloud password"}}}

You can also find the Google Cloud password in the Lab Details panel.

Important: You must use the credentials the lab provides you. Do not use your Google Cloud account credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
  1. Click through the subsequent pages:
    • Accept the terms and conditions
    • Do not add recovery options or two-factor authentication (because this is a temporary account)
    • Do not sign up for free trials

After a few moments, the Console opens in this tab.

Note: You can view the menu with a list of Google Cloud Products and Services by clicking the Navigation menu at the top-left. Google Cloud console menu with the Navigation menu icon highlighted

Task 1. Explore data types in BigQuery

Artem, the data architect, recommends exploring the table Meredith wants to access before troubleshooting their query. This will help you better understand the data, including the columns and data types.

In this task, you’ll explore a table’s schema to identify the data types and relevant columns and learn more about flat data types.

  1. In the Google Cloud console, from the Navigation menu (Navigation Menu icon), select BigQuery.
Note: The Welcome to BigQuery in the Cloud Console message box may appear, providing links to the quickstart guide and the release notes for UI updates. Click Done to proceed.
  1. In the Explorer pane, expand the list of datasets by clicking the drop-down arrow next to the project ID.
Note: You may need to select a project first. To do so, click Select a project in the Google Cloud console title bar, then select the project link from the Select a project dialog.
  1. Click the drop-down arrow next to the thelook_gcda dataset.
  2. In thelook_gcda dataset, select the orders_by_state table. The table schema displays.
  3. Locate the column names in the schema.
Note: In BigQuery Studio, a column refers to the field name.

  1. Each column has an associated data type. Locate the type in the schema.

A STRING is a sequence of characters that can include letters, numbers, and punctuation. When a STRING contains numbers, it is still read as text.

An INTEGER is a positive or negative whole number. Integers do not contain decimal points.

Both STRINGs and INTEGERs are flat data types because they contain a single value.

  1. Locate the Mode for each column in the schema.

In BigQuery, columns can be NULLABLE, REQUIRED, and REPEATED.

  • NULLABLE means that a column can contain NULL values or values that are missing or unknown.
  • REQUIRED means that a column cannot contain NULL values.
  • REPEATED indicates that the column contains an array of values of the column’s data type.

NULLABLE, REQUIRED, and REPEATED can be used with flat data types and nested RECORDs.

Task 2. Troubleshoot a query

Now that you’ve explored the data, you’re ready to help Meredith with their query. Meredith is trying to generate a summary of sales by country and by year. Currently, they are using the orders_by_state table, but they’re getting an error and can’t find the issue.

In this task, you’ll use your understanding of data types to troubleshoot and fix the query so that Meredith can create a summary of sales.

  1. In the BigQuery action bar, click the Untitled tab to display the Query Editor.
  2. Copy the following query into the Untitled tab:
SELECT sum(total_sales) as sum_total_sales FROM `thelook_gcda.orders_by_state`;

This is the query that Meredith tried to run to get the sum for the total_sales column.

Did you spot an exclamation mark in line one when you entered the query? This is a warning from BigQuery indicating that the query may contain errors. BigQuery scans all queries for potential errors before they are run, and it will alert you to any errors it finds.

The query within editor, which now displays a red error icon.

  1. Click Run.
  2. In the Query Results section, locate the error message. Error messages are useful ways of understanding why a query is not working. In this instance, the error message states that there is "No matching signature for aggregate function SUM for argument types: STRING".

To clarify, the signature of a function describes its parameters and data types. Currently, the SUM() function is taking in a STRING, a flat data type whose value is read as text. However, this is not a supported data type.

To fix the error, you'll need to change the query to convert the data type from a STRING to a data type that includes a number. Since you are working with numbers that may contain a decimal point, the NUMERIC data type is a good choice.

  1. Copy the following updated query into the Query Editor.
Note: Be sure to delete the previous query before copying the updated one. SELECT country, sum(cast(total_sales as numeric)) as sum_total_sales FROM `thelook_gcda.orders_by_state` GROUP BY country;
  1. Click Run.

CAST is a valuable tool to use in SQL because it converts data from one type to another. The query now works because it changes the data type of total_sales from a STRING to NUMERIC, a supported data type that is read as a number and can be used for mathematical calculations. This will resolve the error that was shown.

Now that the issue with total_sales is solved, you can help Meredith get the information they need.

  1. Copy the following query into the Query Editor:
SELECT country, order_year, sum(cast(total_sales as decimal)) as sum_total_sales FROM `thelook_gcda.orders_by_state` GROUP BY country, order_year ORDER BY country asc, order_year desc;

This query returns the orders by country and by year. The query will also sort the results by country name in ascending order, and by year in descending order, ensuring that the latest year appears first.

  1. Click Run.

Click Check my progress to verify that you have completed this task correctly.

Troubleshoot a query

Task 3. Explore nested data types

Meredith has asked you to create a report on the effectiveness of marketing for certain products in 2021. They suspect that a lack of customer awareness might have led to fewer sales in some markets, and a report can help determine if unsuccessful marketing was the cause. You’re unsure what data is needed, so you reach out to Artem, the data architect, for help. Artem explains that Google Analytics captures website views in the ga4_events table, and you can search for an event named view_item in the event_name column to find the data you need.

BigQuery can be tricky, as it contains two main types: arrays and STRUCTs. REPEATED columns are arrays, which means they can store multiple values of the same data type. RECORD columns are structs, which means they can store multiple values of different data types. You can even have an entire array that is made up of STRUCTS called a REPEATED RECORD column.

You thank Artem and get down to work.

In this task, you’ll explore nested data types in the ga4_events table and learn more about nested data types.

  1. In the thelook_gcda dataset, select the ga4_events table, and review the Schema tab.

A STRING, INTEGER, and FLOAT are flat data types, while a RECORD is a nested data type that functions as a special kind of STRUCT. STRUCTs enable you to treat various pieces of data as one unit, even if they are different data types. For instance, the geo RECORD comprises several columns, such as country, region, and city, which together represent a geographical reference for the event, or the location of the customer who viewed the product at the time of the event.

  1. Locate the columns nested under the geo column.

  1. Locate the event_name column. This is where you’ll find the event type you'll need to filter on.

  1. Locate the items column and review its mode.

The Repeated mode is exclusive to nested data types in BigQuery. When a column is marked as Repeated, it indicates that there can be multiple items for every row in the table. Additionally, repeated columns can hold multiple values of varying data types.

Task 4. Work with nested data types

After exploring nested data types, you're almost ready to assist Meredith in understanding how marketing in 2021 affected product sales.

Before that, you reach out to Artem for more advice. Artem tells you that when working with flat data types like STRING, you can reference them directly. But when working with nested data types like RECORD, you'll need to reference them using dot notation.

You thank Artem and start thinking about how to use the data you've explored to help Meredith. Since Meredith noticed that the sales of Google Dino Game Tees in Singapore were much lower than expected, you decide to track down how many people viewed the Dino Tee on the website in that market during 2021. This insight could help Meredith better understand if ineffective marketing may have cost the product sales.

In this task, you’ll generate a report of the number of views for the product Google Dino Game Tee in 2021 for Singapore, a market where very few Dino tees were sold.

  1. Copy the following query into the Query Editor:
SELECT event_date, event_name, geo FROM `thelook_gcda.ga4_events`;

This query retrieves the event_date, event_name, and geo from the ga4_events table and inspect the results.

  1. Click Run.

That is a lot of results. To view only the relevant information, you'll need to filter the events by type.

  1. Copy the following query into the Query Editor:
SELECT event_date, event_name, geo FROM `thelook_gcda.ga4_events` WHERE event_name = 'view_item';

This query returns only the rows that match the view_item event type.

  1. Click Run.

When creating a report, a good strategy is to start with the total number of rows in the dataset. Then, you can add filters to narrow down the results. It’s important to keep track of how many rows the filtered dataset has, so that you can understand how your filters are affecting the results.

To get the results you need for Meredith, you’ll need to filter the results by country. Country is a column nested within the geo-nested data type. To reference a column in a nested data type, you use dot notation.

Note: To use dot notation, add a period (.) between the name of the RECORD column from the name of the nested column that you want to access. For example, if you have a RECORD column called geo that has a nested column called country, the dot notation would be geo.country.
  1. Copy the following query into the Query Editor:
SELECT event_date, event_name, geo.country FROM `thelook_gcda.ga4_events` WHERE event_name = 'view_item' and geo.country = 'Singapore';

This query filters events by country, specifically Singapore. Note how country is shown to be nested inside of geo with geo.country.

  1. Click Run.

Finally, you need to filter on the product name. The following query adds the items column, which has the data type of RECORD and is marked as REPEATED. This means that it is a collection of RECORDs. In other words, the items column is an array of STRUCTs!

  1. Copy the following query into the Query Editor:
SELECT event_date, event_name, geo.country, items FROM `thelook_gcda.ga4_events` WHERE event_name = 'view_item' and geo.country = 'Singapore';
  1. Click Run.

  2. Scroll down to view the results.

Each row numbered under the Row column in the results represents an event, an item view. Notice that some of the event rows have multiple items listed within them, as is the case in row 19. This is how BigQuery shows columns that repeat within the same row.

The number of rows in the results remains the same at 65 since the query is still finding a total of 65 events. However, row 19 in the results set has 9 nested items in the items column. This means that although row 19 corresponds to 1 event, it actually contains 9 items.

Now, let’s try to list the item name in the query since Meredith is interested in the item with the name Google Dino Game Tee.

  1. Copy the following query into the Query Editor:
SELECT event_date, event_name, geo.country, items.item_name FROM `thelook_gcda.ga4_events` WHERE event_name = 'view_item' and geo.country = 'Singapore';
  1. Click Run.
  2. Examine how many rows are returned.

In this instance, none of the rows are returned. Instead, the query returns an error. BigQuery cannot determine which of the nested items is being referenced since items is a collection (in other words, an array that contains multiple values of the STRUCT data type).

Query error message 'Cannot access field item_name on a value with type ARRAY..'

One way to work with nested columns in a query is to simply un-nest those columns so they can be easily referenced. The operation is known as “flattening”, and uses the unnest function that can be referenced like a table.

Note: Flattening converts a nested data type into a flat data type that only contains one value.

The query that follows uses unnest on the column items and defines the item as an alias, and cross-joins with each individual event row. This will select all events where a user viewed an item in Singapore.

  1. Copy the following query into the Query Editor:
SELECT event.event_date, event_name, geo.country, item.item_name FROM `thelook_gcda.ga4_events` as event INNER JOIN unnest(items) as item WHERE event_name = 'view_item' AND geo.country = 'Singapore';
  1. Click Run.

In the previous step, you had 65 rows that matched the view_item event name and the country Singapore. However, after the UNNEST operation, you now have 571 rows! This is because the UNNEST operation flattens the items array, which means that each item in the array is now a separate row.

Note: This time you were able to reference the item_name without an issue. BigQuery copies the event_date, event_name, and geo.country to the newly created rows. If you scroll through the results, you'll notice that every row and column has only one value now. In other words, there are no nested columns; this is a flattened dataset.

Using the flattened dataset, add a filter so only products named Google Dino Game Tee are displayed in the result set.

  1. Copy the following query into the Query Editor:
SELECT event.event_date, event_name, geo.country, item.item_name from `thelook_gcda.ga4_events` as event INNER JOIN unnest(items) as item WHERE event_name = 'view_item' AND geo.country = 'Singapore' AND item.item_name = 'Google Dino Game Tee';
  1. Click Run.

How many rows are returned now? Did the marketing campaign generate a lot of views on the website?

It seems only three events match the criteria. This means that for the entire country of Singapore and the entire year of 2021, only three people looked at the Google Dino Game Tee product on the website. That's not a lot of views for a country with millions of people. It seems like you have valuable insight, so share it with Meredith to help them better understand if marketing impacted the product sales in 2021. Good job!

Click Check my progress to verify that you have completed this task correctly.

Work with nested data types

Summary

  • When working with queries, STRING, and number data types can be directly referenced by their names in the SELECT and WHERE clauses.
  • For columns of type RECORD, such as the geo column in this lab, a two-part name is used, such as geo.country, to reference specific nested columns.
  • Columns marked as REPEATED are collections that can have multiple rows, as demonstrated by the items column in this lab.
  • When dealing with columns in REPEATED mode, special referencing methods are employed in queries.
  • One way to flatten these nested structures is by using the UNNEST operator.

If you’d like to review the concepts covered in this lab, refer to the lesson on data types.

Conclusion

Great work!

As a cloud data analyst at TheLook eCommerce, you used your knowledge of flat and nested data types to help Meredith fix a query and gain insights into a lack of product sales in a specific market.

First, you explored flat data types and applied your knowledge to fix Meredith’s query so that they could run the sales report they needed. Then, you explored nested data types to help Meredith uncover why the Google Game Dino Tees were not selling in Singapore.

You’re well on your way to understanding how to use flat and nested data types to query data with BigQuery.

End your lab

Before you end the lab, make sure you’re satisfied that you’ve completed all the tasks. When you're ready, click End Lab and then click Submit.

Ending the lab will remove your access to the lab environment, and you won’t be able to access the work you've completed in it again.

Copyright 2024 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.