arrow_back

Identify different batch and streaming data sources

Sign in Join
Test and share your knowledge with our community!
done
Get access to over 700 hands-on labs, skill badges, and courses

Identify different batch and streaming data sources

Lab 1 hour 30 minutes universal_currency_alt 2 Credits show_chart Introductory
Test and share your knowledge with our community!
done
Get access to over 700 hands-on labs, skill badges, and courses
important icon IMPORTANT:

desktop/labtop icon Make sure to complete this hands-on lab on a desktop/laptop only.

check icon There are only 5 attempts permitted per lab.

quiz target icon As a reminder – it is common to not get every question correct on your first try, and even to need to redo a task; this is part of the learning process.

timer icon Once a lab is started, the timer cannot be paused. After 1 hour and 30 minutes, the lab will end and you’ll need to start again.

tip icon For more information review the Lab technical tips reading.

Activity overview

Timely access to data is critical for organizations to respond quickly to market changes, customer needs, and operational issues.

Batch processing is a method of collecting large volumes of data over a period of time, then processing it all at once. It is best for processing large amounts of data and tasks that do not require near real-time processing.

Streaming processing is a method of processing data as it is received. It is best for processing data continuously in real time. It is important for data analysts and data scientists to understand the difference between the two since both processes have their own advantages and disadvantages.

As a data analyst, knowing when and how to apply batch processing or streaming processing helps you optimize the performance of data processing tasks, minimize time delay in data processing, and provide more accurate and timely data insights.

In this lab, you’ll process and collect data for a specific purpose, and observe the data loading in BigQuery tables using both computer-assisted batch processing and stream processing methods.

Scenario

You’ve been asked to help Meredith, the lead merchandiser at TheLook eCommerce, monitor the results of their promotions and price changes.

Merchandisers set prices and make sure they sell the inventory they buy. For example, if t-shirts for a sports team are overstocked in a championship year, a retailer might decide to lower the price so the merchandise sells faster. That way, the store does not end up with unsold stock.

For this task, Meredith needs to view the number of items added to shopping carts two ways. Near real-time monitoring will allow Meredith to view the number of items added to shopping carts as they are available.

Minute by minute monitoring will allow Meredith to view the number of items added to each shopping cart in increments of one minute.

This data will help Meredith track the effectiveness of their promotions and price changes over time. They can then use this information to help improve the shopping experience for their customers and increase sales.

Artem, the data architect, points out that shopping cart activity is streamed into one of the tables in BigQuery. You need to help Meredith understand how to find the data she needs to monitor their merchandise.

Here’s how you'll do this task: First, you’ll search for your dataset and associated table. Next, you’ll run a query to display the time each product was added to the shopping cart. Then, you’ll rerun the query. Finally, you’ll examine the properties of the shopping_cart and order tables.

Setup

Before you click Start Lab

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This practical lab lets you do the activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
Note: Use an Incognito or private browser window to run this lab. This prevents any conflicts between your personal account and the Student account, which may cause extra charges incurred to your personal account.
  • Time to complete the lab---remember, once you start, you cannot pause a lab.
Note: If you already have your own personal Google Cloud account or project, do not use it for this lab to avoid extra charges to your account.

How to start your lab and sign in to the Google Cloud console

  1. Click the Start Lab button. On the left is the Lab Details panel with the following:

    • Time remaining
    • The Open Google Cloud console button
    • The temporary credentials that you must use for this lab
    • Other information, if needed, to step through this lab
    Note: If you need to pay for the lab, a pop-up opens for you to select your payment method.
  2. Click Open Google Cloud console (or right-click and select Open Link in Incognito Window) if you are running the Chrome browser. The Sign in page opens in a new browser tab.

    Tip: You can arrange the tabs in separate, side-by-side windows to easily switch between them.

    Note: If the Choose an account dialog displays, click Use Another Account.
  3. If necessary, copy the Google Cloud username below and paste it into the Sign in dialog. Click Next.

{{{user_0.username | "Google Cloud username"}}}

You can also find the Google Cloud username in the Lab Details panel.

  1. Copy the Google Cloud password below and paste it into the Welcome dialog. Click Next.
{{{user_0.password | "Google Cloud password"}}}

You can also find the Google Cloud password in the Lab Details panel.

Important: You must use the credentials the lab provides you. Do not use your Google Cloud account credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
  1. Click through the subsequent pages:
    • Accept the terms and conditions
    • Do not add recovery options or two-factor authentication (because this is a temporary account)
    • Do not sign up for free trials

After a few moments, the Console opens in this tab.

Note: You can view the menu with a list of Google Cloud Products and Services by clicking the Navigation menu at the top-left. Google Cloud console menu with the Navigation menu icon highlighted

Task 1. Determine batch versus streaming sources

In this task, you’ll help Meredith create a table that stores data about the number of orders and the time each order was added to a customer's cart.

  1. In the Google Cloud console Navigation menu (Navigation Menu), select BigQuery.
Note: The Welcome to BigQuery in the Cloud Console message box may appear, providing links to the quickstart guide and the release notes for UI updates. Click Done to proceed.
  1. Expand the list of datasets by clicking the drop-down arrow next to the Project ID.
Note: You may need to select a project first. To do so, click Select a project in the Google Cloud console title bar, then select the project link from the Select a project dialog.

If you get a pop up, click Ok.

  1. Find a dataset named thelook_gcda, and click the drop-down arrow next to it.

  2. Select the table named shopping_cart.

  3. Click the Query button to open the Query Editor, and select In a new tab. A pre populated Untitled tab opens.

  4. Replace the pre populated data by copying and pasting the following query in the Query Editor:

    SELECT * FROM `thelook_gcda.shopping_cart` ORDER BY created_at DESC LIMIT 10;

This query displays the first 10 rows of the shopping_cart table.

  1. Click Run.

The query results should display in table format below the Query Editor.

Hint: In the Query Editor you can click on the "+" icon to compose a new query.

  1. Note the time of the most recent entry, which will be displayed at the top of the table in Row 1.
  2. Click the Run button to run the query again.

Data is continuously added to this table, so queries can be run on the latest data as soon as it is available. This changes the output of the query.

  1. Click on the shopping_cart table in the Explorer area, then click on the Details tab in the query area. Examine the shopping_cart table properties in BigQuery. The section Streaming buffer statistics is an indication that data is being streamed into the table.
Tip: Refresh the page if it does not display right away. You may need to scroll through to navigate the statistics.
  1. Now, examine the properties of the orders table. Click on the orders table in the Explorer area, then click on the Details tab in the query area.

  1. Copy the following into the Query Editor:
SELECT * FROM `thelook_gcda.orders` ORDER BY created_at DESC LIMIT 10;
  1. Click Run display the most recent rows on the orders table.
  2. Click Run again to examine if there is a change in the result.

  1. Copy the following query into the Query Editor:
SELECT p.category, FORMAT_TIMESTAMP("%H:%M", sc.created_at) as added_at_minute, sum(sc.quantity) as sum_quantity FROM `thelook_gcda.shopping_cart` sc INNER JOIN `thelook_gcda.products` p ON p.id = sc.product_id WHERE p.category = 'Jeans' AND sc.created_at > timestamp_sub(current_timestamp(), INTERVAL 1 HOUR) GROUP BY p.category, added_at_minute ORDER BY added_at_minute DESC;

This query displays the number of products from the 'Jeans' product category that were added to the shopping_cart table each minute within the last hour.

  1. Run the query at least three times, waiting about ten seconds between runs, and observe the results.

This query is an example of a data source for a dashboard. Because the query is based on the shopping_cart table, which is constantly updated through a streaming source, the dashboard will get fresh data every time the query runs.

Click Check my progress to verify that you have completed this task correctly. Determine batch versus streaming sources

Now, let’s learn more about this query and its main components.

  1. Locate the names of the tables used in this query.

The query joins the thelook_gcda.shopping_cart and thelook_gcda.products tables using the product_id column. This ensures that only rows that match the product id on both tables are included in the results.

  1. Identify the type of join used in the query:
FROM `thelook_gcda.shopping_cart` sc INNER JOIN `thelook_gcda.products` p ON p.id = sc.product_id;

An INNER JOIN is used in this query. A shopping cart row with a value in the product_id that does not exist in the products table will not be included in the results, and therefore will not be counted.

Because Meredith asked for minute-by-minute information, the query formats the created_at column as HH:MM, for example 10:15 for a quarter past ten in the morning.

  1. Identify the following statement in the SELECT portion of the query:
FORMAT_TIMESTAMP("%H:%M", sc.created_at) AS added_at_minute,

Formatting dates and times is a very common task for a cloud data analyst. You can find more information and examples on how to use the built-in FORMAT_TIMESTAMP function in BigQuery’s Timestamp functions documentation.

  1. Locate the WHERE statement in this query.

The WHERE statement in this query filters shopping cart items by Product Category and by the time it was created:

WHERE p.category = 'Jeans' AND sc.created_at > timestamp_sub(current_timestamp(), INTERVAL 1 HOUR);

Since Meredith is only interested in Jeans, the following part of the query removes products not in that category with the first part of the WHERE clause:

p.category = 'Jeans'

The shopping cart table is very active and can be quite large, so Meredith also asked to see data only for the last hour. To accomplish this, the WHERE clause filters rows using the created_at column and uses two built in functions, one to get the date and time when the query runs (current_timestamp) and one to calculate the time for one hour ago (timestamp_add). The query will only return rows that were created before now minus one hour. Now is determined by the current_timestamp function. Minus one hour is done by subtracting 1 hour from the current time.

Note: The time displayed in query results is in the Universal Time Coordinated (UTC) time zone, and it might not align with the time at your location. Storing time in UTC is a common practice in global companies.

For more information, refer to the Timestamp functions documentation.

  1. Locate the following line in the query:
GROUP BY p.category, added_at_minute;

The following line groups the results by minute and adds up the quantities added for each minute because we used the SUM() function in our SELECT statement. Since you need to list the Product Category in the results, you must also group by Product Category.

Notice that you are using the alias of the added_at_minute column that was defined in the SELECT portion of the query.

  1. Locate the ORDER BY clause in the query:
ORDER BY added_at_minute DESC;

The ORDER BY clause populates the results with the most recent available data, by the minute, at the top.

Conclusion

Great work!

You were able to locate the information Meredith needed to stay on top of customer buying trends, which will help them make timely business decisions about pricing and stock.

You did this by first determining if the table was using batch or streaming processing, then exploring and running a query to populate a table with data about the number of jeans that were added to shopping carts in the last hour, by minute.

You also learned about the importance of timely data access and the difference between batch processing and streaming processing.

By observing data loading in BigQuery tables using both methods, you developed the skills to choose the appropriate approach for data processing.

End your lab

Before you end the lab, make sure you’re satisfied that you’ve completed all the tasks. When you're ready, click End Lab and then click Submit.

Ending the lab will remove your access to the lab environment, and you won’t be able to access the work you've completed in it again.

Copyright 2024 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.