Data Engineering Zoomcamp | Week 2.4 Configuring GCP for Mage

For this blog, it makes the most sense to provide the link to the video I'll be following:

https://www.youtube.com/watch?v=00LP360iYvE&list=PL3MmuxUbc_hJed7dXYoJw8DoCuVHhGEQb&index=23

I'll also provide high-level instructions for reference. And if you're using a virtual machine to run Mage (like me), I'll provide an alternative step where my setup differs from the YouTube video.

Introduction

In this module, we'll set up the Google Cloud components Mage needs to read data from and write data to both Google Cloud Storage and BigQuery. The rest of the tutorial will focus on Mage itself, but first we'll handle the Google Cloud setup.

Create GCP Bucket

  1. Create a Google Cloud Storage Bucket:
    • Search for Google Cloud Storage and go to the Buckets section.
    • Create a new bucket with a unique name (e.g., mage-zoomcamp-matt-palmer-3; bucket names are globally unique, so pick your own).
    • Select multi-region and enforce public access prevention.
    • Confirm and finalize the creation.
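As an aside (not covered in the video), the same bucket can be created programmatically with the google-cloud-storage client. This is just a sketch: it assumes you've already authenticated locally (e.g., with gcloud auth application-default login), and the project ID and bucket name are placeholders.

```python
# pip install google-cloud-storage
from google.cloud import storage

client = storage.Client(project='your-gcp-project-id')  # placeholder project ID

# Bucket names are globally unique, so pick your own; 'US' is the multi-region location.
bucket = client.create_bucket('mage-zoomcamp-your-unique-name', location='US')

# Enforce public access prevention, mirroring the console setting.
bucket.iam_configuration.public_access_prevention = 'enforced'
bucket.patch()
```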

Create Service Account

  1. Create a Service Account:

    • Search for and go to Service Accounts in the IAM & Admin section.
    • Create a new service account named mage-zoomcamp (for instance).
    • Assign the Owner role, which grants full access to Google Cloud resources.
      • Note: this broad role is only for the tutorial. In practice, you'd identify and grant only the exact permissions required.
    • Finish the setup and generate a JSON key file, which will be downloaded to your computer. (A quick way to check that the key works is sketched after this list.)
  2. Add Service Account Key to Mage:

    • Place the downloaded JSON key file into your Mage project directory.
      • Note: if you're using a VM and the key was downloaded to your local machine, I'd advise using SFTP to copy it across. If you set up your virtual machine with SSH and a config file, this is as straightforward as running sftp {**hostname from config file**} and then using put to upload the key file into the project directory.
    • Confirm the volume in docker-compose.yml so the credentials are available inside the Mage container.
      • Recall that the docker-compose.yml defines two services: one for Mage and one for Postgres. Inside the Mage service's configuration there is a volume mounted like so: .:/home/src. This means anything inside the project directory on the host is visible to Mage inside the container at /home/src, so that is where the credentials file needs to live.
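Optionally, before wiring the key into Mage, you can sanity-check it outside of Mage entirely. This isn't part of the video; it's a minimal sketch assuming the google-cloud-storage package is installed, with the key file and bucket names as placeholders.

```python
# pip install google-cloud-storage
from google.cloud import storage

# Placeholders: use your downloaded key file and the bucket created earlier.
client = storage.Client.from_service_account_json('my-service-account-key.json')
bucket = client.bucket('mage-zoomcamp-your-unique-name')

# With the Owner role this should print True; an auth/permissions error here means
# the key or role is wrong before Mage ever enters the picture.
print(bucket.exists())
```

If this prints True, any failure later on is a Mage configuration issue rather than a GCP one.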

Authentication

  1. Authenticate with Mage:
    • Open Mage (i.e., navigate to localhost:6789) and open the io_config file for the project.
    • Choose the key-file-path method of authenticating with the service account, i.e., set GOOGLE_SERVICE_ACC_KEY_FILEPATH rather than pasting the key's contents into the config.
    • Ensure the path you provide is the path inside the container (under the mounted /home/src directory), so Mage can use these credentials to interact with Google Cloud.

BigQuery and Google Cloud Storage Testing

  1. Test BigQuery:

    • Switch to BigQuery in your Mage pipeline configuration.
    • Run a simple query to ensure it connects and retrieves data from BigQuery.
  2. Test Google Cloud Storage:

    • Create an example pipeline that processes and saves a CSV file locally.
    • Upload this file to your Google Cloud Storage bucket.
    • Configure a data loader in Mage to read this file from the bucket and ensure it loads correctly.

With these steps completed, Mage is set up to interact with both Google Cloud Storage and BigQuery, allowing for the creation of complex data pipelines in future modules. Stay tuned for the next video where we'll build these pipelines.

Running Checks

Service Account File Present?

Check that the following is present in the io_config file in Mage: w2_4_1

Note that the path is the path inside the Docker container (i.e., under the volume mounted from the host machine), not the path on the host.
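A common failure mode is pointing io_config at the host path rather than the container path. Here's a quick, throwaway check; the file name below is a placeholder for your own key file.

```python
# Run in a scratch Python block in Mage (or a Python shell inside the Mage container).
# The key must be referenced by its path *inside* the container, i.e. under the
# mounted /home/src directory, not by its location on the host machine.
import os

key_path = '/home/src/my-service-account-key.json'  # placeholder file name
print(os.path.exists(key_path))  # should print True if the mount and path are right
```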

Testing connection to GCP

1. Check we can connect to BigQuery.

Here we'll test the connection by running a SQL query on BigQuery. Since our service account was set up with the Owner role, we should be able to connect to BigQuery. w2_4_2

I selected BigQuery, dev, and "Use raw SQL" for my connection, profile, and option respectively. After running the block, the output will show whether you can connect. In my case, it was successful.
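If you'd rather do this check from a Python block instead of raw SQL, it looks roughly like Mage's BigQuery data loader template (exact import paths vary slightly between Mage versions); a trivial query is enough to prove the credentials work.

```python
from os import path

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def test_bigquery_connection(*args, **kwargs):
    query = 'SELECT 1'  # any query that returns a row proves the connection works
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'dev'  # the profile holding the service account key path

    return BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).load(query)
```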

2. Check we can connect to GCS for I/O operations:

  1. Read data into Mage from GCS:

    1. Upload any CSV to your GCS bucket through the console itself.
    w2_4_3

    Here I've highlighted the two pieces of information we'll need in order to read this CSV into Mage: the bucket_name and the object_key.

    2. In Mage, create a data loader block. Configure it as a Python block and select Google Cloud Storage. This will provide the template we need:
    w2_4_4 w2_4_5

    All I needed to do was supply the bucket_name, the object_key, and change the profile to dev. It successfully read my test file into Mage!
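For reference, the filled-in block looks roughly like the template Mage generates (import paths can differ slightly between Mage versions); the bucket_name and object_key below are placeholders for the values highlighted in the screenshot above.

```python
from os import path

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.google_cloud_storage import GoogleCloudStorage

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_from_google_cloud_storage(*args, **kwargs):
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'dev'  # the profile that holds the service account key path

    bucket_name = 'your-bucket-name'  # placeholder: the bucket created earlier
    object_key = 'your_file.csv'      # placeholder: the name of the uploaded object

    return GoogleCloudStorage.with_config(ConfigFileLoader(config_path, config_profile)).load(
        bucket_name,
        object_key,
    )


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'
```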

  2. Load data from Mage to GCS

    1. Let's take the existing block, add a column, and then write it back to GCS. To do this, we'll add a transformer block, connect it to the block we've created, and then configure it to add a column. Lastly, we'll create a data_exporter block and write out to GCS: w2_4_6 w2_4_7 w2_4_8

    In the pictures above you can see the configuration of each step. In the output of the data exporter and in GCS we can see a file was successfully created! This means we can access GCS for I/O and we can also access BigQuery 😌.
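For reference, the transformer and exporter are sketched below. The new column name and the output object_key are placeholders, and the exporter is essentially Mage's Google Cloud Storage data exporter template with the profile switched to dev (import paths may differ slightly between Mage versions).

```python
# Transformer block: add a column to the DataFrame coming from the data loader.
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def add_example_column(data, *args, **kwargs):
    data['example_col'] = 1  # 'example_col' is a placeholder; any new column works for this test
    return data
```

Connect that transformer to the loader, then attach a data exporter block like this:

```python
# Data exporter block: write the transformed DataFrame back to GCS.
from os import path

from pandas import DataFrame

from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.google_cloud_storage import GoogleCloudStorage

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_google_cloud_storage(df: DataFrame, **kwargs) -> None:
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'dev'

    bucket_name = 'your-bucket-name'   # placeholder: the same bucket as before
    object_key = 'your_output.csv'     # placeholder: name of the file to create in GCS

    GoogleCloudStorage.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        bucket_name,
        object_key,
    )
```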

I'll catch you in the next blog, where we'll be performing ETL: API to GCS. We'll level it up by partitioning our data and writing multiple files into GCS simultaneously!