In this tutorial, we'll walk through how a platform team can build a guided experience for creating modern, streaming data pipelines. Empower data engineers to spin up Pub/Sub, Cloud Run Functions, and Cloud Storage via a guided UI that emits infrastructure as code.
To follow along with this guide, sign up for a free Resourcely account here. Once you have signed up, navigate to the Foundry.
Goals and outcomes
Consider a company with many data engineers. Perhaps they are using an expensive ETL tool, or they want to move towards event-based data pipelines. In this scenario, these engineers want to deploy new data pipelines but may not be cloud infrastructure experts.
This company's platform team can streamline configuration of these data pipelines by using Resourcely Blueprints and Guardrails. The result will be a guided, UI-based experience for developers that will allow them to generate properly configured infrastructure as code.
Developers: deploy faster, on their own, without mistakes.
Platform teams: create trusted patterns, without being stuck in support & operations work.
Architecture
We will automate data pipeline creation with the following GCP stack:
Pub/Sub for a message queue
Cloud Run Functions for custom code computation and transformation
Cloud Storage to store the results and Cloud Run Function
A similar set of Blueprints and Guardrails could apply to the equivalent AWS services: SNS, Lambda, and S3.
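For reference, here is a minimal sketch of that AWS analog in Terraform. All names, the runtime, and the handler entry point are illustrative assumptions, not part of this tutorial's GCP example:
resource "aws_sns_topic" "pipeline" {
  name = "pipeline-topic"
}

resource "aws_s3_bucket" "results" {
  bucket = "example-pipeline-results" # Bucket names must be globally unique
}

resource "aws_iam_role" "lambda" {
  name = "pipeline-lambda-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_lambda_function" "transform" {
  function_name = "pipeline-transform"
  role          = aws_iam_role.lambda.arn
  runtime       = "nodejs18.x"
  handler       = "index.handler" # Assumed entry point inside the zipped source
  filename      = "function-source.zip"
}

# Allow SNS to invoke the function, then subscribe the function to the topic
resource "aws_lambda_permission" "sns" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.transform.function_name
  principal    = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.pipeline.arn
}

resource "aws_sns_topic_subscription" "lambda" {
  topic_arn = aws_sns_topic.pipeline.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.transform.arn
}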
Terraform Example
The following Terraform code creates a one-off data pipeline. Using it directly would require developers who are comfortable with Terraform and the cloud service options that can be set within it.
Original Terraform code
terraform {
required_providers {
google = {
source = "hashicorp/google"
version = ">= 4.34.0"
}
}
}
resource "random_id" "bucket_prefix" {
byte_length = 8
}
resource "google_service_account" "default" {
account_id = "test-gcf-sa"
display_name = "Test Service Account"
}
resource "google_pubsub_topic" "default" {
name = "functions2-topic"
}
resource "google_storage_bucket" "default" {
name = "${random_id.bucket_prefix.hex}-gcf-source" # Every bucket name must be globally unique
location = "US"
uniform_bucket_level_access = true
}
data "archive_file" "default" {
type = "zip"
output_path = "/tmp/function-source.zip"
source_dir = "function-source/"
}
resource "google_storage_bucket_object" "default" {
name = "function-source.zip"
bucket = google_storage_bucket.default.name
source = data.archive_file.default.output_path # Path to the zipped function source code
}
resource "google_cloudfunctions2_function" "default" {
name = "function"
location = "us-central1"
description = "a new function"
build_config {
runtime = "nodejs16"
entry_point = "helloPubSub" # Set the entry point
environment_variables = {
BUILD_CONFIG_TEST = "build_test"
}
source {
storage_source {
bucket = google_storage_bucket.default.name
object = google_storage_bucket_object.default.name
}
}
}
service_config {
max_instance_count = 3
min_instance_count = 1
available_memory = "256M"
timeout_seconds = 60
environment_variables = {
SERVICE_CONFIG_TEST = "config_test"
}
ingress_settings = "ALLOW_INTERNAL_ONLY"
all_traffic_on_latest_revision = true
service_account_email = google_service_account.default.email
}
event_trigger {
trigger_region = "us-central1"
event_type = "google.cloud.pubsub.topic.v1.messagePublished"
pubsub_topic = google_pubsub_topic.default.id
retry_policy = "RETRY_POLICY_RETRY"
}
}
This code creates the following resources:
Google Service Account
Google Pub/Sub Topic
Google Storage Bucket
Google Storage Bucket Object
Google Cloud Run Function
Inside the bucket, a .zip file is stored that contains the code the Cloud Run Function will execute. The Terraform itself is very rigid as-is:
Assumes node.js code for the function
Hard codes a variety of parameters
256 megabytes of memory
60 second timeout
Maximum of 3 instances
Hardcoded environment variables
Converting to Resourcely Blueprint
We will take this code and turn it into a dynamic, interactive template that developers can deploy from a UI. Here's a preview of what the generated UI will look like:
Let's walk through our resources step by step and convert them into a Resourcely Blueprint!
General Purpose Blueprint Code
First, we'll add some general purpose variables to our Blueprint. Resourcely Blueprint code begins with frontmatter, where variables and their tags are defined, followed by templated Terraform that references those frontmatter variables.
Here, we create a name and location variable and a special __name constant. The desc, required, suggest, and group tags will all impact behavior in the resulting generated UI:
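---
constants:
  __name: "{{ name }}_{{ __guid }}"
variables:
  name:
    desc: |
      **The base name for all resources in this Blueprint.**
      - Must be unique within the GCP project
    required: true
    suggest: "my-cloud-function"
    group: General
  location:
    desc: |
      **The location for all GCP resources in this Blueprint.**
      - Must be a valid GCP region
    required: true
    suggest: "us-central1"
    group: General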
Variables can be referenced multiple times throughout a Blueprint. You'll see {{ name }} and {{ __name }} used repeatedly; each reuse draws from the same single UI field.
At the end of the frontmatter we also define groups, which logically organize the input fields:
Groups in frontmatter
groups:
General:
desc: |
General configuration for the Blueprint, including resource names and locations.
order: 1
Service Account:
desc: |
Configuration related to the Google Cloud service account used by the Cloud Function.
order: 2
Pub/Sub:
desc: |
Configuration for the Pub/Sub topic that triggers the Cloud Function.
order: 3
Storage:
desc: |
Configuration for the storage bucket used to store Cloud Function source code.
order: 4
Archive:
desc: |
Configuration for archiving the Cloud Function source code into a deployable format.
order: 5
Cloud Function:
desc: |
Configuration for the Cloud Function, including runtime settings and environment variables.
order: 6
Advanced:
desc: |
Advanced configuration options for testing, ingress settings, and retry policies.
order: 7
Service Accounts
We'll now move to the service account resource. A service account gives the Cloud Run Function its own identity within GCP, keeping its permissions separate from those of other resources.
// Frontmatter
service_account_name:
desc: |
**The name of the service account to be created.**
- Must be unique within the project
required: true
suggest: "test-gcf-sa"
group: Service Account
// Inline variable reference
resource "google_service_account" "{{ __name }}" {
account_id = "{{ service_account_name }}"
display_name = "Service Account for {{ name }}"
}
Notice that the google_service_account resource references {{ __name }} for a globally unique name. {{ service_account_name }}, which we also defined in the frontmatter, is then referenced for the account_id parameter.
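For illustration, if a developer entered my-cloud-function as the name and accepted the suggested account ID, the rendered Terraform would look roughly like this (the a1b2c3 suffix stands in for the generated GUID, whose exact format is an assumption):
resource "google_service_account" "my-cloud-function_a1b2c3" {
  account_id   = "test-gcf-sa" # From the service_account_name field
  display_name = "Service Account for my-cloud-function"
}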
Here's what the UI looks like for the variable we defined:
Pub/Sub
Pub/Sub is relatively simple, although we'll come back to this later to make it more secure.
// Frontmatter
pubsub_topic_name:
desc: |
**The name of the Pub/Sub topic to be created.**
- Must be unique within the project
required: true
suggest: "functions2-topic"
group: Pub/Sub
// Inline
resource "google_pubsub_topic" "{{ __name }}" {
name = "{{ pubsub_topic_name }}"
}
resource "google_pubsub_topic" "default" {
name = "functions2-topic"
}
Note that we've introduced a single variable called pubsub_topic_name. Our description and suggestion give the user critical context they may have been missing:
Name must be unique, but only within the project
The suggestion lets the user know the expected format
Storage Bucket and Bucket Object
Our Storage Bucket is for hosting function results, as well as the function code that will be executed. The original Terraform code had many hard-coded fields without much guidance:
File type and path for the function
Location
// Frontmatter
bucket_prefix:
desc: |
**The prefix for the storage bucket name.**
- Ensures globally unique bucket names
required: true
suggest: "random-prefix"
group: Storage
bucket_object_name:
desc: |
**The name of the object stored in the bucket.**
- Typically the zip file for Cloud Function source code
required: true
suggest: "function-source.zip"
group: Storage
archive_file_type:
desc: |
**The archive file type.**
- Must be a valid `zip`
required: true
suggest: "zip"
group: Archive
archive_output_path:
desc: |
**The output path for the archived file.**
- Path where the archive will be stored
required: true
suggest: "/tmp/function-source.zip"
group: Archive
archive_source_dir:
desc: |
**The source directory for the archived file.**
- Path to the directory containing the function source code
required: true
suggest: "function-source/"
group: Archive
// Inline
resource "google_storage_bucket" "{{ __name }}" {
name = "{{ bucket_prefix }}-gcf-source"
location = "{{ location }}"
uniform_bucket_level_access = true
}
data "archive_file" "{{ __name }}" {
type = "{{ archive_file_type }}"
output_path = "{{ archive_output_path }}"
source_dir = "{{ archive_source_dir }}"
}
resource "google_storage_bucket_object" "{{ __name }}" {
name = "{{ bucket_object_name }}"
bucket = google_storage_bucket.{{ __name }}.name
source = data.archive_file.{{ __name }}.output_path
}
resource "google_storage_bucket" "default" {
name = "${random_id.bucket_prefix.hex}-gcf-source" # Every bucket name must be globally unique
location = "US"
uniform_bucket_level_access = true
}
data "archive_file" "default" {
type = "zip"
output_path = "/tmp/function-source.zip"
source_dir = "function-source/"
}
resource "google_storage_bucket_object" "default" {
name = "function-source.zip"
bucket = google_storage_bucket.default.name
source = data.archive_file.default.output_path # Path to the zipped function source code
}
The Resourcely Blueprint code introduces variables for input while giving guidance to the user. Note also the reuse of {{ location }}, a variable defined earlier; this way, the user doesn't have to choose a location multiple times.
Cloud Run Function
Finally, we come to our Cloud Run Function, where we introduce the most guidance and flexibility for users. In this section, we focus on enumerating the options users can choose from, both in description fields and in the form's pick lists.
Function runtime: A user may not be familiar with node.js, and may not know how to use Python code for their function instead.
Timeout: Flexibility to choose a timeout, and guidance on the length of time
Function entry point: Ensuring the user knows that their code needs an entry point, and that it matches their code
Available memory: The amount of RAM to dedicate to the function, with an indicative format to guide users
Ingress: Optional ingress settings, and the available options
Retry policy: The policy on retries, and the two possible options
Environment variables: Optional environment variables the users can take advantage of in their code.
// Frontmatter
function_runtime:
desc: |
**The runtime for the Cloud Function.**
- Default: `nodejs16`
- Other options: `nodejs14`, `python39`, `go116`, ...
- See [GCP Supported Runtimes](https://cloud.google.com/functions/docs/concepts/exec#runtimes)
required: true
suggest: "nodejs16"
group: Cloud Function
function_entry_point:
desc: |
**The entry point for the Cloud Function.**
- Matches the exported function in your source code
required: true
suggest: "helloPubSub"
group: Cloud Function
available_memory:
desc: |
**The memory available for the Cloud Function.**
- Default: `256M`
- Other options: `128M`, `512M`, `1G`, ...
required: true
suggest: "256M"
group: Cloud Function
timeout_seconds:
desc: |
**The timeout for the Cloud Function.**
- Default: `60`
- Maximum: `540`
required: true
suggest: 60
group: Cloud Function
build_config_test:
desc: |
**Test environment variable for the build configuration.**
- Used for advanced build testing
- Feel free to change to match your function
required: false
suggest: "build_test"
group: Advanced
service_config_test:
desc: |
**Test environment variable for the service configuration.**
- Used for advanced service testing
- Feel free to change to match your function
required: false
suggest: "config_test"
group: Advanced
ingress_settings:
desc: |
**Ingress settings for the Cloud Function.**
- Default: `ALLOW_INTERNAL_ONLY`
- Other options: `ALLOW_ALL`, `ALLOW_INTERNAL_AND_GCLB`
required: false
suggest: "ALLOW_INTERNAL_ONLY"
group: Advanced
retry_policy:
desc: |
**Retry policy for the Cloud Function event trigger.**
- Default: `RETRY_POLICY_RETRY`
- Other options: `RETRY_POLICY_DO_NOT_RETRY`
required: false
suggest: "RETRY_POLICY_RETRY"
group: Advanced
// Inline
resource "google_cloudfunctions2_function" "{{ __name }}" {
name = "{{ name }}"
location = "{{ location }}"
description = "A new function for {{ name }}"
build_config {
runtime = "{{ function_runtime }}"
entry_point = "{{ function_entry_point }}"
environment_variables = {
BUILD_CONFIG_TEST = "{{ build_config_test }}"
}
source {
storage_source {
bucket = google_storage_bucket.{{ __name }}.name
object = google_storage_bucket_object.{{ __name }}.name
}
}
}
service_config {
max_instance_count = 3
min_instance_count = 1
available_memory = "{{ available_memory }}"
timeout_seconds = {{ timeout_seconds }}
environment_variables = {
SERVICE_CONFIG_TEST = "{{ service_config_test }}"
}
ingress_settings = "{{ ingress_settings }}"
all_traffic_on_latest_revision = true
service_account_email = google_service_account.{{ __name }}.email
}
event_trigger {
trigger_region = "{{ location }}"
event_type = "google.cloud.pubsub.topic.v1.messagePublished"
pubsub_topic = google_pubsub_topic.{{ __name }}.id
retry_policy = "{{ retry_policy }}"
}
}
Putting all of our Blueprint code together looks like the following. You can paste it directly into the Resourcely Foundry to publish it and get started immediately.
Full Blueprint Code
---
constants:
__name: "{{ name }}_{{ __guid }}"
variables:
name:
desc: |
**The base name for all resources in this Blueprint.**
- Must be unique within the GCP project
required: true
suggest: "my-cloud-function"
group: General
location:
desc: |
**The location for all GCP resources in this Blueprint.**
- Must be a valid GCP region
required: true
suggest: "us-central1"
group: General
service_account_name:
desc: |
**The name of the service account to be created.**
- Must be unique within the project
required: true
suggest: "test-gcf-sa"
group: Service Account
pubsub_topic_name:
desc: |
**The name of the Pub/Sub topic to be created.**
- Must be unique within the project
required: true
suggest: "functions2-topic"
group: Pub/Sub
bucket_prefix:
desc: |
**The prefix for the storage bucket name.**
- Ensures globally unique bucket names
required: true
suggest: "random-prefix"
group: Storage
bucket_object_name:
desc: |
**The name of the object stored in the bucket.**
- Typically the zip file for Cloud Function source code
required: true
suggest: "function-source.zip"
group: Storage
archive_file_type:
desc: |
**The archive file type.**
- Must be a valid `zip`
required: true
suggest: "zip"
group: Archive
archive_output_path:
desc: |
**The output path for the archived file.**
- Path where the archive will be stored
required: true
suggest: "/tmp/function-source.zip"
group: Archive
archive_source_dir:
desc: |
**The source directory for the archived file.**
- Path to the directory containing the function source code
required: true
suggest: "function-source/"
group: Archive
function_runtime:
desc: |
**The runtime for the Cloud Function.**
- Default: `nodejs16`
- Other options: `nodejs14`, `python39`, `go116`, ...
- See [GCP Supported Runtimes](https://cloud.google.com/functions/docs/concepts/exec#runtimes)
required: true
suggest: "nodejs16"
group: Cloud Function
function_entry_point:
desc: |
**The entry point for the Cloud Function.**
- Matches the exported function in your source code
required: true
suggest: "helloPubSub"
group: Cloud Function
available_memory:
desc: |
**The memory available for the Cloud Function.**
- Default: `256M`
- Other options: `128M`, `512M`, `1G`, ...
required: true
suggest: "256M"
group: Cloud Function
timeout_seconds:
desc: |
**The timeout for the Cloud Function.**
- Default: `60`
- Maximum: `540`
required: true
suggest: 60
group: Cloud Function
build_config_test:
desc: |
**Test environment variable for the build configuration.**
- Used for advanced build testing
- Feel free to change to match your function
required: false
suggest: "build_test"
group: Advanced
service_config_test:
desc: |
**Test environment variable for the service configuration.**
- Used for advanced service testing
- Feel free to change to match your function
required: false
suggest: "config_test"
group: Advanced
ingress_settings:
desc: |
**Ingress settings for the Cloud Function.**
- Default: `ALLOW_INTERNAL_ONLY`
- Other options: `ALLOW_ALL`, `ALLOW_INTERNAL_AND_GCLB`
required: false
suggest: "ALLOW_INTERNAL_ONLY"
group: Advanced
retry_policy:
desc: |
**Retry policy for the Cloud Function event trigger.**
- Default: `RETRY_POLICY_RETRY`
- Other options: `RETRY_POLICY_DO_NOT_RETRY`
required: false
suggest: "RETRY_POLICY_RETRY"
group: Advanced
groups:
General:
desc: |
General configuration for the Blueprint, including resource names and locations.
order: 1
Service Account:
desc: |
Configuration related to the Google Cloud service account used by the Cloud Function.
order: 2
Pub/Sub:
desc: |
Configuration for the Pub/Sub topic that triggers the Cloud Function.
order: 3
Storage:
desc: |
Configuration for the storage bucket used to store Cloud Function source code.
order: 4
Archive:
desc: |
Configuration for archiving the Cloud Function source code into a deployable format.
order: 5
Cloud Function:
desc: |
Configuration for the Cloud Function, including runtime settings and environment variables.
order: 6
Advanced:
desc: |
Advanced configuration options for testing, ingress settings, and retry policies.
order: 7
---
# Service Account
resource "google_service_account" "{{ __name }}" {
account_id = "{{ service_account_name }}"
display_name = "Service Account for {{ name }}"
}
# Pub/Sub Topic
resource "google_pubsub_topic" "{{ __name }}" {
name = "{{ pubsub_topic_name }}"
}
# Storage Bucket
resource "google_storage_bucket" "{{ __name }}" {
name = "{{ bucket_prefix }}-gcf-source"
location = "{{ location }}"
uniform_bucket_level_access = true
}
# Archive File
data "archive_file" "{{ __name }}" {
type = "{{ archive_file_type }}"
output_path = "{{ archive_output_path }}"
source_dir = "{{ archive_source_dir }}"
}
# Storage Bucket Object
resource "google_storage_bucket_object" "{{ __name }}" {
name = "{{ bucket_object_name }}"
bucket = google_storage_bucket.{{ __name }}.name
source = data.archive_file.{{ __name }}.output_path
}
# Cloud Function
resource "google_cloudfunctions2_function" "{{ __name }}" {
name = "{{ name }}"
location = "{{ location }}"
description = "A new function for {{ name }}"
build_config {
runtime = "{{ function_runtime }}"
entry_point = "{{ function_entry_point }}"
environment_variables = {
BUILD_CONFIG_TEST = "{{ build_config_test }}"
}
source {
storage_source {
bucket = google_storage_bucket.{{ __name }}.name
object = google_storage_bucket_object.{{ __name }}.name
}
}
}
service_config {
max_instance_count = 3
min_instance_count = 1
available_memory = "{{ available_memory }}"
timeout_seconds = {{ timeout_seconds }}
environment_variables = {
SERVICE_CONFIG_TEST = "{{ service_config_test }}"
}
ingress_settings = "{{ ingress_settings }}"
all_traffic_on_latest_revision = true
service_account_email = google_service_account.{{ __name }}.email
}
event_trigger {
trigger_region = "{{ location }}"
event_type = "google.cloud.pubsub.topic.v1.messagePublished"
pubsub_topic = google_pubsub_topic.{{ __name }}.id
retry_policy = "{{ retry_policy }}"
}
}
Resulting UI
We now have a fully functioning form that developers can interact with. As covered above, the form has extensive guidance around possible input values, minimums and maximums, formatting, and other tips. When a user fills out this form, their infrastructure as code (and data pipeline!) will be created for them automatically and deployed using your existing CI/CD process.
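To make this concrete, here is a sketch of part of the Terraform such a submission might emit, assuming the developer accepted all the suggested values (the a1b2c3 suffix is again a made-up stand-in for the generated GUID):
resource "google_pubsub_topic" "my-cloud-function_a1b2c3" {
  name = "functions2-topic"
}

resource "google_storage_bucket" "my-cloud-function_a1b2c3" {
  name                        = "random-prefix-gcf-source"
  location                    = "us-central1"
  uniform_bucket_level_access = true
}

# The service account, archive, bucket object, and Cloud Run Function
# resources are rendered from the template in the same way.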
Adding Guardrails
Now that we have created a streamlined configuration experience, we have unblocked data engineers to create their own data pipelines.
However, you may also want to put stricter controls in place. What if a developer wants to deploy Terraform outside of Resourcely, or you want to require approval for using a language other than node.js in your Cloud Run Function?
If we want to restrict which GCP locations can be used for our data pipeline, we can enforce that with Guardrails. The following Guardrail can be published using the Resourcely Foundry:
GUARDRAIL "[Misc] GCP Allowed Regions"
WHEN google_*
REQUIRE region IN ["US-CENTRAL1"]
OVERRIDE WITH APPROVAL @default
Note the wildcard in the WHEN clause: the location restriction will be enforced for all Google resources. The OVERRIDE WITH APPROVAL clause lets a designated approver (here, @default) grant exceptions.
This Guardrail will manifest itself in two ways:
Exposed as part of the Blueprint form. If a developer wants to deviate, they need to "unlock" the Guardrail.
The region value will be checked during CI, and PRs that don't meet the requirement (i.e., where the Guardrail was unlocked in the form and the value changed) will be blocked and require review
Cloud Run Function Runtime
Guardrails are incredibly flexible; that's their beauty. We can also create a Guardrail that restricts the runtime used specifically for Cloud Run Functions.
GUARDRAIL "Require Python 3.9 for Cloud Run Functions"
WHEN google_cloudfunctions2_function
REQUIRE build_config.runtime = "python39"
This Guardrail prevents the user from selecting anything but Python 3.9. Note that it conflicts with our Blueprint's nodejs16 suggestion, so in practice you would update that suggest value to python39 as well:
Conclusion
Engineers of all types are looking for tools that help them move faster. Cloud infrastructure is a complex, nuanced domain that requires expertise and guidance, usually provided by platform teams.
With Resourcely, we turned a potentially confusing Terraform example into a guided experience, transforming data pipeline deployment from a headache into a breeze.