Data Processing using Google Cloud Functions

Overview of Google Cloud Functions

Google Cloud Functions is a serverless compute offering on Google Cloud Platform that lets you run event-driven code without managing servers. Key benefits include:

  • Simplified developer experience and increased developer velocity
  • Pay only for what you use
  • Avoid lock-in with open technology

You can follow this page to get more details about Google Cloud Functions.

Create First Google Cloud Function using Python

Here are the instructions to create the first Google Cloud Function using Python.

  • Search for Cloud Function in the Google Cloud console.
  • Follow the steps as demonstrated to create the first Google Cloud Function.
  • Name: file_format_converter
  • Runtime: Python 3.9
  • Trigger Type: Cloud Storage (the default is HTTP). A minimal sketch of the generated entry point follows this list.
  • Review the default memory and timeout settings.
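When the Cloud Storage trigger type is selected, the console generates a Python background-function stub that receives the object metadata as an event. The sketch below is an illustrative, minimal version of such an entry point; the function name hello_gcs and the log line are assumptions for illustration, not part of the course material.

def hello_gcs(event, context):
    # event carries the Cloud Storage object metadata; 'name' is the object path.
    file_name = event['name']
    print(f'Processing file: {file_name}')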

Run and Validate Google Cloud Function

Here are the instructions to run and validate the Google Cloud Function.

  • Go to the Testing tab and add this JSON payload, which mimics the object metadata the function receives:
{
    "name": "testing.csv"
}

Review File Format Converter Logic using Pandas

TBD: Need to make the logic dynamic.

  • Source: landing/retail_db/orders
  • Source File Format: CSV
  • Schema: landing/retail_db/schemas.json
  • Target: bronze/retail_db/orders
  • Target File Format: parquet

Here is the design for the file format conversion.

  • The application should take the table name as an argument.
  • It has to read the schema from schemas.json and apply it to the CSV data while creating the Pandas Data Frame.
  • The Data Frame should be written to the target location using the target file format.
  • The source bucket, target bucket, as well as the base folders should be passed as environment variables.

Here is the core logic, reviewed locally before deploying it as a Cloud Function.
import json
import os

import pandas as pd


def get_columns(input_base_dir, ds_name):
    # Read the dataset schema from schemas.json and return the column names.
    with open(f'{input_base_dir}/schemas.json') as fp:
        schemas = json.load(fp)
    return [td['column_name'] for td in schemas[ds_name]]


# Source and target base folders are passed as environment variables.
input_base_dir = os.environ.get('INPUT_BASE_DIR')
output_base_dir = os.environ.get('OUTPUT_BASE_DIR')
ds_name = 'orders'  # TBD: take the table name as an argument to make the logic dynamic

columns = get_columns(input_base_dir, ds_name)
print(columns)

# Convert every CSV file under the dataset folder to Snappy-compressed Parquet.
for file in os.listdir(f'{input_base_dir}/{ds_name}'):
    print(file)
    df = pd.read_csv(f'{input_base_dir}/{ds_name}/{file}', names=columns)
    os.makedirs(f'{output_base_dir}/{ds_name}', exist_ok=True)
    df.to_parquet(f'{output_base_dir}/{ds_name}/{file}.snappy.parquet')
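To sanity-check the local run, the converted files can be read back with Pandas. This is a minimal sketch, assuming the pyarrow engine is installed and the OUTPUT_BASE_DIR folder from above contains only the generated Parquet files.

import os

import pandas as pd

output_base_dir = os.environ.get('OUTPUT_BASE_DIR')
# read_parquet accepts a folder of Parquet files when using the pyarrow engine.
df = pd.read_parquet(f'{output_base_dir}/orders')
print(df.shape)
print(df.head())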

Deploy Inline Application as Google Cloud Function

As we have reviewed the core logic, let us deploy the file format converter as a Google Cloud Function.

  • Create the function with the relevant runtime (Python 3.9).
  • Update requirements.txt with all the required dependencies.
  • Update the program file with the logic to convert the file format. A minimal sketch of how the inline program might look follows this list.
  • Review the configuration and make sure the memory is upgraded to 1 GB (from the default 256 MB).
  • Update the environment variables for the bucket names as well as the base folder names.
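The sketch below is one possible shape for the inline program, not the exact course solution. It assumes requirements.txt lists pandas, pyarrow, gcsfs, and fsspec so that Pandas can read from and write to gs:// paths, that INPUT_BASE_DIR and OUTPUT_BASE_DIR are set to gs:// base folders, and that the table name arrives via the event's name field (as in the Testing payload).

import json
import os

import fsspec
import gcsfs
import pandas as pd


def get_columns(input_base_dir, ds_name):
    # schemas.json sits directly under the source base folder.
    with fsspec.open(f'{input_base_dir}/schemas.json', 'r') as fp:
        schemas = json.load(fp)
    return [td['column_name'] for td in schemas[ds_name]]


def file_format_converter(event, context):
    # Assumed entry-point name; configure it as the function's entry point.
    input_base_dir = os.environ.get('INPUT_BASE_DIR')    # e.g. gs://<source-bucket>/landing/retail_db
    output_base_dir = os.environ.get('OUTPUT_BASE_DIR')  # e.g. gs://<target-bucket>/bronze/retail_db
    ds_name = event['name']                              # table name from the test payload (assumption)
    columns = get_columns(input_base_dir, ds_name)

    fs = gcsfs.GCSFileSystem()
    # fs.ls returns paths without the gs:// prefix, so it is added back below.
    for file in fs.ls(f'{input_base_dir}/{ds_name}'):
        df = pd.read_csv(f'gs://{file}', names=columns)
        file_name = file.split('/')[-1]
        df.to_parquet(f'{output_base_dir}/{ds_name}/{file_name}.snappy.parquet')
        print(f'Converted {file_name} under {ds_name}')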

Run Inline Application as Google Cloud Function

As the File Format Converter is deployed as a Cloud Function, let us go through the details of running it. We will also validate that the Cloud Function is working as expected.

  • Run the Cloud Function by passing the table name as a run-time argument.
  • Review the logs to confirm the Cloud Function executed without any errors.
  • Review the files in GCS at the target location.
  • Use Pandas read_parquet to confirm that the data in the converted files can be read into a Pandas Data Frame (see the sketch after this list).
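A minimal validation sketch, assuming gcsfs and pyarrow are installed in the environment where it runs; the bucket name below is a placeholder for the actual target bucket.

import pandas as pd

# Placeholder bucket name; point this at the actual target location in GCS.
df = pd.read_parquet('gs://<target-bucket>/bronze/retail_db/orders')
print(df.shape)
print(df.head())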

Setup Project for Google Cloud Function

Let us go ahead and set up a project for the Google Cloud Function using VS Code.

  • Create a new project.
  • Create a Python virtual environment using Python 3.9.
  • Add the dependencies for local development to requirements_dev.txt.
  • Add a driver program for the Google Cloud Function. A minimal sketch of such a driver follows this list.
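This is a minimal sketch of a local driver, assuming the Cloud Function entry point lives in main.py, is named file_format_converter, and reads the environment variables and event payload described earlier; the file name driver.py and the bucket names are placeholders.

# driver.py: invoke the Cloud Function entry point locally with a fake event.
import os

from main import file_format_converter

# Placeholder bucket and base folder names for local testing.
os.environ['INPUT_BASE_DIR'] = 'gs://<source-bucket>/landing/retail_db'
os.environ['OUTPUT_BASE_DIR'] = 'gs://<target-bucket>/bronze/retail_db'

# Mirror the payload used under the Testing tab in the console.
event = {'name': 'orders'}
file_format_converter(event, None)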

Build and Deploy Application in GCS as Google Cloud Function

Run Deployed Application in GCS as Google Cloud Function