Synopsis:
- Use BigQuery for data processing and EDA
- Use Vertex AI to train and deploy a custom TensorFlow regression model to predict customer lifetime value (CLV)
- We start with a local BQ and TF workflow and progress toward training and deploying the model in the cloud with Vertex AI
First, we activate Cloud Shell so that we can send commands to Google Cloud. Then, using gcloud, we enable the required services / APIs, e.g. IAM, Monitoring, AI Platform / Vertex AI, Cloud Build, etc.
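A sketch of what enabling the services looks like in Cloud Shell (the exact list of APIs comes from the lab instructions, so treat this as representative):
gcloud services enable \
  compute.googleapis.com \
  iam.googleapis.com \
  monitoring.googleapis.com \
  notebooks.googleapis.com \
  aiplatform.googleapis.com \
  cloudbuild.googleapis.com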
We need to create a Service Account so that the resources we use can authenticate and communicate with each other with the right permissions. We need to give this SA access to:
- Cloud Storage for writing and retrieving Tensorboard logs
- BigQuery data source to read data into TensorFlow model
- Vertex AI for running model training, deployment, and explanation jobs
This can be accomplished via Cloud Shell (i.e. gcloud) or the UI; we will use the gcloud command line.
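Roughly, the gcloud commands look like this (the SA name is a placeholder and the exact roles come from the lab, so this is just a sketch; $PROJECT_ID is assumed to be set):
# placeholder SA name; roles cover Storage, BigQuery, and Vertex AI
gcloud iam service-accounts create vertex-lab-sa --display-name "vertex lab service account"
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:vertex-lab-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/storage.admin"
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:vertex-lab-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/bigquery.admin"
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:vertex-lab-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"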
We create a Vertex AI Workbench notebook instance with a TensorFlow Enterprise environment. It can be understood as a VM with TF and Jupyter Notebook support.
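The equivalent gcloud command is roughly the following (instance name, image family, machine type, and zone are my own placeholders, not the lab's exact values):
# the image family below is an assumption; pick a TensorFlow Enterprise family available in deeplearning-platform-release
gcloud notebooks instances create my-workbench-instance \
  --vm-image-project=deeplearning-platform-release \
  --vm-image-family=tf-ent-2-6-cpu \
  --machine-type=n1-standard-4 \
  --location=us-central1-a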
We open our Notebook instance and clone the GCP training-data-analyst repo, which has the necessary starter code and resources to complete this lab.
We install the dependencies specified in the requirements.txt file.
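In the JupyterLab terminal that is roughly (the path matches the notebook linked further down):
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd training-data-analyst/self-paced-labs/vertex-ai/vertex-ai-qwikstart
pip install --user -r requirements.txt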
Then, open the notebook and follow the instructions given in it.
Essentially, we:
- Download an .xlsx dataset from the internet and convert it to .csv
- Create a Cloud Storage bucket for our data (see the command sketch below)
- Create a BQ dataset and load the CSV data into it
- Use BQ queries to load data into pandas dataframes
- Create a simple baseline model in BQ, using SQL statistical functions (could have also used BQ ML)
- Pass those dataframes to TensorFlow to create a DNN Regressor model
Up to this point we have a 'local' workflow; now we will work toward making it 'cloud native'.
The details are in this notebook: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/self-paced-labs/vertex-ai/vertex-ai-qwikstart/lab_exercise_long.ipynb
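A minimal sketch of the bucket / dataset steps, with placeholder names (the notebook uses its own naming and schema):
gsutil mb -l US gs://${PROJECT_ID}-clv-data   # placeholder bucket name
bq mk --dataset ${PROJECT_ID}:clv_dataset     # placeholder dataset name
bq load --autodetect --source_format=CSV \
  ${PROJECT_ID}:clv_dataset.sales gs://${PROJECT_ID}-clv-data/data.csv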
Created bucket myaseen-qwik-lab2-dataprep
Accept service agreements and so on.
Tasks 3-8 were done in the Dataprep UI.
We can use the service to read data and apply transformations and joins while visually inspecting the data. When we are done, we can run the job and store the result back into Cloud Storage. Dataprep has a DSL called Wrangler for applying transformations to columns and data.
Synopsis: In this lab, you will learn how to create a streaming pipeline using one of Google's Cloud Dataflow templates. More specifically, you will use the Cloud Pub/Sub to BigQuery template, which reads messages written in JSON from a Pub/Sub topic and pushes them to a BigQuery table.
In this lab, we have a choice of either using the Console (UI) or Cloud Shell (command line) to complete the tasks. I chose the command line because, in my opinion, it makes things clearer.
We can use the bq mk command to create the BigQuery dataset and table.
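For example (dataset, table, and schema names are placeholders; the lab specifies its own schema):
bq mk my_dataset
bq mk --table my_dataset.my_table name:STRING,value:INTEGER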
We can use the template and only need to specify the Pub/Sub topic and the output BigQuery table; the template takes care of the rest.
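Launching it from the command line looks roughly like this (job name, topic, and table are placeholders):
gcloud dataflow jobs run my-streaming-job \
  --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --region us-central1 \
  --parameters inputTopic=projects/${PROJECT_ID}/topics/my-topic,outputTableSpec=${PROJECT_ID}:my_dataset.my_table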
What the template is doing: reading JSON messages from the Pub/Sub topic and streaming each one as a row into the BigQuery table.
Once data starts being populated in the BQ Table, we can query against it.
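For example (same placeholder table as above):
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS row_count FROM `my_dataset.my_table`'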
- Dataflow supports both stream and batch workflows, here we worked with Pub/Sub topic to BigQuery -- which is a stream dataflow.
- Dataflow is a fully managed execution service (runner) for Apache Beam pipelines
The Python version of this lab allows us to submit jobs from a local environment, e.g. from a container image. That didn't work for me because the lab's authentication setup was broken.
Synopsis: In this lab you will set up your Python development environment, get the Cloud Dataflow SDK for Python, and run an example pipeline using the Cloud Console
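A sketch of what that ends up looking like (bucket and region are placeholders; the lab's own pipeline may differ):
pip install 'apache-beam[gcp]'
# run the built-in wordcount example on the Dataflow runner
python -m apache_beam.examples.wordcount \
  --project $PROJECT_ID \
  --region us-central1 \
  --runner DataflowRunner \
  --staging_location gs://<bucket>/staging \
  --temp_location gs://<bucket>/tmp \
  --output gs://<bucket>/results/output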
Cloud Dataproc is a fast, easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop clusters.
We need Dataproc API enabled for this.
Synopsis: This lab shows you how to use the Google Cloud Console to create a Google Cloud Dataproc cluster, run a simple Apache Spark job in the cluster, then modify the number of workers in the cluster.
We can easily create a cluster on either Compute Engine or GKE. Here we are using GCE. We can specify machine types for the master and worker nodes. If we want, we can later scale the cluster up and down, e.g. change the number of worker nodes.
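The gcloud equivalent is roughly (cluster name and machine types are placeholders):
gcloud dataproc clusters create example-cluster \
  --region=us-central1 \
  --master-machine-type=n1-standard-4 \
  --worker-machine-type=n1-standard-4 \
  --num-workers=2
# scale the cluster later by changing the worker count
gcloud dataproc clusters update example-cluster --region=us-central1 --num-workers=4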
We can submit a job against the spawned cluster. Here we run the Spark example that computes Pi.
Once the job has been submitted we can see output and details in the Job Details tab.
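Submitting the SparkPi example from the command line looks like this (cluster name matches the sketch above):
gcloud dataproc jobs submit spark \
  --cluster=example-cluster \
  --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000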
Cloud Natural Language API lets you extract information about people, places, events, (and more) mentioned in text documents, news articles, or blog posts. Can be used for sentiment and intent analysis.
Cloud Natural Language API features: syntax analysis, Entity Recognition, Sentiment Analysis, Content Classification (pre-defined categories), Multi-Language, Integrated REST API.
Synopsis: In this lab you'll use the analyze-entities method to ask the Cloud Natural Language API to extract "entities" (e.g. people, places, and events) from a snippet of text.
Console: APIs & Services > Create an API key. This is needed so that we can access resources within our project.
Cmdline:
- Get the project ID:
export GOOGLE_CLOUD_PROJECT=$(gcloud config get-value core/project)
- Create Service Account to access NL API:
gcloud iam service-accounts create my-natlang-sa --display-name "my natural language service account"
- Create credentials to log in as your new service account and save them in a JSON file:
gcloud iam service-accounts keys create ~/key.json \
--iam-account my-natlang-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com
- Set env var that points to the key.
export GOOGLE_APPLICATION_CREDENTIALS="/home/${USER}/key.json"
To make a request, a Compute Engine instance has been provisioned already. (Why can't we make the request directly from Cloud Shell? We can.) We SSH into the instance and run:
gcloud ml language analyze-entities --content="Michelangelo Caravaggio, Italian painter, is known for 'The Calling of Saint Matthew'." > result.json
Where exactly was the SA that we created used?
The Google Cloud Speech API enables easy integration of Google speech recognition technologies into developer applications. The Speech API allows you to send audio and receive a text transcription from the service.
Synopsis: Learn to create an API Key, Create and Call Speech API request
Go to APIs & Services and create an API key. This is needed so that we can access resources within our project; we export the key after the steps below.
- To create an API key, click Navigation menu > APIs & services > Credentials.
- Then click Create credentials.
- In the drop down menu, select API key.
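Export the key in Cloud Shell so the curl request further down can reference it (the value is whatever the Console generated):
export API_KEY=<YOUR_API_KEY>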
Create a request.json file with the required format:
{
  "config": {
    "encoding": "FLAC",
    "languageCode": "en-US"
  },
  "audio": {
    "uri": "gs://cloud-samples-tests/speech/brooklyn.flac"
  }
}
In config, you tell the Speech API how to process the request. The encoding parameter tells the API which type of audio encoding you're using for the file being sent to the API; FLAC is the encoding type for .flac files. In the audio object, you pass the API the URI of the audio file in Cloud Storage.
Query the Speech API endpoint, providing the JSON file as data. It returns a JSON response:
curl -s -X POST -H "Content-Type: application/json" --data-binary @request.json \
"https://speech.googleapis.com/v1/speech:recognize?key=${API_KEY}"
Google Cloud Video Intelligence makes videos searchable and discoverable by extracting metadata with an easy-to-use REST API. You can now search every moment of every video file in your catalog. It helps you identify key entities (nouns) within your video, and when they occur within the video. Separate signal from noise by retrieving relevant information from the entire video, shot by shot, or per frame.
In Cloud Shell:
- create SA:
gcloud iam service-accounts create <sa-name>
- Create an SA key file:
gcloud iam service-accounts keys create key.json --iam-account <sa-name>@<your-project-123>.iam.gserviceaccount.com
- Authenticate your SA:
gcloud auth activate-service-account --key-file key.json
- Obtain auth token:
gcloud auth print-access-token
- create a JSON request file:
cat > request.json <<EOF
{
  "inputUri": "gs://spls/gsp154/video/train.mp4",
  "features": [
    "LABEL_DETECTION"
  ]
}
EOF
- Use curl to make a videos:annotate request, passing the filename of the entity request:
curl -s -H 'Content-Type: application/json' \
-H 'Authorization: Bearer '$(gcloud auth print-access-token)'' \
'https://videointelligence.googleapis.com/v1/videos:annotate' \
-d @request.json
- To check progress and response:
curl -s -H 'Content-Type: application/json' \
-H 'Authorization: Bearer '$(gcloud auth print-access-token)'' \
'https://videointelligence.googleapis.com/v1/projects/PROJECTS/locations/LOCATIONS/operations/OPERATION_NAME'
Use the Dataflow batch template Text Files on Cloud Storage to BigQuery under "Process Data in Bulk (batch)" to transfer data from a Cloud Storage bucket to a BQ Table.
- Create a bucket: gsutil mb gs://<name>
- Create the BQ dataset/table: bq mk (or from the UI)
- Create the Dataflow job from the template and run it (a gcloud sketch is below)
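From the command line, launching that template looks roughly like this (all paths and names are placeholders; the parameter names are as I recall them for the classic GCS_Text_to_BigQuery template, so double-check them against the template's UI):
gcloud dataflow jobs run my-batch-job \
  --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
  --region us-central1 \
  --parameters \
javascriptTextTransformGcsPath=gs://<name>/transform.js,\
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://<name>/schema.json,\
inputFilePattern=gs://<name>/input.csv,\
outputTable=<project>:<dataset>.<table>,\
bigQueryLoadingTemporaryDirectory=gs://<name>/tmp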
Dataproc is essentially a managed Hadoop and Spark cluster. We create a cluster: Dataproc > Create Cluster > On Compute Engine. Once created, we log in to the worker node and put the necessary data files there. Then go to the cluster's page > Submit Job > Job Type = Spark, and provide the path to the JAR files containing the code to run, plus any arguments, etc.
{
  "config": {
    "encoding": "FLAC",
    "languageCode": "en-US"
  },
  "audio": {
    "uri": "gs://cloud-training/gsp323/task3.flac"
  }
}
export API_KEY=<key created in the UI>
curl -s -X POST -H "Content-Type: application/json" --data-binary @request.json \
"https://speech.googleapis.com/v1/speech:recognize?key=${API_KEY}" > task3-gcs-554.result
export LAB_BUCKET=gs://<name>
gsutil cp task3-gcs-554.result $LAB_BUCKET/
export TEXT="Old Norse texts portray Odin as one-eyed and long-bearded, frequently wielding a spear named Gungnir and wearing a cloak and a broad hat."
gcloud ml language analyze-entities --content="$TEXT" > task4-cnl-number.result
gsutil cp task4-cnl-number.result $LAB_BUCKET/