My Google Cloud Skill Boost Public Profile
GCP uses 'Project' to organize resources. AFAIR there isn't an exact equivalent in AWS.
Cloud infrastructure should be Scalable, Available, Reliable, and Secure, and handle the 4 Vs of big data (volume, velocity, variety, veracity).
Infrastructure parts: Compute, Storage, Big Data, ML. Compute and Storage can scale independently based on need; they are decoupled.
Compute offerings:
- Compute Engine: IaaS offering; gives compute and some storage (like EC2?)
- GKE: containerized applications
- App Engine: fully managed PaaS
- Cloud Functions: FaaS
- Cloud Run: fully managed platform, auto scale (like Fargate?)
Storage and DB: Right storage depends on data and business need
- Cloud Storage (unstructured; has storage classes like S3; see the sketch after this list)
- Bigtable (analytical, NoSQL; for real-time high-throughput apps, like HBase?)
- Cloud SQL (transactional, relational, local/regional scale; like RDS/Aurora?)
- Spanner (transactional, relational, global scale)
- Firestore (transactional, NoSQL document-oriented DB; like DynamoDB)
- BigQuery (analytical, SQL, data warehouse solution; like Hive/Athena?)
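A minimal sketch of the Cloud Storage point above, assuming the google-cloud-storage Python client; the project, bucket, and object names are made up, and NEARLINE is just one example storage class:

```python
from google.cloud import storage

# Hypothetical project/bucket/object names; needs google-cloud-storage
# and application-default credentials.
client = storage.Client(project="my-project")

# Buckets carry a default storage class (STANDARD, NEARLINE, COLDLINE,
# ARCHIVE), much like S3 storage classes.
bucket = client.bucket("my-data-lake-bucket")
bucket.storage_class = "NEARLINE"
bucket = client.create_bucket(bucket, location="us-central1")

# Upload an unstructured object (a file) into the bucket.
blob = bucket.blob("raw/events-2024-01-01.json")
blob.upload_from_filename("events.json")
print(blob.public_url)
```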
Big Data and ML product Categories:
- Ingestion and processing (stream and batch): Pub/Sub, Dataflow (streaming), Dataproc, Cloud Data Fusion
- Storage: Cloud Storage, Bigtable, BigQuery, Firestore, Spanner
- Analytics: BigQuery, Data Studio, Looker
- ML: Vertex AI, AutoML
Data lake storage: the storage layer for a data lake needs to account for the nature of the data being ingested and the purpose it will serve. The image below provides a decision tree for storage service selection based on these considerations.
Also relevant are the following two blog posts comparing the different storage offerings:
- Design an optimal storage strategy for your cloud workload
- A map of storage options in Google Cloud
And for compute: Choosing the right compute option in GCP: a decision tree
Pub/Sub:
- Distributed, message-oriented architecture
- Ensures at-least-once delivery
- Supports many different inputs and outputs
- Can deliver/broadcast messages to subscribers, e.g. Dataflow (see the sketch below)
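A minimal publish/subscribe sketch with the google-cloud-pubsub Python client; the project, topic, and subscription names are placeholders. The explicit ack() is what backs the at-least-once guarantee:

```python
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

project = "my-project"  # hypothetical project/topic/subscription names

# Publisher side: publish() returns a future resolving to a message ID.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project, "my-topic")
future = publisher.publish(topic_path, b"sensor reading", origin="sensor-42")
print("published", future.result())

# Subscriber side: a message is redelivered until it is ack()ed,
# which is what "at least once" delivery means in practice.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project, "my-subscription")

def callback(message):
    print("received", message.data)
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # demo: listen for 30 seconds
except TimeoutError:
    streaming_pull.cancel()
```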
Dataflow:
- Dataflow can handle both stream and batch data. It is mainly a service for ETL (like Glue jobs?).
- Apache Beam can be used for pipeline design and provides pipeline templates.
- Choosing an execution engine to run the pipeline: Dataflow is a fully managed service for running Beam pipelines. It is serverless and NoOps (auto-scaling, graph optimization, etc.).
- Dataflow has templates for Streaming, Batch, and Utility jobs, which give a starting point for common use cases. For example, we can ingest data from Pub/Sub into Dataflow and then load it into BigQuery (sketched below).
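A hedged sketch of that Pub/Sub to BigQuery pattern with the Beam Python SDK; the topic, table, and schema are invented. Run locally with the default DirectRunner, or pass --runner=DataflowRunner to execute it on Dataflow:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical names; add --runner=DataflowRunner --project=my-project
# --region=us-central1 --temp_location=gs://my-bucket/tmp for Dataflow.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-topic")
        # Assumes each message is a JSON object matching the schema below.
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.events",
            schema="event_id:STRING,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```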
Visualization:
- Once data is in a database like BigQuery, we can use Looker to visualize it and create dashboards.
- Another tool is Data Studio. It is easier to use for non-experts and doesn't require creating the connectors that Looker does. Steps: choose a template, link the dashboard to a data source (e.g. BQ), explore the dashboard.
BigQuery (like Amazon Athena?):
- Fully managed serverless solution; provides Storage (datasets/tables) and Analytics (via SQL), and has built-in ML
- BQ connects with AutoML and Vertex AI Workbench, i.e. we can load/store datasets in BQ
- BQ can also query external data sources, e.g. Cloud Storage, Cloud Spanner, Cloud SQL, and even AWS and Azure
Three patterns to load data (sketched below): batch load, streaming, and results generated from queries.
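A hedged sketch of all three patterns with the google-cloud-bigquery Python client; the project, dataset, table, and GCS URI are made up:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project
table_id = "my-project.my_dataset.events"

# 1) Batch load from Cloud Storage.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/events-*.json", table_id, job_config=job_config)
load_job.result()  # wait for the batch load to finish

# 2) Streaming insert, row by row.
errors = client.insert_rows_json(table_id, [{"event_id": "e1", "payload": "x"}])
print(errors or "streamed ok")

# 3) Results generated from queries: a query writes to a destination table.
query_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.daily_counts")
client.query(
    "SELECT DATE(ts) AS d, COUNT(*) AS n "
    "FROM `my-project.my_dataset.events` GROUP BY d",
    job_config=query_config,
).result()
```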
BigQuery ML: can train models via queries, do transforms like one-hot encoding, predict with queries, tune hyperparameters, etc. It also supports loading TensorFlow models and exporting BQ models. Kinda like AWS Athena.
BQ ML project phases:
- ETL into BigQuery (has connectors to other Google services)
- Pre-process features (using SQL)
- Create the model inside BigQuery (CREATE MODEL statement)
- Evaluate the performance of the model (ML.EVALUATE)
- Use the model to predict on new data (ML.PREDICT(model, data))
BQ ML Key Commands:
CREATE OR REPLACE MODEL `my_dataset.model_name`
OPTIONS(model_type='logistic_reg', input_label_cols=['label_col']) AS
<training dataset query>
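A hedged end-to-end sketch of those phases, driving the BQ ML statements from the Python client; the dataset, model, and column names are invented:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical names throughout

# Create (train) a model inside BigQuery.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.purchase_model`
    OPTIONS(model_type='logistic_reg', input_label_cols=['purchased']) AS
    SELECT country, pageviews, purchased
    FROM `my_dataset.sessions`
""").result()

# Evaluate it (uses the automatic evaluation split when no data is given).
for row in client.query(
        "SELECT * FROM ML.EVALUATE(MODEL `my_dataset.purchase_model`)"):
    print(dict(row))

# Predict on new data.
for row in client.query("""
    SELECT *
    FROM ML.PREDICT(MODEL `my_dataset.purchase_model`,
                    (SELECT country, pageviews FROM `my_dataset.new_sessions`))
"""):
    print(dict(row))
```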
Comparison of types of data engineering workloads and GCP solutions:
Source: Architect your data lake on Google Cloud with Data Fusion and Composer
GCP offers 4 options for building ML models
- BQ ML: as we saw above
- Pre-Built APIs: for models built and trained by Google
- AutoML: point and click NoCode interface
- Custom Training: full flexibility and control over the ML pipeline
Comparison, in depth:
Pre-Built APIs: offered as a service, for Speech-to-Text, natural language, translation, Text-to-Speech, Vision API, Video Intelligence. A minimal sketch below.
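For example, calling the Natural Language API's sentiment analysis with the google-cloud-language client; the input text is made up, and no model training is involved:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

document = language_v1.Document(
    content="The checkout flow was fast and painless.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

# Pre-built model: we only send data; Google owns training and serving.
response = client.analyze_sentiment(document=document)
sentiment = response.document_sentiment
print(f"score={sentiment.score:.2f} magnitude={sentiment.magnitude:.2f}")
```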
AutoML: the goal is to automate ML pipelines. It has two vital parts: transfer learning and neural architecture search. We can upload data into AutoML (from BigQuery, Cloud Storage, etc.); see the sketch below.
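A hedged sketch of kicking off AutoML training through the Vertex AI SDK (google-cloud-aiplatform); the project, dataset, and column names are placeholders:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical

# Upload data into a managed dataset (here from BigQuery; Cloud Storage
# CSVs work too via gcs_source=...).
dataset = aiplatform.TabularDataset.create(
    display_name="sessions",
    bq_source="bq://my-project.my_dataset.sessions",
)

# The point-and-click console flow maps to this no-code training job;
# AutoML handles architecture search and tuning internally.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="purchase-automl",
    optimization_prediction_type="classification",
)
model = job.run(
    dataset=dataset,
    target_column="purchased",
    budget_milli_node_hours=1000,
)
print(model.resource_name)
```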
Custom model: Vertex AI Workbench can work as an AI development environment. Can use pre-built vs. custom containers.
Vertex AI: unified platform that brings all components of the ML workflow together. Provides Feature Store, hyperparameter tuning (Vizier), explainable AI (XAI), and pipelines.
AutoML-based example:
- Data preparation: upload data, provide a name, select data type and objective
- Feature engineering: using Feature Store, which makes features shareable, reusable, scalable
- Model training and model eval: VAI provides an extensive set of metrics after training via AutoML
- Model serving:
  - Serving = deployment + monitoring
  - Deploy options: endpoint for real-time predictions, batch prediction, offline/edge prediction (see the deployment sketch below)
  - Monitoring: VAI Pipelines
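A hedged sketch of the real-time and batch serving options with the Vertex AI SDK; the model ID, instance payload, and table names are invented:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical

# Look up an already-trained model by its (made-up) resource ID.
model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/1234567890")

# Real-time option: deploy to an endpoint, then send online requests.
endpoint = model.deploy(machine_type="n1-standard-4", min_replica_count=1)
prediction = endpoint.predict(instances=[{"country": "NL", "pageviews": 12}])
print(prediction.predictions)

# Batch option: asynchronous predictions over a whole BigQuery table.
batch_job = model.batch_predict(
    job_display_name="nightly-scoring",
    bigquery_source="bq://my-project.my_dataset.new_sessions",
    bigquery_destination_prefix="bq://my-project.my_dataset",
)
print(batch_job.state)  # blocks until done by default, then reports state
```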