add gke hyperdisk support #3210

Open. Wants to merge 2 commits into base: develop
218 changes: 218 additions & 0 deletions examples/gke-storage-hyperdisk.yaml
@@ -0,0 +1,218 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
blueprint_name: gke-storage-hyperdisk
vars:
project_id: ## Set GCP Project ID Here ##
deployment_name: gke-storage-hyperdisk
region: us-central1
zone: us-central1-c

# CIDR block containing the IP of the machine calling Terraform.
# The following line must be updated for this example to work.
authorized_cidr: <your-ip-address>/32

deployment_groups:
- group: primary
modules:
- id: network
source: modules/network/vpc
settings:
subnetwork_name: gke-subnet-hyperdisk
secondary_ranges:
gke-subnet-hyperdisk:
- range_name: pods
ip_cidr_range: 10.4.0.0/14
- range_name: services
ip_cidr_range: 10.0.32.0/20

- id: gke_cluster
source: modules/scheduler/gke-cluster
use: [network]
settings:
enable_persistent_disk_csi: true # enable Hyperdisk for the cluster
configure_workload_identity_sa: true
enable_private_endpoint: false # Allows for access from authorized public IPs
master_authorized_networks:
- display_name: deployment-machine
cidr_block: $(vars.authorized_cidr)
outputs: [instructions]

### Set up storage class and persistent volume claim for Hyperdisk ###
- id: hyperdisk-balanced-setup
source: modules/file-system/gke-storage
use: [gke_cluster]
settings:
storage_type: Hyperdisk-balanced
access_mode: ReadWriteOnce
sc_volume_binding_mode: Immediate
sc_reclaim_policy: Delete
sc_topology_zones: [$(vars.zone)]
pvc_count: 1
capacity_gb: 100

- id: hyperdisk-throughput-setup
source: modules/file-system/gke-storage
use: [gke_cluster]
settings:
storage_type: Hyperdisk-throughput
access_mode: ReadWriteOnce
sc_volume_binding_mode: Immediate
sc_reclaim_policy: Delete
sc_topology_zones: [$(vars.zone)]
pvc_count: 1
capacity_gb: 5000

- id: hyperdisk-extreme-setup
source: modules/file-system/gke-storage
use: [gke_cluster]
settings:
storage_type: Hyperdisk-extreme
access_mode: ReadWriteOnce
[Review comment from Contributor]: Is this the default mode?

[Review comment from Contributor]: If yes, perhaps we can remove this line.
sc_volume_binding_mode: Immediate
sc_reclaim_policy: Delete
sc_topology_zones: [$(vars.zone)]
pvc_count: 1
capacity_gb: 100

- id: sample-pool
source: modules/compute/gke-node-pool
use: [gke_cluster]
settings:
name: sample-pool
zones: [$(vars.zone)]
machine_type: c3-standard-88 # Hyperdisk-extreme requires a C3 machine with 88 or more vCPUs

# Train a TensorFlow model with Keras and Hyperdisk Balanced on GKE
# Tutorial: https://cloud.google.com/parallelstore/docs/tensorflow-sample
- id: hyperdisk-balanced-job
source: modules/compute/gke-job-template
use:
- gke_cluster
- hyperdisk-balanced-setup
settings:
name: tensorflow
image: jupyter/tensorflow-notebook@sha256:173f124f638efe870bb2b535e01a76a80a95217e66ed00751058c51c09d6d85d
security_context: # ensure the job has sufficient permissions to run and to read/write the Hyperdisk volume
- key: runAsUser
value: 1000
- key: runAsGroup
value: 100
- key: fsGroup
value: 100
command:
- bash
- -c
- |
pip install transformers datasets
python - <<EOF
from datasets import load_dataset
dataset = load_dataset("glue", "cola", cache_dir='/data/hyperdisk-balanced-pvc-0')
dataset = dataset["train"]
from transformers import AutoTokenizer
import numpy as np
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True)
tokenized_data = dict(tokenized_data)
labels = np.array(dataset["label"])
from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
model.compile(optimizer=Adam(3e-5))
model.fit(tokenized_data, labels)
EOF
node_count: 1
outputs: [instructions]

# Train a TensorFlow model with Keras and Hyperdisk Extreme on GKE
# Tutorial: https://cloud.google.com/parallelstore/docs/tensorflow-sample
- id: hyperdisk-extreme-job
source: modules/compute/gke-job-template
use:
- gke_cluster
- hyperdisk-extreme-setup
settings:
name: tensorflow
image: jupyter/tensorflow-notebook@sha256:173f124f638efe870bb2b535e01a76a80a95217e66ed00751058c51c09d6d85d
security_context: # ensure the job has sufficient permissions to run and to read/write the Hyperdisk volume
- key: runAsUser
value: 1000
- key: runAsGroup
value: 100
- key: fsGroup
value: 100
command:
- bash
- -c
- |
pip install transformers datasets
python - <<EOF
from datasets import load_dataset
dataset = load_dataset("glue", "cola", cache_dir='/data/hyperdisk-extreme-pvc-0')
dataset = dataset["train"]
from transformers import AutoTokenizer
import numpy as np
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True)
tokenized_data = dict(tokenized_data)
labels = np.array(dataset["label"])
from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
model.compile(optimizer=Adam(3e-5))
model.fit(tokenized_data, labels)
EOF
node_count: 1
outputs: [instructions]

# Train a TensorFlow model with Keras and Hyperdisk Throughput on GKE
# Tutorial: https://cloud.google.com/parallelstore/docs/tensorflow-sample
- id: hyperdisk-throughput-job
source: modules/compute/gke-job-template
use:
- gke_cluster
- hyperdisk-throughput-setup
settings:
name: tensorflow
image: jupyter/tensorflow-notebook@sha256:173f124f638efe870bb2b535e01a76a80a95217e66ed00751058c51c09d6d85d
security_context: # ensure the job has sufficient permissions to run and to read/write the Hyperdisk volume
- key: runAsUser
value: 1000
- key: runAsGroup
value: 100
- key: fsGroup
value: 100
command:
- bash
- -c
- |
pip install transformers datasets
python - <<EOF
from datasets import load_dataset
dataset = load_dataset("glue", "cola", cache_dir='/data/hyperdisk-throughput-pvc-0')
dataset = dataset["train"]
from transformers import AutoTokenizer
import numpy as np
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True)
tokenized_data = dict(tokenized_data)
labels = np.array(dataset["label"])
from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")
model.compile(optimizer=Adam(3e-5))
model.fit(tokenized_data, labels)
EOF
node_count: 1
outputs: [instructions]
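The blueprint above can be deployed and checked end to end. The following is a command sketch only: the `gcluster` binary name and flags are assumptions based on the general Cluster Toolkit workflow (older releases call it `ghpc`), and the cluster name is assumed to match the blueprint's `deployment_name` and `region` vars; adjust for your checkout and project.

```shell
# Command sketch only: binary name (gcluster), flags, and the derived
# cluster name are assumptions; adjust for your environment.
./gcluster deploy examples/gke-storage-hyperdisk.yaml \
    --vars project_id=<your-project>,authorized_cidr=<your-ip-address>/32

# Pull cluster credentials, then confirm the Hyperdisk storage classes
# and their PVCs were created and bound (Immediate binding mode
# provisions the volumes without waiting for a consumer pod).
gcloud container clusters get-credentials gke-storage-hyperdisk \
    --region us-central1 --project <your-project>
kubectl get storageclass | grep hyperdisk
kubectl get pvc
```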
2 changes: 1 addition & 1 deletion modules/file-system/gke-storage/README.md
@@ -119,7 +119,7 @@ No resources.
| <a name="input_sc_reclaim_policy"></a> [sc\_reclaim\_policy](#input\_sc\_reclaim\_policy) | Indicate whether to keep the dynamically provisioned PersistentVolumes of this storage class after the bound PersistentVolumeClaim is deleted.<br/>[More details about reclaiming](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming)<br/>Supported value:<br/>- Retain<br/>- Delete | `string` | n/a | yes |
| <a name="input_sc_topology_zones"></a> [sc\_topology\_zones](#input\_sc\_topology\_zones) | Zone location that allow the volumes to be dynamically provisioned. | `list(string)` | `null` | no |
| <a name="input_sc_volume_binding_mode"></a> [sc\_volume\_binding\_mode](#input\_sc\_volume\_binding\_mode) | Indicates when volume binding and dynamic provisioning should occur and how PersistentVolumeClaims should be provisioned and bound.<br/>Supported value:<br/>- Immediate<br/>- WaitForFirstConsumer | `string` | `"WaitForFirstConsumer"` | no |
| <a name="input_storage_type"></a> [storage\_type](#input\_storage\_type) | The type of [GKE supported storage options](https://cloud.google.com/kubernetes-engine/docs/concepts/storage-overview)<br/>to used. This module currently support dynamic provisioning for the below storage options<br/>- Parallelstore | `string` | n/a | yes |
| <a name="input_storage_type"></a> [storage\_type](#input\_storage\_type) | The type of [GKE supported storage options](https://cloud.google.com/kubernetes-engine/docs/concepts/storage-overview)<br/>to used. This module currently support dynamic provisioning for the below storage options<br/>- Parallelstore<br/>- Hyperdisk-balanced<br/>- Hyperdisk-throughput<br/>- Hyperdisk-extreme | `string` | n/a | yes |

## Outputs

@@ -0,0 +1,15 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ${pvc_name}
labels:
%{~ for key, val in labels ~}
${key}: ${val}
%{~ endfor ~}
spec:
accessModes:
- ${access_mode}
resources:
requests:
storage: ${capacity}
storageClassName: ${storage_class_name}
@@ -0,0 +1,15 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ${pvc_name}
labels:
%{~ for key, val in labels ~}
${key}: ${val}
%{~ endfor ~}
spec:
accessModes:
- ${access_mode}
resources:
requests:
storage: ${capacity}
storageClassName: ${storage_class_name}
@@ -0,0 +1,15 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ${pvc_name}
labels:
%{~ for key, val in labels ~}
${key}: ${val}
%{~ endfor ~}
spec:
accessModes:
- ${access_mode}
resources:
requests:
storage: ${capacity}
storageClassName: ${storage_class_name}
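The three PVC templates above are identical apart from the values Terraform substitutes via `templatefile()`. Purely as an illustration (not part of the module), here is how one renders with hypothetical values; shell variables stand in for the template variables, and the labels loop is omitted for brevity:

```shell
# Illustrative rendering of the PVC template with hypothetical values.
pvc_name="hyperdisk-balanced-pvc-0"
access_mode="ReadWriteOnce"
capacity="100Gi"
storage_class_name="hyperdisk-balanced-sc"

render_pvc() {
cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${pvc_name}
spec:
  accessModes:
  - ${access_mode}
  resources:
    requests:
      storage: ${capacity}
  storageClassName: ${storage_class_name}
EOF
}

render_pvc
```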
@@ -0,0 +1,25 @@
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ${name}
labels:
%{~ for key, val in labels ~}
${key}: ${val}
%{~ endfor ~}
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: true
parameters:
type: hyperdisk-balanced
provisioned-throughput-on-create: "250Mi"
provisioned-iops-on-create: "7000"
volumeBindingMode: ${volume_binding_mode}
reclaimPolicy: ${reclaim_policy}
%{~ if topology_zones != null ~}
allowedTopologies:
- matchLabelExpressions:
- key: topology.gke.io/zone
values:
%{~ for z in topology_zones ~}
- ${z}
%{~ endfor ~}
%{~ endif ~}
@@ -0,0 +1,24 @@
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ${name}
labels:
%{~ for key, val in labels ~}
${key}: ${val}
%{~ endfor ~}
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: true
parameters:
type: hyperdisk-extreme
provisioned-iops-on-create: "50000"
volumeBindingMode: ${volume_binding_mode}
reclaimPolicy: ${reclaim_policy}
%{~ if topology_zones != null ~}
allowedTopologies:
- matchLabelExpressions:
- key: topology.gke.io/zone
values:
%{~ for z in topology_zones ~}
- ${z}
%{~ endfor ~}
%{~ endif ~}
@@ -0,0 +1,24 @@
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: ${name}
labels:
%{~ for key, val in labels ~}
${key}: ${val}
%{~ endfor ~}
provisioner: pd.csi.storage.gke.io
allowVolumeExpansion: true
parameters:
type: hyperdisk-throughput
provisioned-throughput-on-create: "250Mi"
volumeBindingMode: ${volume_binding_mode}
reclaimPolicy: ${reclaim_policy}
%{~ if topology_zones != null ~}
allowedTopologies:
- matchLabelExpressions:
- key: topology.gke.io/zone
values:
%{~ for z in topology_zones ~}
- ${z}
%{~ endfor ~}
%{~ endif ~}
7 changes: 5 additions & 2 deletions modules/file-system/gke-storage/variables.tf
@@ -34,12 +34,15 @@ variable "storage_type" {
The type of [GKE supported storage options](https://cloud.google.com/kubernetes-engine/docs/concepts/storage-overview)
to used. This module currently support dynamic provisioning for the below storage options
- Parallelstore
- Hyperdisk-balanced
- Hyperdisk-throughput
- Hyperdisk-extreme
EOT
type = string
nullable = false
validation {
condition = var.storage_type == null ? false : contains(["parallelstore"], lower(var.storage_type))
error_message = "Allowed string values for var.storage_type are \"Parallelstore\"."
condition = var.storage_type == null ? false : contains(["parallelstore", "hyperdisk-balanced", "hyperdisk-throughput", "hyperdisk-extreme"], lower(var.storage_type))
error_message = "Allowed string values for var.storage_type are \"Parallelstore\", \"Hyperdisk-balanced\", \"Hyperdisk-throughput\", \"Hyperdisk-extreme\"."
}
}
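The validation above accepts the four storage types case-insensitively: it lower-cases `var.storage_type` and tests membership in the allowed list, rejecting null. As a quick illustration (shell standing in for HCL), the check is equivalent to:

```shell
# Sketch of the case-insensitive membership check performed by the
# validation block above; function name is for illustration only.
is_valid_storage_type() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    parallelstore|hyperdisk-balanced|hyperdisk-throughput|hyperdisk-extreme)
      echo true ;;
    *)
      echo false ;;
  esac
}

is_valid_storage_type "Hyperdisk-balanced"   # prints true
is_valid_storage_type "pd-ssd"               # prints false
```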
