Merge pull request aws-samples#639 from bkgardiner/main
Adding Inferentia AI/ML Module to Workshop
niallthomson authored Aug 23, 2023
2 parents 5393786 + 2e97b9b commit a8ccb03
Showing 29 changed files with 751 additions and 31 deletions.
60 changes: 35 additions & 25 deletions governance/steering.md
@@ -1,49 +1,59 @@
# Steering Committee and Module Leads

## Steering Committee Members

The Steering Committee is a six-member body that oversees the governance of the EKS Workshop.

### Terms end in February 2024

| Name | Profile | Role |
| :------------ | :----------------------------------------------- | :------------------------------------------ |
| Sai Vennam | [@svennam92](https://github.com/svennam92) | Principal EKS DA |
| Niall Thomson | [@niallthomson](https://github.com/niallthomson) | Specialist Solution Architect, Containers |
| Ray Krueger | [@raykrueger](https://github.com/raykrueger) | Principal Container Specialist |
| Ameet Naik | [@ameetnaik](https://github.com/ameetnaik) | Technical Account Manager |
| Kamran Habib | [@kmhabib](https://github.com/kmhabib) | Solution Architect (TFC at large) |
| Theo Salvo | [@buzzsurfr](https://github.com/buzzsurfr) | Container Specialist (TFC core team member) |

## Working Groups

Each working group is led by a chair and maintainers, all serving 6-month terms.

| Working Group | Chair | Maintainers |
| :--------------- | :------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------- |
| Infrastructure | [Niall Thomson](https://github.com/niallthomson) | |
| Fundamentals | [Sai Vennam](https://github.com/svennam92) | [Bijith Nair](https://github.com/bijithnair), [Tolu Okuboyejo](https://github.com/oktab1), [Hemanth AVS](https://github.com/hemanth-avs) |
| Autoscaling | [Sanjeev Ganjihal](https://github.com/sanjeevrg89) | |
| Automation | [Carlos Santana](https://github.com/csantanapr) | [Tsahi Duek](https://github.com/tsahiduek), [Christina Andonov](https://github.com/candonov), [Sébastien Allamand](https://github.com/allamand) |
| Machine Learning | [Masatoshi Hayashi](https://github.com/literalice) | [Benjamin Gardiner](https://github.com/bkgardiner) |
| Networking | [Sheetal Joshi](https://github.com/sheetaljoshi) | [Umair Ishaq](https://github.com/umairishaq) |
| Observability | [Nirmal Mehta](https://github.com/normalfaults) | [Steven David](https://github.com/StevenDavid) |
| Security | [Rodrigo Bersa](https://github.com/rodrigobersa) | |
| Storage | [Eric Heinrichs](https://github.com/heinrichse) | [Andrew Peng](https://github.com/pengc99) |

## Wranglers

Wranglers will work across all topic areas and serve for at least 6 months.

| Name         | Profile                                | Role                                      |
| :----------- | :------------------------------------- | :---------------------------------------- |
| Math Bruneau | [@ROunofF](https://github.com/ROunofF) | Specialist Solution Architect, Containers |

## Emeritus

| Name | Profile | Role |
| :----------- | :------------------------------------- | :------------- |
| Jeremy Cowan | [@jicowan](https://github.com/jicowan) | EKS DA manager |

## Meetings

### Schedule and Cadence

The steering committee will host a public meeting on the third Thursday of each month at 9 AM CT. <!--update with Chime link-->

### Resources

- <!--add links to meeting notes and recordings-->

## Contact

- Mailing List: <[email protected]>
25 changes: 25 additions & 0 deletions manifests/modules/aiml/inferentia/.workshop/cleanup.sh
@@ -0,0 +1,25 @@
#!/bin/bash

set -e

echo "Deleting AIML resources..."

kubectl delete namespace aiml --ignore-not-found > /dev/null

echo "Deleting Karpenter provisioners..."

kubectl delete provisioner --all > /dev/null
kubectl delete awsnodetemplate --all > /dev/null

echo "Waiting for Karpenter nodes to be removed..."

EXIT_CODE=0

timeout --foreground -s TERM 30 bash -c \
'while [[ $(kubectl get nodes --selector=type=karpenter -o json | jq -r ".items | length") -gt 0 ]];\
do sleep 5;\
done' || EXIT_CODE=$?

if [ $EXIT_CODE -ne 0 ]; then
  echo "Warning: Karpenter nodes did not clean up"
fi
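
The wait loop in the cleanup script polls the node count every 5 seconds and gives up after 30 seconds with a warning rather than a failure. A minimal Python sketch of the same poll-until-zero pattern is shown below. The function name `wait_until_zero` and its parameters are hypothetical, and unlike the shell `timeout` command, this version only checks the deadline between polls instead of interrupting a sleep.

```python
import time
from typing import Callable


def wait_until_zero(
    count_nodes: Callable[[], int],
    timeout_s: float = 30.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll count_nodes() until it returns 0 or timeout_s has elapsed.

    Returns True if the count reached zero, False on timeout.
    count_nodes is a stand-in (assumption) for counting Karpenter nodes.
    """
    deadline = clock() + timeout_s
    while count_nodes() > 0:
        if clock() >= deadline:
            return False  # caller decides whether to warn or fail
        sleep(interval_s)
    return True


if __name__ == "__main__":
    remaining = iter([2, 1, 0])  # simulated node counts across three polls
    ok = wait_until_zero(lambda: next(remaining), sleep=lambda _: None)
    print("nodes removed" if ok else "Warning: Karpenter nodes did not clean up")
```

Returning a boolean instead of raising mirrors the script's choice to print a warning and continue when nodes are still present.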
128 changes: 128 additions & 0 deletions manifests/modules/aiml/inferentia/.workshop/terraform/addon.tf
@@ -0,0 +1,128 @@
data "aws_subnets" "private" {
  tags = {
    created-by = "eks-workshop-v2"
    env        = local.addon_context.eks_cluster_id
  }

  filter {
    name   = "tag:Name"
    values = ["*Private*"]
  }
}

module "iam_assumable_role_inference" {
  source                        = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version                       = "~> v5.5.0"
  create_role                   = true
  role_name                     = "${local.addon_context.eks_cluster_id}-inference"
  provider_url                  = local.addon_context.eks_oidc_issuer_url
  role_policy_arns              = [aws_iam_policy.inference.arn]
  oidc_fully_qualified_subjects = ["system:serviceaccount:aiml:inference"]

  tags = local.tags
}


resource "aws_iam_policy" "inference" {
  name        = "${local.addon_context.eks_cluster_id}-inference"
  path        = "/"
  description = "IAM policy for the inference workload"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::${aws_s3_bucket.inference.id}",
        "arn:aws:s3:::${aws_s3_bucket.inference.id}/*"
      ]
    }
  ]
}
EOF
}
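
The policy above lists two resource ARNs because bucket-level actions (such as listing) apply to the bucket ARN, while object-level actions apply to the `/*` form. As a rough illustration only, the same document can be rendered in Python for a given bucket name. The function `inference_bucket_policy` and the bucket name used here are hypothetical and not part of this commit.

```python
import json


def inference_bucket_policy(bucket_name: str) -> str:
    """Render an IAM policy document allowing all S3 actions on one bucket.

    Mirrors the heredoc in the Terraform resource; bucket_name stands in
    for the interpolated aws_s3_bucket.inference.id value.
    """
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",    # the bucket itself
                    f"arn:aws:s3:::{bucket_name}/*",  # every object in it
                ],
            }
        ],
    }
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    # Hypothetical bucket name for illustration only.
    print(inference_bucket_policy("eksworkshop-inference-example"))
```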

module "karpenter" {
  source        = "github.com/aws-ia/terraform-aws-eks-blueprints?ref=v4.25.0//modules/kubernetes-addons/karpenter"
  addon_context = merge(local.addon_context, { default_repository = local.amazon_container_image_registry_uris[data.aws_region.current.name] })

  node_iam_instance_profile = aws_iam_instance_profile.karpenter_node.name

  helm_config = {
    set = [{
      name  = "replicas"
      value = "1"
    }]
  }
}

resource "aws_iam_instance_profile" "karpenter_node" {
  name = "${local.addon_context.eks_cluster_id}-karpenter-node"
  role = aws_iam_role.karpenter_node.name
}

resource "aws_iam_role" "karpenter_node" {
  name = "${local.addon_context.eks_cluster_id}-karpenter-node"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Sid    = ""
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      },
    ]
  })

  managed_policy_arns = [
    "arn:${local.addon_context.aws_partition_id}:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:${local.addon_context.aws_partition_id}:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:${local.addon_context.aws_partition_id}:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
    "arn:${local.addon_context.aws_partition_id}:iam::aws:policy/AmazonSSMManagedInstanceCore"
  ]

  tags = local.tags
}

data "http" "neuron_device_plugin_rbac_manifest" {
  url = "https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/v2.6.0/src/k8/k8s-neuron-device-plugin-rbac.yml"
}

data "http" "neuron_device_plugin_manifest" {
  url = "https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/v2.6.0/src/k8/k8s-neuron-device-plugin.yml"
}

data "kubectl_file_documents" "neuron_device_plugin_rbac_doc" {
  content = data.http.neuron_device_plugin_rbac_manifest.response_body
}

data "kubectl_file_documents" "neuron_device_plugin_doc" {
  content = data.http.neuron_device_plugin_manifest.response_body
}

resource "kubectl_manifest" "neuron_device_plugin_rbac" {
  for_each  = data.kubectl_file_documents.neuron_device_plugin_rbac_doc.manifests
  yaml_body = each.value
}

resource "kubectl_manifest" "neuron_device_plugin" {
  for_each  = data.kubectl_file_documents.neuron_device_plugin_doc.manifests
  yaml_body = each.value
}

output "environment" {
  value = <<EOF
export AIML_NEURON_ROLE_ARN=${module.iam_assumable_role_inference.iam_role_arn}
export AIML_NEURON_BUCKET_NAME=${resource.aws_s3_bucket.inference.id}
export AIML_DL_IMAGE=763104351884.dkr.ecr.${data.aws_region.current.name}.amazonaws.com/pytorch-inference-neuron:1.13.1-neuron-py310-sdk2.12.0-ubuntu20.04
export AIML_SUBNETS=${data.aws_subnets.private.ids[0]},${data.aws_subnets.private.ids[1]},${data.aws_subnets.private.ids[2]}
export KARPENTER_NODE_ROLE="${aws_iam_role.karpenter_node.arn}"
EOF
}
@@ -0,0 +1,6 @@
resource "aws_s3_bucket" "inference" {
  bucket_prefix = "eksworkshop-inference"
  force_destroy = true

  tags = local.tags
}
1 change: 1 addition & 0 deletions manifests/modules/aiml/inferentia/base/config.properties
@@ -0,0 +1 @@
AIML_NEURON_ROLE_ARN
25 changes: 25 additions & 0 deletions manifests/modules/aiml/inferentia/base/kustomization.yaml
@@ -0,0 +1,25 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
configMapGenerator:
  - name: base-vars
    namespace: aiml
    env: config.properties
    options:
      disableNameSuffixHash: true
replacements:
  - source:
      kind: ConfigMap
      name: base-vars
      version: v1
      namespace: aiml
      fieldPath: data.AIML_NEURON_ROLE_ARN
    targets:
      - select:
          kind: ServiceAccount
          name: inference
          namespace: aiml
        fieldPaths:
          - metadata.annotations.[eks.amazonaws.com/role-arn]
resources:
  - serviceaccount.yaml
  - namespace.yaml
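
The `replacements` entry above copies one field from the generated ConfigMap into the ServiceAccount's role annotation. The toy Python model below illustrates that copy on plain dictionaries. It is not how kustomize is implemented, and the function `apply_replacement` and the example ARN are hypothetical.

```python
from typing import Any, Dict


def apply_replacement(
    source: Dict[str, Any],
    source_key: str,
    target: Dict[str, Any],
    annotation: str,
) -> Dict[str, Any]:
    """Copy source['data'][source_key] into target's metadata.annotations.

    Returns a patched copy; the input target is left unmodified.
    """
    value = source["data"][source_key]
    metadata = dict(target.get("metadata", {}))
    annotations = dict(metadata.get("annotations", {}))
    annotations[annotation] = value
    metadata["annotations"] = annotations
    return {**target, "metadata": metadata}


if __name__ == "__main__":
    config_map = {
        "kind": "ConfigMap",
        "metadata": {"name": "base-vars", "namespace": "aiml"},
        # Hypothetical ARN used only for illustration.
        "data": {"AIML_NEURON_ROLE_ARN": "arn:aws:iam::111122223333:role/example-inference"},
    }
    service_account = {
        "kind": "ServiceAccount",
        "metadata": {
            "name": "inference",
            "namespace": "aiml",
            "annotations": {"eks.amazonaws.com/role-arn": "${AIML_NEURON_ROLE_ARN}"},
        },
    }
    patched = apply_replacement(
        config_map, "AIML_NEURON_ROLE_ARN", service_account, "eks.amazonaws.com/role-arn"
    )
    print(patched["metadata"]["annotations"]["eks.amazonaws.com/role-arn"])
```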
4 changes: 4 additions & 0 deletions manifests/modules/aiml/inferentia/base/namespace.yaml
@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
  name: aiml
7 changes: 7 additions & 0 deletions manifests/modules/aiml/inferentia/base/serviceaccount.yaml
@@ -0,0 +1,7 @@
apiVersion: v1
kind: ServiceAccount
metadata:
  name: inference
  namespace: aiml
  annotations:
    eks.amazonaws.com/role-arn: ${AIML_NEURON_ROLE_ARN}
16 changes: 16 additions & 0 deletions manifests/modules/aiml/inferentia/compiler/compiler.yaml
@@ -0,0 +1,16 @@
apiVersion: v1
kind: Pod
metadata:
  labels:
    role: compiler
  name: compiler
  namespace: aiml
spec:
  containers:
    - command:
        - sh
        - -c
        - sleep infinity
      image: ${AIML_DL_IMAGE}
      name: compiler
  serviceAccountName: inference
@@ -0,0 +1 @@
AIML_DL_IMAGE
26 changes: 26 additions & 0 deletions manifests/modules/aiml/inferentia/compiler/kustomization.yaml
@@ -0,0 +1,26 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
  - ../base
configMapGenerator:
  - name: compiler-vars
    namespace: aiml
    env: config.properties
    options:
      disableNameSuffixHash: true
replacements:
  - source:
      kind: ConfigMap
      name: compiler-vars
      version: v1
      namespace: aiml
      fieldPath: data.AIML_DL_IMAGE
    targets:
      - select:
          kind: Pod
          name: compiler
          namespace: aiml
        fieldPaths:
          - spec.containers.0.image
resources:
  - compiler.yaml
17 changes: 17 additions & 0 deletions manifests/modules/aiml/inferentia/compiler/trace.py
@@ -0,0 +1,17 @@
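import torch
import numpy as np
import os
import torch_neuron  # importing this registers the torch.neuron namespace used below
from torchvision import models

# Dummy input in the shape ResNet50 expects: (batch, channels, height, width)
image = torch.zeros([1, 3, 224, 224], dtype=torch.float32)

# Load a pretrained ResNet50 model
model = models.resnet50(pretrained=True)

# Put the model in evaluation mode (not training) before tracing
model.eval()
# Compile the model for Inferentia by tracing it with the example input
model_neuron = torch.neuron.trace(model, example_inputs=[image])

# Save the compiled model to disk
model_neuron.save("resnet50_neuron.pt")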
@@ -0,0 +1 @@
AIML_DL_IMAGE