This document is provided for informational purposes only. It represents the current product offerings and practices from Amazon Web Services (AWS) as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS products or services, each of which is provided “as is” without warranty of any kind, whether express or implied. This document does not create any warranties, representations, contractual commitments, conditions, or assurances from AWS, its affiliates, suppliers, or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.
© 2024 Amazon Web Services, Inc. or its affiliates. All Rights Reserved. This work is licensed under a Creative Commons Attribution 4.0 International License.
This AWS Content is provided subject to the terms of the AWS Customer Agreement available at http://aws.amazon.com/agreement or other written agreement between the Customer and either Amazon Web Services, Inc. or Amazon Web Services EMEA SARL or both.
Author: Author Name
Approver: Approver Name
Last Date Approved:
As part of our ongoing commitment to customers, AWS is providing this security incident response playbook that describes the steps needed to investigate security events where Amazon SageMaker is either the source or a target of unauthorized use within your AWS account(s). The purpose of this document is to provide prescriptive guidance on the actions to take once you suspect a security event has taken place.
Aspects of AWS incident response
Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and confidently build, train, and deploy ML models into a production-ready hosted environment. It provides a UI experience for running ML workflows that makes SageMaker ML tools available across multiple integrated development environments (IDEs).
With SageMaker, you can store and share your data without having to build and manage your own servers. With built-in support for bring-your-own-algorithms and frameworks, SageMaker offers flexible distributed training options that adjust to specific workflows. For additional information, review the developer guide for Amazon SageMaker here: https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
Proactively prepare your environment by implementing preventative controls (service control policies) and detective controls (see the Detection section for AWS Config rules).
Preventative (SCP)
- VPC Deployment
- For example, the following SCP will prevent users from launching any notebooks, training, or processing jobs unless a VPC is specified.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VPCDeployment",
      "Effect": "Deny",
      "Action": [
        "sagemaker:CreateHyperParameterTuningJob",
        "sagemaker:CreateModel",
        "sagemaker:CreateNotebookInstance",
        "sagemaker:CreateProcessingJob",
        "sagemaker:CreateTrainingJob"
      ],
      "Resource": [
        "*"
      ],
      "Condition": {
        "Null": {
          "sagemaker:VpcSecurityGroupIds": "true",
          "sagemaker:VpcSubnets": "true"
        }
      }
    }
  ]
}
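A note on how the `Null` block in the VPC policy is evaluated: when one condition operator lists several keys, IAM combines them with a logical AND, so the statement denies a request only when both `sagemaker:VpcSecurityGroupIds` and `sagemaker:VpcSubnets` are absent. The Python sketch below models only that single rule to make the behavior concrete. `null_condition_matches` is a hypothetical helper, not part of any AWS SDK, and it does not reproduce the full policy evaluation logic.

```python
def null_condition_matches(null_block, request_context):
    """Return True when every key in a "Null" condition block matches.

    null_block maps a condition key to "true" (key must be absent) or
    "false" (key must be present). Keys under one operator are ANDed.
    """
    for key, expected in null_block.items():
        key_is_absent = key not in request_context
        must_be_absent = str(expected).lower() == "true"
        if key_is_absent != must_be_absent:
            return False
    return True


VPC_NULL_BLOCK = {
    "sagemaker:VpcSecurityGroupIds": "true",
    "sagemaker:VpcSubnets": "true",
}

# Hypothetical request contexts for a CreateTrainingJob call.
no_vpc = {}
with_vpc = {
    "sagemaker:VpcSecurityGroupIds": ["sg-EXAMPLE"],
    "sagemaker:VpcSubnets": ["subnet-EXAMPLE"],
}

denied_without_vpc = null_condition_matches(VPC_NULL_BLOCK, no_vpc)  # True: Deny applies
denied_with_vpc = null_condition_matches(VPC_NULL_BLOCK, with_vpc)   # False: Deny does not apply
```

If you need the Deny to apply when either key is missing, one option is to split the two keys into separate statements; test any such change in a non-production organizational unit first.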
- Enforce job encryption
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedVolumes",
      "Effect": "Deny",
      "Action": [
        "sagemaker:CreateHyperParameterTuningJob",
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateTransformJob"
      ],
      "Resource": [
        "*"
      ],
      "Condition": {
        "Null": {
          "sagemaker:VolumeKmsKey": [
            "true"
          ]
        }
      }
    }
  ]
}
- Enforce inter-container traffic encryption
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedTraffic",
      "Effect": "Deny",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateHyperParameterTuningJob"
      ],
      "Resource": [
        "*"
      ],
      "Condition": {
        "Bool": {
          "sagemaker:InterContainerTrafficEncryption": "false"
        }
      }
    }
  ]
}
- Enforce network isolation
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyNotIsolated",
      "Effect": "Deny",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateHyperParameterTuningJob",
        "sagemaker:CreateModel"
      ],
      "Resource": "*",
      "Condition": {
        "Bool": {
          "sagemaker:NetworkIsolation": "false"
        }
      }
    }
  ]
}
- Restricting notebook pre-signed URL to IPs
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictUrlToIp",
      "Effect": "Deny",
      "Action": "sagemaker:CreatePresignedNotebookInstanceUrl",
      "Resource": "*",
      "Condition": {
        "ForAllValues:NotIpAddress": {
          "aws:SourceIp": [
            "[ENTER_PUBLIC_IP_ADDRESS]"
          ]
        }
      }
    }
  ]
}
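During an investigation, the same IP allow list can be used to review past `CreatePresignedNotebookInstanceUrl` calls. The following Python sketch assumes you have already loaded CloudTrail records as dictionaries and checks the `sourceIPAddress` field against a list of allowed CIDR blocks. The allow list, the sample records, and both function names are hypothetical and for illustration only.

```python
import ipaddress

# Hypothetical allow list; 203.0.113.0/24 is a documentation-only range.
ALLOWED_CIDRS = ["203.0.113.0/24"]


def source_ip_allowed(source_ip, allowed_cidrs):
    """Return True when source_ip falls inside any allowed CIDR block."""
    try:
        address = ipaddress.ip_address(source_ip)
    except ValueError:
        # CloudTrail records a service name (not an IP) for calls made by
        # AWS services; treat anything unparseable as not on the allow list.
        return False
    return any(
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in allowed_cidrs
    )


def presigned_url_calls_outside_allow_list(records, allowed_cidrs):
    """List (eventTime, sourceIPAddress) for presigned URL calls from other IPs."""
    return [
        (r.get("eventTime"), r.get("sourceIPAddress"))
        for r in records
        if r.get("eventName") == "CreatePresignedNotebookInstanceUrl"
        and not source_ip_allowed(r.get("sourceIPAddress", ""), allowed_cidrs)
    ]


# Made-up records that contain only the fields used above.
sample_records = [
    {
        "eventName": "CreatePresignedNotebookInstanceUrl",
        "eventTime": "2024-05-01T10:00:00Z",
        "sourceIPAddress": "203.0.113.25",
    },
    {
        "eventName": "CreatePresignedNotebookInstanceUrl",
        "eventTime": "2024-05-01T10:05:00Z",
        "sourceIPAddress": "198.51.100.7",
    },
]

suspicious = presigned_url_calls_outside_allow_list(sample_records, ALLOWED_CIDRS)
```

Calls flagged this way are only leads: a user on a new network or a service-initiated call will also appear, so correlate each hit with the calling principal before drawing conclusions.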
- Disable Internet Access
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyDirectInternet",
      "Effect": "Deny",
      "Action": "sagemaker:CreateNotebookInstance",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "sagemaker:DirectInternetAccess": [
            "Enabled"
          ]
        }
      }
    }
  ]
}
- Disable Root access in SageMaker Notebooks
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SageMakerDenyRootAccess",
      "Effect": "Deny",
      "Action": [
        "sagemaker:CreateNotebookInstance",
        "sagemaker:UpdateNotebookInstance"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "sagemaker:RootAccess": [
            "Enabled"
          ]
        }
      }
    }
  ]
}
- Restrict the instance types that can be started by users
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SageMakerLimitInstanceTypes",
      "Effect": "Deny",
      "Action": "sagemaker:CreateNotebookInstance",
      "Resource": "*",
      "Condition": {
        "ForAnyValue:StringNotLike": {
          "sagemaker:InstanceTypes": [
            "[EXAMPLE_INSTANCE_TYPES]",
            "ml.c5.xlarge",
            "ml.m5.xlarge",
            "ml.t3.medium"
          ]
        }
      }
    }
  ]
}
- Similarly, for Studio, see the following sample policy. Note that administrators need to allow the system instance for the default Jupyter Server apps.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SageMakerAllowedInstanceTypes",
      "Effect": "Deny",
      "Action": [
        "sagemaker:CreateApp"
      ],
      "Resource": "*",
      "Condition": {
        "ForAnyValue:StringNotLike": {
          "sagemaker:InstanceTypes": [
            "ml.c5.large",
            "ml.m5.large",
            "ml.t3.medium",
            "system"
          ]
        }
      }
    }
  ]
}
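The two instance type policies rely on `ForAnyValue:StringNotLike`: the Deny applies when at least one requested instance type matches none of the listed patterns. The Python sketch below is a simplified model of that comparison, useful for checking an allow list before deploying it. The helper names are hypothetical, only the `*` and `?` wildcards are modeled, and real policy evaluation involves more than this single condition.

```python
import re


def iam_like(pattern, value):
    """Case-sensitive match where * is any run of characters and ? is one character."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value) is not None


def deny_applies(requested_types, allowed_patterns):
    """Model ForAnyValue:StringNotLike.

    True when any requested value matches none of the allowed patterns.
    An empty request list never triggers the Deny.
    """
    return any(
        not any(iam_like(pattern, requested) for pattern in allowed_patterns)
        for requested in requested_types
    )


# Allow list taken from the Studio sample policy.
STUDIO_ALLOWED = ["ml.c5.large", "ml.m5.large", "ml.t3.medium", "system"]

blocks_gpu = deny_applies(["ml.p3.16xlarge"], STUDIO_ALLOWED)   # True: Deny applies
allows_system = deny_applies(["system"], STUDIO_ALLOWED)        # False: Deny does not apply
```

Running a candidate allow list through a check like this shows, for example, why `system` has to stay in the Studio list: without it, the default Jupyter Server app would match no pattern and be denied.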
AWS Config has several managed rules to evaluate SageMaker:
- sagemaker-endpoint-configuration-kms-key-configured
- sagemaker-endpoint-config-prod-instance-count
- sagemaker-notebook-instance-inside-vpc
- sagemaker-notebook-instance-kms-key-configured
- sagemaker-notebook-instance-root-access-check
- sagemaker-notebook-no-direct-internet-access
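As a minimal sketch of how these rules can feed an investigation, the Python below filters a compliance summary down to the SageMaker managed rules that report `NON_COMPLIANT`. It assumes the input has the shape returned by the AWS Config `DescribeComplianceByConfigRule` API (for example, through an SDK client) and that deployed rule names start with the managed rule identifiers listed above, which may not hold if your rules were renamed. The sample response is made up.

```python
SAGEMAKER_MANAGED_RULES = (
    "sagemaker-endpoint-configuration-kms-key-configured",
    "sagemaker-endpoint-config-prod-instance-count",
    "sagemaker-notebook-instance-inside-vpc",
    "sagemaker-notebook-instance-kms-key-configured",
    "sagemaker-notebook-instance-root-access-check",
    "sagemaker-notebook-no-direct-internet-access",
)


def noncompliant_sagemaker_rules(response):
    """Return sorted names of SageMaker rules reported as NON_COMPLIANT."""
    names = []
    for item in response.get("ComplianceByConfigRules", []):
        name = item.get("ConfigRuleName", "")
        status = item.get("Compliance", {}).get("ComplianceType")
        # Prefix match tolerates suffixes added by deployment tooling (assumption).
        if status == "NON_COMPLIANT" and name.startswith(SAGEMAKER_MANAGED_RULES):
            names.append(name)
    return sorted(names)


# Made-up response for illustration.
sample_response = {
    "ComplianceByConfigRules": [
        {
            "ConfigRuleName": "sagemaker-notebook-no-direct-internet-access",
            "Compliance": {"ComplianceType": "NON_COMPLIANT"},
        },
        {
            "ConfigRuleName": "sagemaker-notebook-instance-inside-vpc",
            "Compliance": {"ComplianceType": "COMPLIANT"},
        },
        {
            "ConfigRuleName": "s3-bucket-public-read-prohibited",
            "Compliance": {"ComplianceType": "NON_COMPLIANT"},
        },
    ]
}

findings = noncompliant_sagemaker_rules(sample_response)
```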
In the event of unauthorized access to your environment, see below for possible scenarios tied to relevant SageMaker API calls as a quick reference (please note that this is not a complete list of SageMaker API calls):
Data/Model Exfiltration:
- Copying or downloading sensitive data from SageMaker data stores or model artifacts.
- Extracting model parameters or training data, potentially exposing intellectual property or personal information.
Example API calls:
- DescribeModelPackage to retrieve information about model packages.
- DescribeTrainingJob to access details of training jobs and their output data.
- GetModelPackageModelMetrics to retrieve model metrics and potentially sensitive data.
Data Poisoning:
- Modifying or poisoning trained models by injecting malicious data or adversarial examples.
- Deploying compromised models to SageMaker endpoints, leading to incorrect predictions or malicious outputs.
Example API calls:
- CreateModelPackage or UpdateModelPackage to deploy a compromised model package.
- CreateTransformJob to execute a transform job with a poisoned model.
- CreateEndpointConfig and CreateEndpoint to deploy a malicious model to an endpoint.
- CreateTrainingJob or UpdateTrainingJob to inject malicious data into training jobs.
Resource Misuse:
- Launching unauthorized SageMaker notebooks or instances for cryptomining or other malicious activities.
- Using SageMaker resources as entry points or pivot points for lateral movement within the AWS environment.
Example API calls:
- CreateNotebookInstance or UpdateNotebookInstance to launch unauthorized notebook instances.
- CreateTrainingJob or CreateHyperParameterTuningJob to initiate excessive training jobs.
- CreateEndpoint or UpdateEndpoint to create or modify endpoints for unauthorized purposes.
Denial of Service (DoS):
- Exhausting SageMaker resources (e.g., compute instances, storage) by launching excessive training jobs or endpoints.
- Overwhelming SageMaker APIs or services with a high volume of requests, leading to service disruptions.
Example API calls:
- CreateTrainingJob or CreateHyperParameterTuningJob to launch numerous training jobs and exhaust resources.
- CreateEndpoint or UpdateEndpoint to create multiple endpoints and consume excessive compute resources.
Configuration Changes:
- Modifying SageMaker roles, policies, or permissions to escalate privileges or grant unauthorized access.
- Altering SageMaker VPC configurations, security groups, or network settings to bypass security controls.
Example API calls:
- CreateRole or UpdateRole (IAM API calls) to modify SageMaker roles and permissions.
- CreateNotebookInstanceLifecycleConfig or UpdateNotebookInstanceLifecycleConfig to alter notebook instance configurations.
- CreateEndpointConfig or UpdateEndpointConfig to change endpoint configurations or security settings.
Log Tampering:
- Modifying or deleting SageMaker logs or audit trails to cover tracks and hinder incident investigation.
- Injecting false log entries to mislead security analysts or hide malicious activities.
Example API calls:
- PutModelPackageModelMetrics to inject false model metrics into logs.
- StopTrainingJob or StopTransformJob to potentially modify or delete log data.
Malware Deployment:
- Deploying malware or backdoors within SageMaker notebooks or instances for persistent access or data theft.
- Using SageMaker resources to distribute malware or launch attacks against other systems or networks.
Example API calls:
- CreateNotebookInstance or UpdateNotebookInstance to launch notebook instances with malware.
- CreateModelPackage or UpdateModelPackage to deploy model packages containing malicious code.
Credential Theft:
- Stealing AWS credentials or SageMaker API keys stored within notebooks or instances.
- Using stolen credentials to gain further unauthorized access to other AWS resources or services.
Example API calls:
- DescribeNotebookInstance or DescribeTrainingJob to potentially access stored credentials or API keys.
- GetModelPackageModelMetrics or DescribeModelPackage to retrieve sensitive information or credentials.
Cryptojacking:
- Hijacking SageMaker compute resources (e.g., instances, endpoints) for unauthorized cryptocurrency mining activities.
- Consuming excessive compute resources and potentially leading to service disruptions or increased costs.
Example API calls:
- CreateNotebookInstance or UpdateNotebookInstance to launch instances for cryptocurrency mining.
- CreateTrainingJob or CreateHyperParameterTuningJob to initiate compute-intensive jobs for mining purposes.
Note: These API calls can also be used for legitimate purposes, but in the context of unauthorized access they could be misused to perform risky actions. Implementing robust access controls, monitoring, and auditing is crucial to detecting and preventing such misuse of SageMaker APIs and resources.
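One way to put this quick reference to work is to tag CloudTrail records with the scenarios their event names may relate to. The Python sketch below does that for a subset of the API calls listed above. It assumes the records are already parsed into dictionaries with the standard `eventSource`, `eventName`, and `eventID` fields; the mapping is illustrative, not exhaustive, and a match only marks an event for review, since the same calls occur in normal operation.

```python
from collections import defaultdict

# Subset of the scenario-to-API mapping from this section.
SCENARIO_EVENTS = {
    "Data/Model Exfiltration": ["DescribeModelPackage", "DescribeTrainingJob"],
    "Data Poisoning": [
        "CreateModelPackage", "UpdateModelPackage", "CreateTransformJob",
        "CreateEndpointConfig", "CreateEndpoint",
        "CreateTrainingJob", "UpdateTrainingJob",
    ],
    "Resource Misuse": [
        "CreateNotebookInstance", "UpdateNotebookInstance",
        "CreateTrainingJob", "CreateHyperParameterTuningJob",
        "CreateEndpoint", "UpdateEndpoint",
    ],
    "Log Tampering": ["StopTrainingJob", "StopTransformJob"],
}

# Invert the mapping: event name -> scenarios it may indicate.
EVENT_TO_SCENARIOS = defaultdict(list)
for scenario, names in SCENARIO_EVENTS.items():
    for name in names:
        EVENT_TO_SCENARIOS[name].append(scenario)


def triage(records):
    """Group CloudTrail event IDs by the scenarios their event names map to."""
    hits = defaultdict(list)
    for record in records:
        if record.get("eventSource") != "sagemaker.amazonaws.com":
            continue
        for scenario in EVENT_TO_SCENARIOS.get(record.get("eventName"), []):
            hits[scenario].append(record.get("eventID"))
    return dict(hits)


# Made-up records that contain only the fields used above.
sample_records = [
    {"eventSource": "sagemaker.amazonaws.com", "eventName": "CreateTrainingJob", "eventID": "e1"},
    {"eventSource": "sagemaker.amazonaws.com", "eventName": "DescribeModelPackage", "eventID": "e2"},
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject", "eventID": "e3"},
]

triaged = triage(sample_records)
```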
The screenshots below provide a visual aid to help an incident responder interpret events found during an investigation. Each image shows the action or actions that correspond to the logged event name.
CreateDomain
- Creates a domain. A domain consists of an associated Amazon Elastic File System volume, a list of authorized users, and a variety of security, application, policy, and Amazon Virtual Private Cloud (VPC) configurations. Users within a domain can share notebook files and other artifacts with each other.
SageMaker Domain details
Amazon SageMaker domain supports SageMaker machine learning (ML) environments. A SageMaker domain is composed of the following entities: Domain, User Profile, Shared Space, App
CloudTrail event for CreateDomain
Note that the CreateDomain event in CloudTrail has all of the following information: VPC, subnets, execution role, apps, etc.
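As a hedged illustration of pulling those details out of a CreateDomain record, the Python sketch below summarizes the fields a responder usually wants first. The `requestParameters` key names used here (`domainName`, `vpcId`, `subnetIds`, `appNetworkAccessType`, `defaultUserSettings.executionRole`) are assumptions based on the CreateDomain request parameters in lower camel case; confirm them against an actual event from your trail. The sample event and all identifiers in it are made up.

```python
def summarize_create_domain(event):
    """Extract network and role details from a CreateDomain CloudTrail record."""
    if event.get("eventName") != "CreateDomain":
        raise ValueError("not a CreateDomain event")
    params = event.get("requestParameters") or {}
    defaults = params.get("defaultUserSettings") or {}
    return {
        "caller": (event.get("userIdentity") or {}).get("arn"),
        "source_ip": event.get("sourceIPAddress"),
        "domain_name": params.get("domainName"),        # assumed key name
        "vpc_id": params.get("vpcId"),                  # assumed key name
        "subnet_ids": params.get("subnetIds", []),      # assumed key name
        "network_access": params.get("appNetworkAccessType"),  # assumed key name
        "execution_role": defaults.get("executionRole"),       # assumed key name
    }


# Made-up event for illustration.
sample_event = {
    "eventName": "CreateDomain",
    "sourceIPAddress": "198.51.100.7",
    "userIdentity": {"arn": "arn:aws:sts::111122223333:assumed-role/ExampleRole/session"},
    "requestParameters": {
        "domainName": "example-domain",
        "vpcId": "vpc-EXAMPLE",
        "subnetIds": ["subnet-EXAMPLE"],
        "appNetworkAccessType": "PublicInternetOnly",
        "defaultUserSettings": {
            "executionRole": "arn:aws:iam::111122223333:role/ExampleExecutionRole"
        },
    },
}

summary = summarize_create_domain(sample_event)
```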
CloudTrail event for CreateEndpoint
Note that the CreateEndpoint event for SageMaker in CloudTrail is called by the SageMaker-ExecutionRole service role.
In the event of an incident, in addition to investigating the indicators of compromise, threat actor, timeframe, etc., here are some additional questions to consider once it has been confirmed that this is an incident relating to SageMaker resources:
- Which SageMaker resources were accessed without authorization? (e.g., notebooks, models, endpoints, data stores)
- How was the unauthorized access gained? (e.g., compromised credentials, misconfigured permissions, exploited vulnerabilities)
- What actions were performed on the affected SageMaker resources during the unauthorized access?
- Were any models or data exfiltrated or tampered with?
- If models were accessed, is there a risk of model poisoning or adversarial attacks?
- Were any new resources (e.g., notebooks, endpoints) created or modified during the unauthorized access?
- Were any SageMaker APIs or SDKs used during the unauthorized access, and what actions were performed through them?
- Were any SageMaker logs or audit trails modified or deleted to cover tracks?
- Were any SageMaker roles or IAM policies modified or misused during the incident?
- Were any SageMaker VPC configurations or network settings altered?
- Were any SageMaker notebooks or instances used as entry points or pivot points for further unauthorized access?
- Were any SageMaker resources used to launch attacks or malicious activities against other AWS resources or external systems?
- What is the potential impact of the unauthorized access on the confidentiality, integrity, and availability of SageMaker resources and associated data?
- How can the affected SageMaker resources be securely isolated, backed up, and potentially recovered or rebuilt?
- What specific SageMaker security best practices or configurations were not followed, leading to the unauthorized access?
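Several of these questions (what actions were performed, and whether resources were created or modified) come down to building a timeline of changes for a suspect principal. The Python sketch below is one way to do that over already-parsed CloudTrail records. Treating event names that start with `Create`, `Update`, `Delete`, `Stop`, `Start`, or `Put` as changes is a heuristic chosen for this example, and the principal ARN, time window, and sample records are made up.

```python
from datetime import datetime, timezone

# Heuristic (assumption): these name prefixes indicate a change to a resource.
MUTATING_PREFIXES = ("Create", "Update", "Delete", "Stop", "Start", "Put")


def parse_time(value):
    """Parse a CloudTrail eventTime such as 2024-05-01T10:05:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def mutating_events(records, principal_arn, start, end):
    """Return event names of SageMaker changes by one principal, oldest first."""
    selected = []
    for record in records:
        if record.get("eventSource") != "sagemaker.amazonaws.com":
            continue
        if (record.get("userIdentity") or {}).get("arn") != principal_arn:
            continue
        if not record.get("eventName", "").startswith(MUTATING_PREFIXES):
            continue
        when = parse_time(record["eventTime"])
        if start <= when <= end:
            selected.append((when, record["eventName"]))
    return [name for _, name in sorted(selected)]


SUSPECT = "arn:aws:sts::111122223333:assumed-role/ExampleRole/suspect-session"
OTHER = "arn:aws:sts::111122223333:assumed-role/ExampleRole/other-session"


def _record(name, time, arn):
    return {
        "eventSource": "sagemaker.amazonaws.com",
        "eventName": name,
        "eventTime": time,
        "userIdentity": {"arn": arn},
    }


sample_records = [
    _record("CreateNotebookInstance", "2024-05-01T10:05:00Z", SUSPECT),
    _record("DescribeNotebookInstance", "2024-05-01T10:06:00Z", SUSPECT),
    _record("CreatePresignedNotebookInstanceUrl", "2024-05-01T10:02:00Z", SUSPECT),
    _record("CreateEndpoint", "2024-05-01T10:10:00Z", OTHER),
    _record("DeleteEndpoint", "2024-05-01T12:00:00Z", SUSPECT),
]

changes = mutating_events(
    sample_records,
    SUSPECT,
    parse_time("2024-05-01T10:00:00Z"),
    parse_time("2024-05-01T11:00:00Z"),
)
```

CloudTrail records also carry a `readOnly` field that can replace the prefix heuristic where it is populated.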
If an unauthorized user created resources, or resources were created without authorization, use the following instructions to delete or modify the created resources or permissions:
- How to delete a SageMaker domain (console)
- How to delete a SageMaker domain (CLI)
- How to delete a SageMaker Endpoint
- How to delete a SageMaker Endpoint Configuration
- How to delete a SageMaker Model
- Remove root access from SageMaker notebooks: go to the desired notebook instance, stop the instance, choose Edit once it has stopped, select Disable under Permissions, update the notebook instance, and then start the instance again.
- As an Incident Responder, I need to be able to monitor all critical SageMaker events
- As an Incident Responder, I need a playbook on querying SageMaker CloudTrail events at scale
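For querying at scale, one common approach is to run SQL over CloudTrail logs with Amazon Athena. The Python sketch below only builds the query text; it assumes a CloudTrail table created with the column layout from the AWS documentation (`eventtime`, `eventname`, `eventsource`, `useridentity`, `sourceipaddress`, `errorcode`, `requestparameters`), with `eventtime` stored as an ISO 8601 string. The table name `cloudtrail_logs` and the function name are hypothetical. Inputs are validated before being placed in the query so that a malformed value cannot change its structure.

```python
import re

_TABLE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")
_EVENT_NAME = re.compile(r"[A-Za-z0-9]+")


def build_sagemaker_query(table, start, end, event_names=()):
    """Build an Athena SQL query for SageMaker events in a time window."""
    if not _TABLE.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    for value in (start, end):
        if not _TIMESTAMP.fullmatch(value):
            raise ValueError(f"unexpected timestamp: {value!r}")
    for name in event_names:
        if not _EVENT_NAME.fullmatch(name):
            raise ValueError(f"unexpected event name: {name!r}")

    lines = [
        "SELECT eventtime, eventname, useridentity.arn AS principal,",
        "       sourceipaddress, errorcode, requestparameters",
        f"FROM {table}",
        "WHERE eventsource = 'sagemaker.amazonaws.com'",
        f"  AND eventtime BETWEEN '{start}' AND '{end}'",
    ]
    if event_names:
        quoted = ", ".join(f"'{name}'" for name in event_names)
        lines.append(f"  AND eventname IN ({quoted})")
    lines.append("ORDER BY eventtime")
    return "\n".join(lines)


query = build_sagemaker_query(
    "cloudtrail_logs",  # hypothetical table name
    "2024-05-01T00:00:00Z",
    "2024-05-02T00:00:00Z",
    ["CreateNotebookInstance", "CreateEndpoint"],
)
```

If the table is partitioned (for example, by region or date), adding the partition columns to the WHERE clause keeps the amount of scanned data down.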
Strongly Recommended
- Deploy in an isolated VPC
- Use VPC endpoint to access resources
- Use security groups and NACLs to control traffic into and out of your environment
- Enable inter-container traffic encryption for training jobs that run on several compute instances
- Enable encryption at rest using KMS
- Disable root access to notebooks if it is not needed
- Lifecycle Configuration Best Practices
- Create an allow list of packages that the team can use to reduce the risk of malicious code running
- Apply least privilege using IAM roles and resource-based policies (e.g., bucket policies for accessing S3 bucket data), and leverage ML Governance
- Use IAM Identity Center
- Store and rotate credentials in Secrets Manager
- Monitor model input and output using SageMaker Model Monitor
- Enable CloudTrail S3 data events logging for S3 data and model artifacts auditing
- Enable SageMaker Experiments and leverage version control for model artifacts
- Enable VPC Flow Logs to monitor network traffic in your VPC
- Use CodeArtifact to download libraries/packages needed from the internet
- CloudWatch can also be used to monitor SageMaker
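To illustrate the package allow list recommendation, the Python sketch below checks the lines of a pip-style requirements file against an approved set and reports anything that is not on it. It is a simplified example: the allow list contents and function names are hypothetical, package names are compared after lowercasing and collapsing `-`, `_`, and `.`, and any line that does not begin with a package name (such as a pip option or a URL) is reported for manual review instead of being parsed.

```python
import re

_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def normalize(name):
    """Lowercase a package name and collapse runs of -, _ and . into one hyphen."""
    return re.sub(r"[-_.]+", "-", name).lower()


def disallowed_packages(requirement_lines, allow_list):
    """Return requirement entries whose package is not on the allow list."""
    allowed = {normalize(name) for name in allow_list}
    flagged = []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = _NAME.match(line)
        if match is None:
            flagged.append(line)  # option or URL line: review by hand
        elif normalize(match.group(0)) not in allowed:
            flagged.append(match.group(0))
    return flagged


# Hypothetical allow list and requirements for illustration.
APPROVED = ["numpy", "scikit-learn", "pandas"]
requirements = [
    "numpy==1.26.4",
    "Scikit_Learn>=1.4",
    "# internal tools",
    "",
    "evil-miner",
    "--extra-index-url https://example.com/simple",
]

flagged = disallowed_packages(requirements, APPROVED)
```

A check like this could run in a review step before a lifecycle configuration or notebook image is approved; it complements, and does not replace, sourcing packages through a controlled repository.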
Encouraged