From 774204284092c84d1a60b9f59cadd9b80eaf5fbd Mon Sep 17 00:00:00 2001
From: dlpzx <71252798+dlpzx@users.noreply.github.com>
Date: Thu, 25 Apr 2024 14:24:15 +0200
Subject: [PATCH] Documentation in GitHub pages for release 2.4.0 (#1191)

### Feature or Bugfix
- Feature

### Detail
- Add details about multi-region pivot roles related to #1064
- Add details about SSM parameter account settings related to #1154
- Add details on all ECS tasks and commands on how to execute on demand task such as the one in #1151

### Relates
- #1064
- #1154
- #1151

### Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

`n/a`
- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)?
  - Is the input sanitized?
  - What precautions are you taking before deserializing the data you consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use a standard proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
---
 pages/architecture.md      | 57 ++++++++++++++++++++++++++++++++++----
 pages/deploy/deploy_aws.md | 38 ++++++++++++++-----------
 2 files changed, 73 insertions(+), 22 deletions(-)

diff --git a/pages/architecture.md b/pages/architecture.md
index c88e7627f..5f025253e 100644
--- a/pages/architecture.md
+++ b/pages/architecture.md
@@ -275,6 +275,35 @@ Linux base image, and does not rely on Dockerhub.
 Docker images are built with AWS CodePipeline and stored on Amazon ECR which ensures image availability, and
 vulnerabilities scanning.
+The following table includes an overview of the different ECS task definitions deployed in data.all.
+
+
+| ECS task | Trigger | Module | Description |
+|-----------------|---------|------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| cdkproxy | on-demand (by backend) | core | It deploys CDK stacks in data.all Environment accounts (e.g. Environments, Datasets, Notebooks...) |
+| stacks-updater | scheduled (daily) | core | It updates all Environment and Dataset stacks |
+| catalog-indexer | scheduled (every 6 hours) | core | It indexes new tables and data items in the data.all central catalog |
+| tables-syncer | scheduled (every 15 mins) | datasets | It syncs tables in the Glue Catalog with the metadata of tables in data.all |
+| subscriptions | scheduled (every 15 mins) | datasets | It retrieves data from shared items and posts it in an SNS topic |
+| share-manager | on-demand (by backend) | dataset_sharing | It executes data shares in source and target accounts (bucket sharing, table sharing, folder sharing) |
+| share-verifier | scheduled (weekly) | dataset_sharing | It verifies all shared items and updates their health status. |
+| share-reapplier | on-demand (manually by data.all admins) | dataset_sharing | It reapplies all unhealthy shared items in data.all. It can be used by data.all admins in case an upgrade or any other unforeseen event damages the current shares. |
+
+**Trigger an ECS task manually**
+Exceptionally, data.all admins might need to trigger some of these ECS tasks manually. They can do so directly from the
+AWS Console, making sure they select the correct networking parameters which, as shown in the following commands, can be obtained from SSM Parameter Store.
+```bash
+export cluster_name=$(aws ssm get-parameter --name /dataall//ecs/cluster/name --output text --query 'Parameter.Value')
+export private_subnets=$(aws ssm get-parameter --name /dataall//ecs/private_subnets --output text --query 'Parameter.Value')
+export security_groups=$(aws ssm get-parameter --name /dataall//ecs/security_groups --output text --query 'Parameter.Value')
+export task_definition=$(aws ssm get-parameter --name /dataall//ecs/task_def_arn/stacks_updater --output text --query 'Parameter.Value')
+network_config="awsvpcConfiguration={subnets=[$private_subnets],securityGroups=[$security_groups],assignPublicIp=DISABLED}"
+cluster_arn="arn:aws:ecs:::cluster/$cluster_name"
+aws ecs run-task --task-definition "$task_definition" --cluster "$cluster_arn" --launch-type FARGATE --network-configuration "$network_config" --propagate-tags TASK_DEFINITION
+
+```
+
+
 ### Amazon Aurora
 data.all uses Amazon Aurora serverless – PostgreSQL version to persist the application metadata. For example, for each
 data.all concept (data.all environments, datasets...) there is a table in the Aurora database. Additional tables
@@ -332,19 +361,35 @@ performance from actual user sessions in near real time.
 ## Linked Environments
 Environments are workspaces where one or multiple teams can work. They are the door between our users in data.all and AWS, that is
-why we say that we "link" environments because we link each environment to **ONE** AWS account, in one specific region.
+why we say that we "link" environments because we link each environment to **ONE** AWS account, in **ONE** specific region.
 Under each environment we create other data.all resources, such as datasets, pipelines and notebooks.
-For the deployment of
-CloudFormation stacks we call upon a CDK trust policy between the Deployment account and the Environment account.
+For the deployment of CloudFormation stacks we call upon a CDK trust policy between the Deployment account and the Environment account.
 As for the SDK calls, from the deployment account we assume a certain IAM role in the environment accounts, the pivotRole.
+This role can only be assumed by the backend IAM roles using an externalId.
-Consequently, to link one AWS account with an environment, the account must verify two
-conditions:
+To link one AWS account with an environment, the account must verify the following conditions:
 1. AWS account is bootstrapped with CDK and is trusting data.all deployment account.
-2. pivotRole IAM role is created on the AWS account and trusts data.all deployment account.
+2. (optional) If the cdk.json parameter `enable_pivot_role_auto_create` is set to `False`, users need to manually create a pivotRole IAM role with the template provided in the UI.
+
+### Pivot role options
+The pivot role is a key piece of the data.all architecture: it is the role assumed by the backend components to carry out
+AWS actions in the Environment accounts. There are 3 ways of configuring the pivot role:
+- 1. **Manual pivot role**: Users need to manually create
+the IAM role in each of the environment accounts. Whenever new features are introduced and the pivot role needs to be updated, users need to
+manually update the pivot role template. IAM policies cannot be scoped down to new resources imported in data.all.
+- 2. **Single-region CDK pivot role**: The pivot role is created and updated as part of the environment CDK stack. Users do not need to perform any actions. The IAM policies of the role can be scoped down to imported resources. Only one region per AWS account can be linked to data.all as an environment.
+- 3. **Multi-region CDK pivot role**: Same as the single-region CDK pivot role, but it allows users to create multiple environments in the same AWS account. Only one environment per region can be created.
+
+**Recommendation**: We strongly recommend that users avoid manual pivot roles. Between the single-region and multi-region options, we recommend the multi-region pivot role because it supports both the single-region and the multi-region use case.
+ | Type | IAM Role Name | cdk.json | config.json |
+ |-----------|---------------------------------|-------------------------------------------|---------------------------------------------------------------|
+ | Manual pivot role | `dataallPivotRole` | `enable_pivot_role_auto_create` = `False` | Not applicable |
+ | Single-region CDK pivot role | `dataallPivotRole-cdk` | `enable_pivot_role_auto_create` = `True` | `cdk_pivot_role_multiple_environments_same_account` = `False` |
+ | Multi-region CDK pivot role | `dataallPivotRole-cdk-` | `enable_pivot_role_auto_create` = `True` | `cdk_pivot_role_multiple_environments_same_account` = `True` |
 ![archi](img/architecture_linked_env.drawio.png#zoom#shadow)
diff --git a/pages/deploy/deploy_aws.md b/pages/deploy/deploy_aws.md
index d9b6d68c0..4d9772723 100644
--- a/pages/deploy/deploy_aws.md
+++ b/pages/deploy/deploy_aws.md
@@ -483,7 +483,8 @@ the different configuration options.
         },
         "core": {
             "features": {
-                "env_aws_actions": true
+                "env_aws_actions": true,
+                "cdk_pivot_role_multiple_environments_same_account": false
             }
         }
     }
@@ -576,23 +577,26 @@ In addition to disabling / enabling, some module features allow for additional c
 | custom_confidentiality_mapping | datasets | Provides custom confidentiality mapping json which maps your custom confidentiality levels to existing data.all confidentiality
For e.g. ```custom_confidentiality_mapping : { "Public" : "Unclassified", "Private" : "Official", "Confidential" : "Secret", "Very Highly Confidential" : "Secret"}```
This will display confidentiality levels - Public, Private, Confidential & Very Highly Confidential - in the confidentiality drop down and maps it existing confidentiality levels in data.all - Unclassified, Official and Secret |
-### Disable core features
+### Disable and customize core features
 In some cases, customers need to disable features that belong to the core functionalities of data.all. One way to restrict
 a particular feature in the core is to add it to the core section of the `config.json` and enable/disable it.
 ```json
     "core": {
         "features": {
-            "env_aws_actions": true
+            "env_aws_actions": true,
+            "cdk_pivot_role_multiple_environments_same_account": false
         }
     }
 ```
-This is the list of core features that can be switched on/off at the moment. Take it as an example if you need to
-disable any other core feature.
+This is the list of core features that can currently be customized. Take it as an example if you need to
+disable or modify the behavior of any other core feature.
+
+| **Feature** | **Module** | **Description** |
+|-----------------------|----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| env_aws_actions | environments | If set to True, users can get AWS Credentials and assume Environment Group IAM roles from data.all's UI |
+| cdk_pivot_role_multiple_environments_same_account | environments | If set to True, the pivot role created by CDK as part of the environment stack is region-specific (`dataallPivotRole-cdk-`). This feature allows users to create multiple data.all environments in the same account but in different regions. |
-| **Feature** | **Module** | **Description** |
-|-----------------------|----------------|----------------------------------------------------------------------------------|
-| env_aws_actions | environments | Get AWS Credentials and assume Environment Group IAM roles from data.all's UI |
 ## 8. Run CDK synth and check cdk.context.json
 Run `cdk synth` to create the template that will be later deployed to CloudFormation.
@@ -754,14 +758,16 @@ When setting the value to `false`, backend resources become smaller but you save
 These are the resources affected:
-| Backend Service |prod_sizing| Configuration
-|-----------------|-----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-|Aurora |true | - Deletion protection enabled
- Backup retention of 30 days
- Paused after 1 day of inactivity
- Max capacity unit of 16 ACU
- Min capacity unit of 4 ACU |
-|Aurora |false | - Deletion protection disabled
- No backup retention
- Paused after 10 mintes of inactivity
- Max capacity unit of 8 ACU
- Min capacity unit of 2 ACU |
-|OpenSearch |true | - The KMS key of the OpenSearch cluster is kept when the CloudFormation stack is deleted
- Cluster configured with 3 master node and 2 data nodes
- Each data node has an EBS volume of 30GiB attached to it |
-|OpenSearch |false | - The KMS key of the OpenSearch cluster gets deleted when the CloudFormation stack is deleted
- Cluster configured with 0 master node and 2 data nodes
- Each data node has an EBS volume of 20GiB attached to it |
-|Lambda function |true | - Lambda functions are configured with more memory |
-|Lambda function |false | - Lambda functions are configured with less memory |
+| Backend Service | prod_sizing | Configuration |
+|---------------------|-------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Aurora | true | - Deletion protection enabled
- Backup retention of 30 days
- Paused after 1 day of inactivity
- Max capacity unit of 16 ACU
- Min capacity unit of 4 ACU |
+| Aurora | false | - Deletion protection disabled
- No backup retention
- Paused after 10 minutes of inactivity
- Max capacity unit of 8 ACU
- Min capacity unit of 2 ACU |
+| OpenSearch | true | - The KMS key of the OpenSearch cluster is kept when the CloudFormation stack is deleted
- Cluster configured with 3 master nodes and 2 data nodes
- Each data node has an EBS volume of 30GiB attached to it |
+| OpenSearch | false | - The KMS key of the OpenSearch cluster gets deleted when the CloudFormation stack is deleted
- Cluster configured with 0 master nodes and 2 data nodes
- Each data node has an EBS volume of 20GiB attached to it |
+| Lambda function | true | - Lambda functions are configured with more memory |
+| Lambda function | false | - Lambda functions are configured with less memory |
+| SSM Parameter Store | true | - SSM in the AWS account is configured for high throughput (check [service quotas](https://docs.aws.amazon.com/general/latest/gr/ssm.html#limits_ssm)) |
+| SSM Parameter Store | false | - SSM in the AWS account is configured for default throughput (check [service quotas](https://docs.aws.amazon.com/general/latest/gr/ssm.html#limits_ssm)) |
 ### I used the wrong accounts or made another mistake in the deployment. How do I un-deploy data.all?
 In the above steps we are only deploying data.all tooling resources. Hence, if the CI/CD CodePipeline pipeline has not