Deploying Enterprise-grade Azure Databricks environment using Infrastructure as Code aligned with Anti-Data-Exfiltration Reference architecture
In sample 1, we focused on deploying a basic Azure Databricks environment with relevant services, such as an Azure Storage account and Azure Key Vault, provisioned. However, an enterprise-grade deployment of Databricks demands securing the environment to meet the organizational guardrails around cybersecurity and data protection.
In this sample we focus on hardening the security around the Azure Databricks environment by implementing the following:
1. Azure Virtual Network - to achieve network isolation.
2. Hub-and-spoke network topology - to implement perimeter networks.
3. Azure Private Link - to secure connectivity with dependent PaaS services.
This sample is also aligned with the implementation pattern published by Databricks for Data Exfiltration Protection with Azure Databricks.
The sample automates the provisioning of the required services and configurations using the Infrastructure as Code pattern.
The following list captures the scope of this sample:
- Provision an enterprise-grade, secure Azure Databricks environment using ARM templates orchestrated by a shell script.
- The following services will be provisioned as a part of this setup:
  - Azure Databricks data plane configured with no public IPs (NPIP), deployed within an Azure spoke VNet.
  - Azure hub VNet peered with the spoke VNet.
  - Azure Firewall deployed into the hub VNet, configured to allow traffic only from the Azure Databricks control plane as per the IP routes published here.
  - Azure Public IP for the firewall.
  - Azure Route Tables with UDRs (User Defined Routes) configured to force traffic through the firewall.
  - Azure Data Lake Storage Gen2 with its ABFS endpoint accessible only via Azure Private Link.
  - Azure Key Vault, to store secrets and access tokens, accessible only via Azure Private Link.
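As a rough illustration of the UDR mechanism used here, the following Azure CLI sketch creates a route table whose default route sends all outbound traffic to the firewall. The resource name and the firewall's private IP (10.0.1.4) are placeholders; in this sample the equivalent configuration is applied by the ARM templates.

```bash
# Hypothetical names and addresses for illustration only.
RG="contoso-dbx-rg"

# Create a route table for the Databricks subnets.
az network route-table create \
  --resource-group "$RG" \
  --name "databricks-udr"

# Default route: send all outbound traffic to the Azure Firewall's
# private IP in the hub VNet (next hop type: virtual appliance).
az network route-table route create \
  --resource-group "$RG" \
  --route-table-name "databricks-udr" \
  --name "to-firewall" \
  --address-prefix "0.0.0.0/0" \
  --next-hop-type VirtualAppliance \
  --next-hop-ip-address "10.0.1.4"
```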
Details about how to use this sample can be found in the later sections of this document.
The architecture of the solution is aligned with the security baselines for Azure Databricks. The following diagram captures the high-level design.
In this sample, a shell script is used to orchestrate the deployment. The following diagram illustrates the deployment process flow.
The following cloud design patterns are used by this sample:
- External Configuration Store pattern: Configuration for the deployment is persisted externally in a parameter file, separate from the deployment script.
- Federated Identity pattern: Azure Active Directory is used as the federated identity store to enable seamless integration with enterprise identity providers.
- Valet Key pattern: Azure Key Vault is used to manage the secrets and access tokens used by the services.
- Gatekeeper pattern: The Azure Firewall acts as a gatekeeper for all external traffic flowing in.
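As a minimal sketch of the External Configuration Store pattern, a deployment script can read its settings from a parameter file rather than hard-coding them. The file name `params.dev.json`, its keys, and the use of `jq` are assumptions for illustration, not this sample's actual file layout:

```bash
# Read deployment settings from an external parameter file (assumed layout).
PARAM_FILE="params.dev.json"
PREFIX=$(jq -r '.deploymentPrefix' "$PARAM_FILE")

# Pass the same file to the ARM deployment so the script and the
# template share a single configuration source.
az deployment group create \
  --resource-group "${PREFIX}-rg" \
  --template-file "main.json" \
  --parameters @"$PARAM_FILE"
```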
The following technologies are used to build this sample:
- Azure Databricks
- Azure Storage
- Azure Key Vault
- Azure Virtual Network
- Azure Firewall
- Azure Route Tables
- Azure Public IP
- Azure Private Link
- Azure CLI
- Azure Resource Manager
This section highlights key pointers to align the services deployed in this sample to Microsoft Azure's Well-Architected Framework (WAF).
This sample implementation focuses on securing the Azure Databricks environment against data exfiltration, aligning it with the best practices defined in the Security pillar of Microsoft Azure's Well-Architected Framework. The following is additional guidance related to securing the Azure Databricks environment:
- Ensure the right privileges are granted to the provisioned resources.
- Conduct regular audits to ensure ongoing vigilance.
- Automate the execution of the deployment script and restrict the privileges to service accounts.
- Integrate with the secure identity provider (Azure Active Directory).

Cost Optimization

- Before the deployment, use the Azure pricing calculator to determine the expected usage cost.
- Select the storage redundancy option appropriately.
- Leverage Azure Cost Management and Billing to track the usage cost of the Azure Databricks and Storage services.
- Use Azure Advisor to optimize deployments by leveraging its smart insights.
- Use Azure Policy to define guardrails around deployment constraints to regulate the cost.

Operational Excellence

- Ensure that the parameters passed to the deployment scripts are validated.
- Leverage parallel resource deployment wherever possible; within the scope of this sample, all three resources can be deployed in parallel (see the sketch after this list).
- Validate compensating transactions for the deployment workflow to reverse partially provisioned resources if the provisioning fails.

Performance Efficiency

- Understand the billing for metered resources provisioned as a part of this sample.
- Track deployment logs to monitor execution times and mine possibilities for optimization.

Reliability

- Define the availability requirements before the deployment and configure the Storage and Databricks services accordingly.
- Ensure the required capacity and services are available in the targeted regions.
- Test the compensating-transaction logic by explicitly failing a service deployment.
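As a hedged sketch of the parallel-deployment suggestion above: in a bash orchestrator, independent ARM deployments can run as background jobs with a final `wait`. The template file names below are placeholders, not the actual files in this sample.

```bash
# Launch independent ARM deployments concurrently (placeholder file names).
for template in storage.json keyvault.json network.json; do
  az deployment group create \
    --resource-group "$AZURE_RESOURCE_GROUP_NAME" \
    --template-file "$template" &
done

# Block until every background deployment has finished.
wait
```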
This section contains usage instructions for this sample.
The following are the prerequisites for deploying this sample:
- GitHub account
- Azure account
  - Permissions needed: the ability to create and deploy to an Azure resource group, create a service principal, and grant the Contributor role to the service principal over the resource group.
  - Active subscription with the following resource providers enabled (see the registration sketch after this list):
    - Microsoft.Databricks
    - Microsoft.DataLakeStore
    - Microsoft.Storage
    - Microsoft.KeyVault
    - Microsoft.Network
- Azure CLI installed on the local machine
  - Installation instructions can be found here.
  - For Windows users:
    - Option 1: Windows Subsystem for Linux
    - Option 2: Use the dev container published here as a host for the bash shell.
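If any of the resource providers listed above are not yet enabled on the subscription, one way to register them is shown below; registration is asynchronous, so the state can be polled afterwards.

```bash
# Register the resource providers required by this sample.
for ns in Microsoft.Databricks Microsoft.DataLakeStore Microsoft.Storage \
          Microsoft.KeyVault Microsoft.Network; do
  az provider register --namespace "$ns"
done

# Registration is asynchronous; check the state of each provider, e.g.:
az provider show --namespace Microsoft.Databricks --query registrationState
```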
IMPORTANT NOTE: As with all Azure deployments, this will incur associated costs. Remember to tear down all related resources after use to avoid unnecessary costs. See here for a list of deployed resources.
The following are the steps to deploy this sample:

- Fork and clone this repository. Navigate to (`cd`) `single_tech_samples/databricks/sample2_enterprise_azure_databricks_environment/`.
- The sample depends on the following environment variables being set before the deployment script is run (see the example after these steps):
  - `DEPLOYMENT_PREFIX` - Prefix for the resource names which will be created as a part of this deployment.
  - `AZURE_SUBSCRIPTION_ID` - Subscription ID of the Azure subscription where the resources should be deployed.
  - `AZURE_RESOURCE_GROUP_NAME` - Name of the containing resource group.
  - `AZURE_RESOURCE_GROUP_LOCATION` - Azure region where the resources will be deployed (e.g. australiaeast, eastus).
  - `DELETE_RESOURCE_GROUP` - Flag to indicate whether the clean-up step should delete the resource group.
- Optional step: The firewall rules configured in this sample follow the Microsoft Azure documentation captured here, and the routes are configured for the West US region. To deploy this sample in any other Azure region, alter the ARM template for the firewall with the IP addresses for the target region.
- Run `./deploy.sh`.

Note: The script will prompt you to log in to the Azure account for authorization to deploy resources.
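For example, the variables can be exported in the shell before invoking the script; all values below are placeholders.

```bash
# Placeholder values; replace with your own before running.
export DEPLOYMENT_PREFIX="mydbxsample2"
export AZURE_SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
export AZURE_RESOURCE_GROUP_NAME="mydbx-sample2-rg"
export AZURE_RESOURCE_GROUP_LOCATION="westus"
export DELETE_RESOURCE_GROUP="true"

./deploy.sh
```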
The script validates the ARM templates and the environment variables before deploying the resources, and displays the status of each stage of the deployment while it executes. The following screenshot displays the log for a successful run:
Note: `DEPLOYMENT_PREFIX` for this deployment was set to `lumussample2`.
The following resources will be deployed as a part of this sample once the script is executed:
1. Azure Databricks workspace.
2. Azure Storage account with hierarchical namespace enabled.
3. Azure Key Vault with all the secrets configured.
4. Azure Virtual Networks with VNet peering between the hub and spoke VNets.
5. Azure Firewall with rules configured.
6. Azure Public IP address associated with the firewall.
7. Azure Route Table with routes configured.
8. Azure Network Security Group.
9. Azure Private Links.
NOTE: Configuring private links requires provisioning of network interface cards and private DNS zones. The following screenshot illustrates the resources configured for the two private links: one for Storage and another for Key Vault.
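A quick way to review everything the script provisioned (resource names will vary with your `DEPLOYMENT_PREFIX`):

```bash
# List every resource in the sample's resource group as a table.
az resource list \
  --resource-group "$AZURE_RESOURCE_GROUP_NAME" \
  --output table
```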
The following steps can be performed to validate the correct deployment of this sample:
- Users with appropriate access rights should be able to:
  - Launch the workspace from the Azure portal.
  - Access the control plane for the storage account and Key Vault through the Azure portal.
  - View the secrets configured in the Azure Key Vault.
  - View deployment logs in the Azure resource group.
- The Key Vault and storage account are not accessible from outside the Azure Databricks workspace.
- Changing the firewall rules to deny traffic from the control plane will prevent the Azure Databricks cluster from functioning.
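As a rough sketch of how the network isolation could be spot-checked from the command line (the storage account name is a placeholder; exact results depend on where the commands are run):

```bash
# From a host outside the VNets, the ABFS endpoint should not resolve to
# a private IP; from inside the spoke VNet it should (via the private link).
nslookup "<storageaccount>.dfs.core.windows.net"

# Confirm the private endpoints were created in the resource group.
az network private-endpoint list \
  --resource-group "$AZURE_RESOURCE_GROUP_NAME" \
  --output table
```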
The clean-up script can be executed to clean up the resources provisioned in this sample. The following are the steps to execute the script:

- Navigate to (`cd`) `single_tech_samples/databricks/sample2_enterprise_azure_databricks_environment/`.
- Run `./destroy.sh`.
The following screenshot displays the log for a successful clean-up run:
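If the clean-up script cannot be used for any reason, the entire resource group can be removed manually; note that this deletes every resource inside it.

```bash
# Manual fallback: delete the whole resource group and everything in it.
az group delete --name "$AZURE_RESOURCE_GROUP_NAME" --yes --no-wait
```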
Cluster provisioning and enabling data access on a pre-provisioned Azure Databricks Workspace