Azure Data Share setup requires a number of steps to establish the connection between the source data and the destination. One of those steps is sending an invitation from a source data share account and accepting the invitation in a destination data share account.
Through the portal UI, invitations can only be sent to email addresses and that requires the email recipient to perform some manual steps to accept the invitation and map the incoming data to the destination. However, the Azure Data Share SDK allows invitations to be sent to service principals as well, which opens up the opportunity to fully automate the process, even between different subscriptions and tenants.
This code illustrates how to perform a fully automated data sharing process between two pre-existing Data Share accounts.
- Working with the sample
- Running the sample
- Removing the sample assets
This sample includes two Python scripts and an Azure Function. The source script:
- Creates a share in a data share account
- Sets up a dataset from an ADLS Gen2 storage account file system
- Creates a synchronization schedule for the share
- Sends out an invitation to a service principal on a specific Azure AD tenant

The destination script:
- Lists invitations sent to the service principal
- Creates a subscription for a share
- Creates mappings for the shared datasets that point to an ADLS Gen2 storage account file system
- Enables the scheduling trigger
An Azure Function with a timer trigger is included, so the acceptance of invitations can be fully automated instead of on-demand.
This code can be found in the azure_function folder, but it's recommended to use the scripts for initial debugging and testing.
A dev container is included to allow for an easier setup of your local dev environment. It includes all the required dependencies to execute the code and is the recommended way of testing this sample. For more information, please refer to https://code.visualstudio.com/docs/devcontainers/containers.
To run the sample, the following pre-existing resources are required:
- A source Azure Data Share account
- A source Azure Storage Data Lake account (Gen2)
- A destination Azure Data Share account
- A destination Azure Storage Data Lake account (Gen2)
The source and destination assets can be created in different Azure subscriptions and tenants.
The infra folder includes Bash and PowerShell scripts to set up these four assets in a new resource group under a single subscription. The scripts also create a container in the source storage account and upload this Readme.md file to it so there is some data to be shared (a Python alternative for uploading the file is sketched after the setup commands below).
Note:
- The suffix parameter should not include special characters like '-' or '_'
- Azure CLI is required to execute these scripts
- The datashare commands in Azure CLI are still in preview at the time of this writing, so you may be prompted to install the extension when executing these scripts
# login
az login
# make sure the right subscription is selected
az account set -s <SUBSCRIPTION_ID>
# setup infrastructure using a suffix to avoid naming conflicts
./infra/bash_setup_infra.sh <suffix>
# login
az login
# make sure the right subscription is selected
az account set -s <SUBSCRIPTION_ID>
# setup infrastructure using a suffix to avoid naming conflicts
./infra/ps_setup_infra.ps1 <suffix>
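If you prefer to seed the source container from Python instead of relying on the infra scripts, the following sketch uses the azure-storage-file-datalake package. It is not part of the sample; the account name placeholder and file names follow the defaults above.

```python
# Hedged sketch (not part of the sample): upload a file to the source
# ADLS Gen2 file system so there is some data to share. Assumes the storage
# account and the "share-data" file system already exist and that your
# identity has the Storage Blob Data Contributor role on the account.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_name = "sourcestorage<suffix>"  # replace with your storage account name

service = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
file_system = service.get_file_system_client("share-data")
file_client = file_system.get_file_client("Readme.md")

with open("Readme.md", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```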
To automate the acceptance of the invitation, a service principal must be created in the destination Azure AD tenant.
az ad sp create-for-rbac --name "<insert_sp_name>"
It will output something like this:
{
  "appId": "b50dc79f-7120-46e6-8703-1ebdb0a5e66b",
  "displayName": "<service-principal-name>",
  "name": "b50dc79f-7120-46e6-8703-1ebdb0a5e66b",
  "password": "<generated-client-secret>",
  "tenant": "<your-tenant-id>"
}
The appId, password and tenant are required by the Python scripts.
Additionally, we need the objectId of the service principal, which can be obtained by running the following command:
az ad sp list --display-name <insert_sp_name> --query []."id" -o tsv
These values will be used by the scripts through environment variables:
- source.py will send the invitation to our service principal using the following environment variables:
  - DESTINATION_OBJECT_ID: objectId
  - DESTINATION_TENANT_ID: tenant
- dest.py will execute with the service principal identity so it can accept the invitation. This is accomplished by defining the following environment variables:
  - AZURE_CLIENT_ID: appId
  - AZURE_CLIENT_SECRET: password
  - AZURE_TENANT_ID: tenant

The scripts are prepared to read .env files that include these values. Detailed instructions on how to create these files can be found further down.
Note: If you used the infra scripts in this repo to create the resources, the role assignments have already been set up automatically.
- The source data share MSI must have the Storage Blob Data Reader role in the source storage account.
- The destination data share MSI must have the Storage Blob Data Contributor role in the destination storage account.
First we need to create the share, select the data we want to share and send an invitation. Those tasks are all automated in the python/source.py script.
In the python folder, add a source.env file with the following content and update the settings to match your configuration. These should point to the source storage and data share accounts.
# data share details
DATA_SHARE_AZURE_SUBSCRIPTION_ID=<subscription_id>
DATA_SHARE_RESOURCE_GROUP=data-share-automation
DATA_SHARE_ACCOUNT_NAME=source-data-share<suffix>
SHARE_NAME=test-share
DATASET_NAME=test-dataset
# storage account settings
STORAGE_AZURE_SUBSCRIPTION_ID=<subscription_id>
STORAGE_RESOURCE_GROUP=data-share-automation
STORAGE_ACCOUNT_NAME=sourcestorage<suffix>
FILE_SYSTEM_NAME=share-data
# invitation recipient
DESTINATION_TENANT_ID=<destination_tenant_id>
DESTINATION_OBJECT_ID=<destination_object_id>
These values are used in the python/source.py script.
The source script uses the DefaultAzureCredential class for authenticating on Azure. It tries several authentication mechanisms in order:
- The EnvironmentCredential is the first method tried by the DefaultAzureCredential class. It looks for the environment variables AZURE_CLIENT_ID, AZURE_CLIENT_SECRET and AZURE_TENANT_ID.
- If they aren't set, it will try to use a managed identity, which is available in a hosting environment such as Azure App Service.
- If neither is present, it will fall back to the Azure CLI login credentials, provided the az login command has been run beforehand. This is a simple way of running the script under your own Azure account identity (and if you created the assets yourself, you should already have all the required permissions and roles over the resources).
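As a rough illustration (variable names follow source.env above; the actual source.py may structure this differently), constructing the Data Share management client with DefaultAzureCredential looks something like this:

```python
# Hedged sketch: create the Data Share management client using
# DefaultAzureCredential, which tries environment variables, then a managed
# identity, then the Azure CLI login, in that order.
import os

from azure.identity import DefaultAzureCredential
from azure.mgmt.datashare import DataShareManagementClient

credential = DefaultAzureCredential()
client = DataShareManagementClient(
    credential=credential,
    subscription_id=os.environ["DATA_SHARE_AZURE_SUBSCRIPTION_ID"],
)
```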
The identity running the script must have the following permissions:
Source Data Share Account:
- Contributor
Source Storage Account:
- Contributor
- User Access Administrator
The simplest option is to run the script using your Azure CLI credentials, which requires no extra setup since the same credentials were used to create the resources.
However, if you want to run in the context of a service principal account, you first need to ensure the permissions are correctly set and then add the following values to the source.env file.
# service principal identity if using EnvironmentCredentials
AZURE_CLIENT_ID=client_id
AZURE_CLIENT_SECRET=client_secret
AZURE_TENANT_ID=tenant_id
The script is prepared to read this file if it exists and will default to using service principal credentials if these values are included.
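For illustration, assuming the .env file is loaded with python-dotenv (the exact mechanism in the scripts may differ), reading source.env before creating the credential is what lets EnvironmentCredential pick up the service principal values when they are present:

```python
# Hedged sketch: load source.env first so that AZURE_CLIENT_ID,
# AZURE_CLIENT_SECRET and AZURE_TENANT_ID (if present) are visible to
# EnvironmentCredential; otherwise DefaultAzureCredential falls back to a
# managed identity or the Azure CLI login.
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential

load_dotenv("source.env")
credential = DefaultAzureCredential()
```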
Execute the following commands:
cd python
# not required if running in the dev container
pip install -r requirements.txt
# if using az cli credentials
az login
python source.py
The script should be idempotent, so you can run it multiple times. After running it, you should see a share created in your source data share account.
A schedule should be configured on the share.
A dataset should be configured and mapped to the storage account container.
Finally, an invitation should exist for the service principal.
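For reference, sending an invitation to a service principal instead of an email address boils down to a call like the sketch below; the invitation name is hypothetical and the actual source.py may differ in the details:

```python
# Hedged sketch: invite a service principal in another tenant by targeting its
# objectId and tenant id instead of an email address (azure-mgmt-datashare).
import os

from azure.identity import DefaultAzureCredential
from azure.mgmt.datashare import DataShareManagementClient
from azure.mgmt.datashare.models import Invitation

client = DataShareManagementClient(
    DefaultAzureCredential(),
    os.environ["DATA_SHARE_AZURE_SUBSCRIPTION_ID"],
)

invitation = client.invitations.create(
    resource_group_name=os.environ["DATA_SHARE_RESOURCE_GROUP"],
    account_name=os.environ["DATA_SHARE_ACCOUNT_NAME"],
    share_name=os.environ["SHARE_NAME"],
    invitation_name="service-principal-invitation",  # hypothetical name
    invitation=Invitation(
        target_active_directory_id=os.environ["DESTINATION_TENANT_ID"],
        target_object_id=os.environ["DESTINATION_OBJECT_ID"],
    ),
)
print(invitation.invitation_status)
```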
Now, we need to accept the sent invitation, map the incoming data to the destination storage account and set up the schedule. Those tasks are automated in the python/dest.py script.
In the python folder, add a dest.env file with the following content and update the settings to match your configuration. These should point to the destination storage and data share accounts.
# data share details
DATA_SHARE_AZURE_SUBSCRIPTION_ID=<subscription_id>
DATA_SHARE_RESOURCE_GROUP=data-share-automation
DATA_SHARE_ACCOUNT_NAME=dest-data-share<suffix>
# storage account settings
STORAGE_AZURE_SUBSCRIPTION_ID=<subscription_id>
STORAGE_RESOURCE_GROUP=data-share-automation
STORAGE_ACCOUNT_NAME=deststorage<suffix>
# service principal information
AZURE_CLIENT_ID=<service_principal_app_id>
AZURE_CLIENT_SECRET=<service_principal_password>
AZURE_TENANT_ID=<service_principal_tenant>
Note: while in the source.env file we used the objectId of the service principal as the target of the invitation, here we use the appId for authentication purposes.
To automate the acceptance of the invitation sent to the service principal, you must run the dest.py script under that identity. The dest.env file must include the AZURE_CLIENT_ID, AZURE_CLIENT_SECRET and AZURE_TENANT_ID values.
This will ensure the script runs in the context of your service principal account and is able to access the sent invitation.
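Conceptually, the acceptance amounts to listing the invitations visible to that identity and turning one into a share subscription. A rough sketch follows (settings names follow dest.env above, the subscription name is hypothetical, and the real dest.py also creates the dataset mappings and enables the trigger afterwards):

```python
# Hedged sketch: list invitations sent to the current identity and accept the
# first one by creating a share subscription in the destination account.
import os

from azure.identity import DefaultAzureCredential
from azure.mgmt.datashare import DataShareManagementClient
from azure.mgmt.datashare.models import ShareSubscription

credential = DefaultAzureCredential()  # picks up AZURE_CLIENT_ID/SECRET/TENANT_ID
client = DataShareManagementClient(
    credential,
    os.environ["DATA_SHARE_AZURE_SUBSCRIPTION_ID"],
)

invitations = list(client.consumer_invitations.list_invitations())
if invitations:
    invitation = invitations[0]
    subscription = client.share_subscriptions.create(
        resource_group_name=os.environ["DATA_SHARE_RESOURCE_GROUP"],
        account_name=os.environ["DATA_SHARE_ACCOUNT_NAME"],
        share_subscription_name="test-share-subscription",  # hypothetical name
        share_subscription=ShareSubscription(
            invitation_id=invitation.invitation_id,
            source_share_location=invitation.location,
        ),
    )
    print(subscription.share_subscription_status)
```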
The service principal must have the following permissions:
Destination Data Share Account:
- Contributor
Destination Storage Account:
- Contributor
- User Access Administrator
If you want to automatically assign these permissions, you can use the bash_setup_sp_permissions.sh or ps_setup_sp_permissions.ps1 script in the infra folder:
# login
az login
# make sure the right subscription is selected
az account set -s <SUBSCRIPTION_ID>
# add permissions for the service principal objectId
./infra/bash_setup_sp_permissions.sh <suffix> <objectId>
# login
az login
# make sure the right subscription is selected
az account set -s <SUBSCRIPTION_ID>
# add permissions for the service principal objectId
./infra/ps_setup_sp_permissions.ps1 <object_id_of_service_principal>
Execute the following commands:
cd python
# not required if running in the dev container
pip install -r requirements.txt
python dest.py
After the invitation is accepted, the script can't be run again. If you need to re-run it, please create a new invitation using the source.py Python script.
After the script executes, you should see a subscription set up under Received Shares in the destination Azure Data Share account.
You can also see the incoming data being mapped to the destination storage account.
Finally, you can see the schedule set up for the incoming data share.
You can now wait for the scheduled time on the data share subscription or force a snapshot sync in the destination data share account.
Soon you will see the Readme.md file in the destination storage account, inside the mapped container.
We can take the destination script and code it as an Azure Function with a timer trigger. This way, we have a reliable way to automate the process of accepting invitations.
The azure_function folder includes the required code. To execute it locally, a local.settings.json file should be created in the azure_function folder with the following content:
{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AzureWebJobsStorage": "UseDevelopmentStorage=true",
    "DATA_SHARE_ACCOUNT_NAME": "",
    "DATA_SHARE_RESOURCE_GROUP_NAME": "",
    "DATA_SHARE_AZURE_SUBSCRIPTION_ID": "",
    "DESTINATION_STORAGE_ACCOUNT_NAME": "",
    "DESTINATION_STORAGE_RESOURCE_GROUP_NAME": "",
    "DESTINATION_STORAGE_SUBSCRIPTION_ID": "",
    "AZURE_CLIENT_ID": "",
    "AZURE_CLIENT_SECRET": "",
    "AZURE_TENANT_ID": ""
  }
}
Note: update these settings to match the values used in dest.env, pointing at the destination storage and data share accounts.
- Azure Functions Core Tools must be installed
- The Azurite extension is required for local debugging; alternatively, AzureWebJobsStorage must be configured to use a real Azure Storage account
The .vscode/launch.json file includes the required configuration to debug the Azure Function. Make sure the Debug Azure Function configuration is selected in the Run and Debug (Ctrl+Shift+D) view.
Before you execute the function locally, make sure Azurite is started by running the Azurite: Start command in Visual Studio Code.
The function will go through the same steps as the dest.py script detailed above and will achieve the same result.
Note: the function will quickly exit if no invitations are found. You can use the source.py script in the python folder to set up an invitation before running the function to test the behavior.
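For orientation, the timer-triggered entry point has the same shape as dest.py. A simplified sketch is shown below, assuming the v1 Python programming model with a function.json timer binding; the real function also performs the mapping and trigger setup.

```python
# Hedged sketch of a timer-triggered function that exits early when there are
# no pending invitations for the service principal.
import logging
import os

import azure.functions as func
from azure.identity import DefaultAzureCredential
from azure.mgmt.datashare import DataShareManagementClient


def main(mytimer: func.TimerRequest) -> None:
    client = DataShareManagementClient(
        DefaultAzureCredential(),
        os.environ["DATA_SHARE_AZURE_SUBSCRIPTION_ID"],
    )

    invitations = list(client.consumer_invitations.list_invitations())
    if not invitations:
        logging.info("No pending invitations found, exiting.")
        return

    for invitation in invitations:
        logging.info("Accepting invitation %s", invitation.invitation_id)
        # ...create the share subscription, map the datasets to the destination
        # storage account and enable the trigger, as in dest.py...
```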
Note that the function also uses the service principal created earlier. This allows the function to accept the invitation and also to authenticate against the Data Share service and the destination storage account to set up the share subscription.
As a reminder, the service principal must have the following permissions:
Destination Data Share Account:
- Contributor
Destination Storage Account:
- Contributor
- User Access Administrator
However, a good option might be to use Azure Managed Identities instead once the function is deployed to Azure. In that case, you need to perform the following steps:
- Do not include AZURE_CLIENT_ID, AZURE_CLIENT_SECRET and AZURE_TENANT_ID in the function settings. The function will then default to using the managed identity for authentication and authorization.
- Make sure the managed identity of the Azure Function has the correct permissions over the Data Share account and storage account (as defined above for the service principal).
- Modify source.py so that the invitation is sent to the objectId of the managed identity instead of the service principal. As a reminder, you can get this value by running the following command: az ad sp list --display-name <managed_identity_name> --query []."id" -o tsv
- Finally, ensure the managed identity is given the right permissions over the storage account and data share resources.
- Remember that permissions can be granted by using the bash_setup_sp_permissions.sh or ps_setup_sp_permissions.ps1 script in the infra folder.
If you used the scripts included in the infra folder to set up the infrastructure, you can simply delete the created resource group data-share-automation to clean up all the assets and avoid additional costs.
az group delete --resource-group data-share-automation