This sample demonstrates how to apply DevOps with Azure Data Factory (ADF) by source-controlling the data flows and using a CI/CD pipeline to propagate changes in the data flows to the staging and production environments.
This solution sets up an Azure Data Lake Gen2 storage account with a folder structure that enables data tiering (bronze, silver, gold), and an Azure Data Factory (ADF) instance with linked services connecting to the data lake, to a separate file share, and to a key vault for secrets.
The Azure Data Factory contains a simple pipeline taking data from the file share and ingesting it to the bronze folder.
The ADF pipeline definitions are stored in a git repository, and the CI/CD pipelines defined in Azure DevOps take the produced ARM templates and deploy them across environments. The sample also shows how to leverage pytest-adf for integration testing.
The main purpose of this sample is not to showcase the data flows, but rather how to work with source control and continuous delivery to ensure that the data flows are properly versioned, that you can re-create old datasets, and that any new changes are propagated to all environments.
The following shows the simple architecture of the Azure Data Factory Data Pipeline.
The following shows the logical flow of performing CI/CD with Azure Data Factory.
The following shows the overall CI/CD process as built with Azure DevOps Pipelines.
The following summarizes key learnings and best practices demonstrated by this sample solution:
As with other code, pipelines generated in Azure Data Factory should be backed by source control.
This has many benefits, among others:
- Allows for collaboration between team members. Ideally changes to the data flow should be done through pull requests, with code reviews, to ensure good code quality.
- Allows for versioning of data flows and returning to earlier versions to re-create a dataset.
- Ensures that the data flows are not lost even if the Azure Data Factory is deleted.
- Include all artifacts needed to build the data pipeline from scratch in source control. This includes infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures, etc.), reference/application data, data pipeline definitions, and data validation and transformation logic.
- There should be a safe, repeatable process to move changes through dev, test and finally production.
- Ensure data pipelines are functioning as expected through automated integration tests.
- Maintain a central, secure location for sensitive configuration such as database connection strings, file server keys, etc., that can be accessed by the appropriate services within the specific environment.
- In this example, we secure the secrets in one KeyVault per environment and set up a linked service in ADF to query for the secrets (see the sketch below).
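The sketch below is illustrative only: the resource, secret and connection-string names are placeholders, and a vault configured for RBAC instead of access policies would need a role assignment rather than an access policy.

```bash
# Hypothetical names -- substitute your own Key Vault, Data Factory and resource group.
KEYVAULT_NAME=mdwdo-adf-kv-dev
DATAFACTORY_NAME=mdwdo-adf-dev
RESOURCE_GROUP=mdwdo-adf-dev-rg

# Store a sensitive value as a Key Vault secret.
az keyvault secret set \
  --vault-name "$KEYVAULT_NAME" \
  --name "datalakeKey" \
  --value "<secret-value>"

# Look up the ADF system-assigned managed identity.
ADF_PRINCIPAL_ID=$(az resource show \
  --resource-group "$RESOURCE_GROUP" \
  --name "$DATAFACTORY_NAME" \
  --resource-type "Microsoft.DataFactory/factories" \
  --query identity.principalId -o tsv)

# Let the ADF managed identity read secrets through the Key Vault linked service.
az keyvault set-policy \
  --name "$KEYVAULT_NAME" \
  --object-id "$ADF_PRINCIPAL_ID" \
  --secret-permissions get list
```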
ADF has built-in git integration and stores all the pipelines, datasets, linked services, etc. as ARM templates in source control.
A typical workflow for a git-backed ADF instance looks as follows:
- Create a new branch in ADF.
- Make changes to the data flows, save the changes, and create a pull request.
- Once the pull request is approved and merged, publish the changes in ADF.
Publishing pushes all the changes to the adf_publish branch, which in turn triggers the CI/CD pipelines for promotion to staging and production.
NOTE: Only the ADF instance in the DEV environment should be backed by git. The data flows are propagated to the other environments by CI/CD pipelines.
In this sample, we have chosen to create separate folders in the data lake for the different datasets/data products. You may also want to add a separate folder or storage area for `malformed` data (data that fails validation from bronze to silver), a separate `sys` folder for scripts, libraries or other binaries, and a `sandbox` area for intermediate datasets or other assets used in the process of generating new data products.
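As a rough illustration only (the storage account and container names below are assumptions, and the sample's deployment script provisions the folder structure itself), the tiered layout could be created with the Azure CLI like this:

```bash
# Assumed storage account and container names -- adjust to your deployment.
STORAGE_ACCOUNT=mdwdoadlsdev
FILESYSTEM=datalake

# One folder per tier, plus optional areas for malformed data, scripts/binaries and sandbox work.
for dir in bronze silver gold malformed sys sandbox; do
  az storage fs directory create \
    --account-name "$STORAGE_ACCOUNT" \
    --file-system "$FILESYSTEM" \
    --name "$dir" \
    --auth-mode login
done
```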
If multiple teams work on the same data lake, one option is also to create a separate container for each team, with their own tiered structure, each team with their own Azure Data Factory to allow for separation of data, pipelines and data access.
NOTE: To allow reading and writing of data lake folders from ADF, the ADF Managed Identity needs the Storage Blob Data Reader and Storage Blob Data Contributor roles; these can be assigned at the container level.
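A hedged sketch of that role assignment with the Azure CLI, using placeholder identifiers:

```bash
# Placeholder identifiers -- replace with your subscription, resource group, storage account and container.
SUBSCRIPTION_ID=<subscription-id>
RESOURCE_GROUP=mdwdo-adf-dev-rg
STORAGE_ACCOUNT=mdwdoadlsdev
CONTAINER=datalake
ADF_PRINCIPAL_ID=<adf-managed-identity-object-id>

# Scope the role assignments to the container that holds the tiered folders.
SCOPE="/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Storage/storageAccounts/$STORAGE_ACCOUNT/blobServices/default/containers/$CONTAINER"

for role in "Storage Blob Data Reader" "Storage Blob Data Contributor"; do
  az role assignment create \
    --assignee-object-id "$ADF_PRINCIPAL_ID" \
    --assignee-principal-type ServicePrincipal \
    --role "$role" \
    --scope "$SCOPE"
done
```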
Both the Build and Release Pipelines are built using Azure DevOps.
- Dev - the DEV resource group is used by developers to build and test their solutions.
- Stage - the STG resource group is used to test deployments in a production-like environment prior to going to production. Any integration tests are run in this environment.
- Production - the PROD resource group is the final production environment.
Each environment has an identical set of resources.
When a data flow developer clicks publish in ADF, this publishes the ARM template to the adf_publish branch which kicks off the CI/CD pipeline.
In the CI/CD pipeline, the following occurs:
- Stop any existing ADF triggers in STG
- Publish the ARM template to the STG environment
- Restart any ADF triggers
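The release pipeline performs these steps with Azure DevOps tasks. The following is only an equivalent sketch using the Azure CLI, with assumed factory and trigger names and the template file names that ADF generates on the adf_publish branch:

```bash
# Requires the datafactory extension: az extension add --name datafactory
# Assumed names -- the real pipeline parameterises these per environment.
RESOURCE_GROUP=mdwdo-adf-stg-rg
FACTORY_NAME=mdwdo-adf-stg
TRIGGER_NAME=dailyIngestTrigger

# 1. Stop any existing ADF triggers in STG.
az datafactory trigger stop \
  --resource-group "$RESOURCE_GROUP" \
  --factory-name "$FACTORY_NAME" \
  --name "$TRIGGER_NAME"

# 2. Publish the ARM template from the adf_publish branch, overriding environment-specific parameters.
az deployment group create \
  --resource-group "$RESOURCE_GROUP" \
  --template-file ARMTemplateForFactory.json \
  --parameters ARMTemplateParametersForFactory.json \
  --parameters factoryName="$FACTORY_NAME"

# 3. Restart the ADF triggers.
az datafactory trigger start \
  --resource-group "$RESOURCE_GROUP" \
  --factory-name "$FACTORY_NAME" \
  --name "$TRIGGER_NAME"
```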
A successful STG deployment then kicks off the PROD CI/CD pipeline.
Optional: You can go into the pipeline and add a manual trigger for production.
NOTE: All the resources that vary between environments, such as the data lake storage account, key vault, etc., are exposed as ARM parameters. The deploy-adf-job.yml replaces the relevant parameters for each environment. If you add more linked services or other resources, make sure to update this pipeline.
- Github account
- Azure Account
- Permissions needed: ability to create and deploy to an Azure resource group, create a service principal, and grant the Contributor role to the service principal over the resource group.
- Azure DevOps Project
- Permissions needed: ability to create service connections, pipelines and variable groups.
- For Windows users, Windows Subsystem For Linux
- az cli 2.6+
- az cli - application insights extension
  - To install, run `az extension add --name application-insights`
- Azure DevOps CLI
  - To install, run `az extension add --name azure-devops`
- jq
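A few quick commands to confirm the prerequisites are in place (a convenience check only, not part of the sample's scripts):

```bash
# Check the Azure CLI version (expect 2.6.0 or later) and installed extensions.
az version --query '"azure-cli"' -o tsv
az extension list --query "[].name" -o tsv   # should include application-insights and azure-devops

# Confirm jq is available and Azure DevOps defaults are set.
jq --version
az devops configure --list
```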
IMPORTANT NOTE: As with all Azure deployments, this will incur associated costs. Remember to tear down all related resources after use to avoid unnecessary costs. See here for the list of deployed resources.
This deployment was tested using WSL 2 (Ubuntu 20.04)
- Initial Setup
  - Ensure that:
    - You are logged in to the Azure CLI. To login, run `az login`.
    - Azure CLI is targeting the Azure Subscription you want to deploy the resources to. To set the target Azure Subscription, run `az account set -s <AZURE_SUBSCRIPTION_ID>`.
    - Azure CLI is targeting the Azure DevOps organization and project you want to deploy the pipelines to. To set the target Azure DevOps project, run `az devops configure --defaults organization=https://dev.azure.com/<MY_ORG>/ project=<MY_PROJECT>`.
- Fork this repository into a new Github repo.
- Set the following required environment variables:
  - GITHUB_REPO - Name of your forked github repo in the form `<my_github_handle>/<repo>`.
  - GITHUB_PAT_TOKEN - a Github PAT token. Generate one here. This requires the "repo" scope.

  Optionally, set the following environment variables:
  - RESOURCE_GROUP_LOCATION - Azure location to deploy resources. Default: `westus`.
  - AZURE_SUBSCRIPTION_ID - Azure subscription id to use to deploy resources. Default: default Azure subscription. To see your default, run `az account list`.
  - RESOURCE_GROUP_NAME_PREFIX - name of the resource group. This will automatically be appended with the environment name, for example `RESOURCE_GROUP_NAME_PREFIX-dev-rg`. Default: `mdwdo-ado-${DEPLOYMENT_ID}`.
  - DEPLOYMENT_ID - string appended to all resource names to ensure uniqueness of Azure resource names. Default: random five character string.
  - AZDO_PIPELINES_BRANCH_NAME - git branch where Azure DevOps pipelines definitions are retrieved from. Default: `main`.

  To further customize the solution, set parameters in `arm.parameters` files located in the `infrastructure` folder.
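For example, a minimal shell configuration before running the deployment might look like this (all values are placeholders):

```bash
# Required -- your fork and a Github PAT with "repo" scope.
export GITHUB_REPO="<my_github_handle>/<repo>"
export GITHUB_PAT_TOKEN="<my_github_pat_token>"

# Optional overrides; the defaults listed above apply when these are unset.
export RESOURCE_GROUP_LOCATION="westus"
export DEPLOYMENT_ID="zx4f2"
export AZDO_PIPELINES_BRANCH_NAME="main"
```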
- Deploy Azure resources
  - `cd` into the `single_tech_samples/datafactory/sample1_cicd` folder of the repo.
  - Run `./deploy.sh`.
    - After a successful deployment, you will find `.env.{environment_name}` files containing essential configuration information per environment. See here for the list of deployed resources.
  - As part of the deployment, the script updates the Azure DevOps Release Pipeline YAML definition to point to your Github repository. Commit and push these changes (see the example below).
    - This will trigger a Build and Release which will fail due to a missing `adf_publish` branch -- this is expected. The branch will be created once you've set up git integration with your DEV Data Factory and published a change.
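Committing the updated pipeline definition is plain git, for example (assuming `main` is your working branch):

```bash
# Review what the deployment script changed, then commit and push.
git status
git add -A
git commit -m "Point Azure DevOps pipelines at my fork"
git push origin main
```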
- Setup ADF git integration in DEV Data Factory
  - In the Azure Portal, navigate to the Data Factory in the DEV environment.
  - Click "Author & Monitor" to launch the Data Factory portal.
  - On the landing page, select "Set up code repository". For more information, see here.
  - Fill in the repository settings with the following:
    - Repository type: Github
    - Github Account: your_Github_account
    - Git repository name: your forked Github repository
    - Collaboration branch: main
    - Root folder: /single_tech_samples/datafactory/sample1_cicd/adf
    - Import Existing Data Factory resource to repository: Selected
    - Branch to import resource into: Use Collaboration
  - When prompted to select a working branch, select main.
IMPORTANT NOTE: Only the DEV Data Factory should be setup with Git integration. Do NOT setup git integration in the STG and PROD Data Factories.
- Trigger an initial Release
  - In the DEV Data Factory portal, click `Publish` to publish changes.
    - Publishing a change is required to generate the `adf_publish` branch, which is required by the Release pipelines.
    - Tip: In some cases, publishing succeeds but the `adf_publish` branch does not yet contain any Data Factory ARM template JSON. If that happens, in the Data Factory portal go to Manage -> Git Configuration -> Overwrite live mode, then check the `adf_publish` branch again; the ARM template should now be there.
  - In Azure DevOps, notice a new run of the Build Pipeline (mdwdo-adf-ci-artifacts) off `main`.
  - After completion, this should automatically trigger the Release Pipeline (mdwdo-adf-cd-release). This will deploy the artifacts across environments.
  - Optional. Trigger the Data Factory Pipelines per environment, either from the Data Factory portal or from the CLI as sketched below.
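The CLI alternative referenced above is a sketch only: the pipeline and factory names are assumptions, and the `datafactory` CLI extension must be installed.

```bash
# Assumed names -- check the deployed factory for the actual pipeline name.
RESOURCE_GROUP=mdwdo-adf-dev-rg
FACTORY_NAME=mdwdo-adf-dev
PIPELINE_NAME=IngestToBronze

# Start a pipeline run and capture its run id.
RUN_ID=$(az datafactory pipeline create-run \
  --resource-group "$RESOURCE_GROUP" \
  --factory-name "$FACTORY_NAME" \
  --name "$PIPELINE_NAME" \
  --query runId -o tsv)

# Check the status of the run.
az datafactory pipeline-run show \
  --resource-group "$RESOURCE_GROUP" \
  --factory-name "$FACTORY_NAME" \
  --run-id "$RUN_ID"
```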
Congratulations!! 🥳 You have successfully deployed the solution and accompanying Build and Release Pipelines.
After a successful deployment, you should have the following resources:
- In Azure, three Resource Groups (one per environment) each with the following Azure resources.
- Data Factory - with pipelines, datasets, linked services, triggers deployed and configured correctly per environment.
- Data Lake Store Gen2 and a Service Principal (SP) with Storage Contributor rights assigned.
- Blob storage and a Service Principal (SP) with Storage Contributor rights assigned.
- KeyVault with all relevant secrets stored.
- In Azure DevOps
- Two Azure Pipelines
- mdwdo-adf-cd-release - Release Pipeline
- mdwdo-adf-ci-artifacts - Build Pipeline
- Six Variable Groups - two per environment
- mdwdo-adf-release-dev
- mdwdo-adf-release-secrets-dev**
- mdwdo-adf-release-stg
- mdwdo-adf-release-secrets-stg**
- mdwdo-adf-release-prod
- mdwdo-adf-release-secrets-prod**
- Four Service Connections
- Three Azure Service Connections (one per environment) each with a Service Principal with Contributor rights to the corresponding Resource Group.
- mdwdo-adf-serviceconnection-dev
- mdwdo-adf-serviceconnection-stg
- mdwdo-adf-serviceconnection-prod
- Github Service Connection for retrieving code from Github
- mdwdo-adf-github
Notes:
- **These variable groups are currently not linked to KeyVault due to limitations of creating these programmatically. See Known Issues, Limitations and Workarounds.
The following lists some limitations of the solution and associated deployment script:
- Azure DevOps Variable Groups linked to KeyVault can only be created via the UI, cannot be created programmatically, and therefore were not incorporated in the automated deployment of the solution.
- Workaround: The deployment adds sensitive configuration as "secrets" in Variable Groups, with the downside of duplicated information. If you wish, you may manually link a second Variable Group to KeyVault to pull out the secrets. KeyVault secret names should line up with the required variables in the Azure DevOps pipelines. See here for more information.
- Azure DevOps Environments and Approval Gates can only be managed via the UI, cannot be managed programmatically, and therefore were not incorporated in the automated deployment of the solution.
- Workaround: Approval Gates can be easily configured manually. See here for more information.
If you've encountered any issues, please file a Github issue with the relevant error message and replication steps.