Demonstrates how to run non-Python custom code as a step in Azure ML pipelines. We've seen many cases where companies already have custom code they use to preprocess data before training an ML model. This sample also demonstrates how to use pytest to mock the Azure ML SDK for unit testing.
What does this sample demonstrate?

- Running a non-Python tool to process Azure ML Datasets as a step in an Azure ML pipeline, checking the return code and capturing stdout.
- Creating unit tests that use a fixture to mock the Azure ML SDK.

What doesn't this sample demonstrate?

- ML model training or deployment.
This wrapper script calls the command line tool. In the basic sample it is just a `cp` command that copies data from the input folder to the output folder of its Azure ML pipeline step:
```python
# Recursively (-r) and verbosely (-v) copy everything from the mounted
# input folder to the output folder of this pipeline step.
process = subprocess.Popen(['cp', '-r', '-v',
                            '{0}/.'.format(mount_context.mount_point),
                            step_output_path],
                           stdout=subprocess.PIPE,
                           universal_newlines=True)
```
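The sample also checks the return code and captures stdout, as listed above. A minimal sketch of how that can look, assuming `process` is the `Popen` handle created above (the sample's exact code may differ):

```python
# Wait for the tool to finish, forward its output to the step's driver
# log, and fail the step if the tool reported an error.
stdout, _ = process.communicate()
print(stdout)
if process.returncode != 0:
    raise RuntimeError(
        f"Command failed with return code {process.returncode}")
```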
So it is possible to call any tool or program that can be executed on Ubuntu Linux (the base image for Azure ML pipeline steps). The tool(s) need to be installed in the custom container image. This Dockerfile is used to build a Docker image in the Azure ML pipeline.
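For reference, a minimal sketch (Azure ML SDK v1, azureml-core) of how a custom Dockerfile can be attached to an Azure ML Environment; the environment name and Dockerfile path below are illustrative, not taken from the sample:

```python
from azureml.core import Environment

# Build the step's image from the custom Dockerfile instead of a base image.
env = Environment(name="nonpython-preprocess")   # hypothetical name
env.docker.base_image = None
# base_dockerfile takes the Dockerfile contents.
with open("Dockerfile", "r") as dockerfile:      # hypothetical path
    env.docker.base_dockerfile = dockerfile.read()
# All dependencies come from the image itself, not a conda specification.
env.python.user_managed_dependencies = True
```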
It's important to know that the input folder is mounted within the wrapper script, so you can only work on the data after this code has run:
```python
mount_context = dataset.mount()
mount_context.start()
print(f"mount_point is: {mount_context.mount_point}")
```
The mount point (folder) is stored in the attribute `mount_context.mount_point` and can be used in the command line call. Similarly, the output folder for this step is stored in `step_output_path`.
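Putting these pieces together, the wrapper's overall flow looks roughly like the sketch below; the argument flag and input dataset name are hypothetical, and the sample's actual code may differ:

```python
import argparse

from azureml.core import Run

# The step's output folder is usually passed in as a script argument.
parser = argparse.ArgumentParser()
parser.add_argument("--step_output", dest="step_output_path")  # hypothetical flag
args = parser.parse_args()
step_output_path = args.step_output_path

# Resolve the input dataset from the pipeline step's run context.
run = Run.get_context()
dataset = run.input_datasets["preprocessing_input"]  # hypothetical input name

mount_context = dataset.mount()
mount_context.start()
try:
    print(f"mount_point is: {mount_context.mount_point}")
    # ... call the command line tool on mount_context.mount_point here ...
finally:
    mount_context.stop()  # always release the FUSE mount
```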
As an example of how to extend this template, I will use this blog post about resizing images with ImageMagick:
- Adding ImageMagick to the Dockerfile for the custom preprocessing step
- Changing the command line call in the wrapper script
- Rebuilding, publishing, and running the `data_processing_os_cmd_pipeline`
### Adding ImageMagick to the Dockerfile for the custom preprocessing step
Add the installation instruction `apt-get install -y imagemagick && \` to the Dockerfile just before `apt-get clean` is called:
```dockerfile
RUN apt-get update --fix-missing && \
    apt-get install -y wget bzip2 && \
    apt-get install -y fuse && \
    apt-get install -y imagemagick && \
    apt-get clean -y && \
    rm -rf /var/lib/apt/lists/*
```
### Changing the command line call in the wrapper script
Assumption: the input pictures are all JPEGs and the output pictures should be 100x100. The input dataset is http://download.tensorflow.org/example_images/flower_photos.tgz; only the subfolder `daisy` will be resized.

Change the command to:
```python
# Resize all JPEGs in the daisy subfolder to exactly 100x100 pixels
# (the ! forces the size, ignoring aspect ratio). Because Popen bypasses
# the shell, ImageMagick expands the *.jpg wildcard itself. With many
# inputs and one output name, convert writes resized-0.jpg, resized-1.jpg, ...
process = subprocess.Popen(['convert',
                            '{0}/daisy/*.jpg'.format(mount_context.mount_point),
                            '-resize',
                            '100x100!',
                            '{0}/resized.jpg'.format(step_output_path)],
                           stdout=subprocess.PIPE,
                           universal_newlines=True)
```
test_fixtures.py is an example of how to mock the Azure ML SDK using pytest_mock. test_build_data_processing_os_cmd_pipeline.py uses the mocks to unit test the Azure ML pipeline code.
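For illustration, one possible shape of such a fixture, assuming pytest-mock is installed; the fixture name and patch target below are hypothetical and may differ from what test_fixtures.py actually contains:

```python
import pytest


@pytest.fixture
def mock_workspace(mocker):
    # Patch Workspace.get so tests never contact a real Azure ML workspace.
    workspace_get = mocker.patch("azureml.core.Workspace.get")
    workspace_get.return_value = mocker.MagicMock(name="workspace")
    return workspace_get
```

A test can then take `mock_workspace` as an argument and assert on how the pipeline code called the SDK, without any Azure credentials.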
- Whether you run this project locally or in Azure DevOps CI/CD pipelines, the code needs to get the Azure ML context for remote or offline runs (see the workspace-resolution sketch after this list). Create Azure resources as documented here.
- Review the folder structure explained here.
- Make a copy of .env.example, place it in the root of this sample, configure the variables, and rename the file to `.env`.
- Use the VSCode dev container, or install Anaconda or Miniconda and create a Conda environment by running local_install_requirements.sh.
- In VSCode, open the root folder of this sample and select the Conda environment created above as the Python interpreter.
- Publish and run Azure ML pipelines. Note that if you change the Dockerfile, set the variable `AML_REBUILD_ENVIRONMENT` in the `.env` file to `true` for Azure ML to build an updated image.
- To run the unit tests, open a terminal, activate the Conda environment for this sample, navigate to the root folder of this project, and run:

```sh
python -m pytest
```
- To publish and run Azure ML pipelines, run:

```sh
# publish the Azure ML pipeline
python -m ml_service.pipelines.build_data_processing_os_cmd_pipeline

# run the Azure ML pipeline
python -m ml_service.pipelines.run_data_processing_pipeline --aml_pipeline_name "nonpython-data-preprocessing-pipeline"
```
- To debug, run:

```sh
python -m debugpy --listen 5678 --wait-for-client ml_service/pipelines/build_data_processing_os_cmd_pipeline.py
```
In VSCode, create a launch configuration to attach to the debugger, then press F5:

```json
"configurations": [
    {
        "name": "Python: Attach",
        "cwd": "${workspaceFolder}/samples/non-python-preprocess",
        "type": "python",
        "request": "attach",
        "connect": {
            "host": "localhost",
            "port": 5678
        }
    }
]
```
- Create an Azure DevOps variable group `nonpython-preprocess-aml-vg` that contains the following variables:
| name | description |
|---|---|
| AML_COMPUTE_CLUSTER_NAME | Azure ML Compute cluster used for training |
| RESOURCE_GROUP | Azure Resource Group where the Azure ML Workspace is located |
| WORKSPACE_NAME | Azure ML Workspace name |
| WORKSPACE_SVC_CONNECTION | Service Connection to the Azure ML Workspace |
Note that you can also override the variables defined in variables-template.yml with the ones defined in this variable group. Variables defined in the variable group take precedence over variables-template.yml because of the order in which they are defined in Azure DevOps pipelines.
- Create the build agent
The build agent needs to run linting and unit tests, and call the Azure ML SDK to publish the Azure ML pipelines that process the data. To create a Docker image for the build agent, create and run a build pipeline with 00-build-agent-pipeline.yml.
- Create other pipelines
Create the remaining CI/CD pipelines defined in the devops_pipelines folder. Verify or adjust their triggers if needed. By default, they are configured to trigger on pull requests or merges to main.