non-python-preprocess

Overview

This sample demonstrates how to run non-Python custom code as a step in an Azure ML pipeline. We've seen many cases where companies already have custom code they use to preprocess data before training an ML model. The sample also demonstrates how to use pytest to mock the Azure ML SDK for unit testing.

What this sample demonstrates:

  • Running a non-Python tool to process Azure ML Datasets as a step in an Azure ML pipeline, checking its return code, and capturing its stdout.
  • Creating unit tests with a fixture that mocks the Azure ML SDK.

What this sample does not demonstrate:

  • ML model training or deployment.

Running non-Python tools to preprocess data as a step in an Azure ML pipeline

The wrapper script calls a command-line command as a subprocess. In the basic sample it is just a cp that copies data from the input folder to the output folder of its AML pipeline step.

import subprocess

# Recursively (-r) and verbosely (-v) copy everything from the mounted
# input folder to the output folder of this pipeline step.
process = subprocess.Popen(['cp',
                            '{0}/.'.format(mount_context.mount_point),
                            step_output_path, '-r', '-v'],
                           stdout=subprocess.PIPE,
                           universal_newlines=True)

This makes it possible to call any tool or program that can be executed on Ubuntu Linux (the base image for AML pipeline steps). The tool(s) need to be installed in the custom container image; this Dockerfile is used to build the Docker image for the Azure ML pipeline.

It's important to know that the input folder is mounted within the wrapper script, so you can only work on the data after this code has run:

mount_context = dataset.mount()
mount_context.start()
print(f"mount_point is: {mount_context.mount_point}")

The mount point (the input folder) is stored in the attribute mount_context.mount_point and can be used in the command-line call. Similarly, the output folder for this step is stored in step_output_path.
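
The overview mentions checking the return code and capturing stdout. A minimal sketch of that pattern, assuming the process object from the cp example above (the exact logging and error handling in the sample's wrapper script may differ):

# Sketch only: assumes `process` is the subprocess.Popen object from above.
# Read everything the tool printed and wait for it to exit.
stdout, _ = process.communicate()
print(stdout)

# Fail the pipeline step if the tool reported an error.
if process.returncode != 0:
    raise RuntimeError(
        f"Preprocessing command failed with return code {process.returncode}")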

Example: Set up image preprocessing with ImageMagick

As an example of how to extend this template, I will use this blog post about resizing images with ImageMagick.

  1. Adding ImageMagick to the Dockerfile for the custom preprocessing step
  2. Changing the command-line call in the wrapper script
  3. Rebuilding, publishing, and running the data_processing_os_cmd_pipeline

Adding ImageMagick to the Dockerfile for the custom preprocessing step

Add the installation instruction apt-get install -y imagemagick && \ to the Dockerfile just before apt-get clean is called:

RUN apt-get update --fix-missing && \
    apt-get install -y wget bzip2 && \
    apt-get install -y fuse && \
    apt-get install -y imagemagick && \
    apt-get clean -y && \
    rm -rf /var/lib/apt/lists/*

Changing the command-line call in the wrapper script

Assumption: the input pictures are all JPEG files and the output pictures should be 100x100 pixels. The input dataset is http://download.tensorflow.org/example_images/flower_photos.tgz; only the subfolder daisy will be resized.

Change the command to:

# ImageMagick expands the *.jpg wildcard itself, so no shell is needed.
# '100x100!' forces the exact size, ignoring the aspect ratio. With multiple
# input images, convert writes numbered outputs (resized-0.jpg, resized-1.jpg, ...).
process = subprocess.Popen(['convert',
                            '{0}/daisy/*.jpg'.format(mount_context.mount_point),
                            '-resize',
                            '100x100!',
                            '{0}/resized.jpg'.format(step_output_path)],
                           stdout=subprocess.PIPE,
                           universal_newlines=True)

Unit testing with Azure ML mocks

test_fixtures.py is an example of how to mock the Azure ML SDK using pytest-mock. test_build_data_processing_os_cmd_pipeline.py uses the mocks to unit test the Azure ML pipeline code.
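
The shape of such a fixture, as a rough sketch assuming the pytest-mock plugin is installed (the names below are illustrative; see test_fixtures.py for the sample's actual fixtures):

import pytest

@pytest.fixture
def aml_workspace_mock(mocker):
    # pytest-mock's `mocker` fixture patches the SDK entry point so that
    # pipeline-building code never contacts Azure during unit tests.
    return mocker.patch("azureml.core.Workspace.get")

def test_pipeline_uses_workspace(aml_workspace_mock):
    # Code under test would call Workspace.get(...) here; the mock records
    # the call so the test can assert on it, e.g.:
    # aml_workspace_mock.assert_called_once()
    pass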

Getting Started

Prerequisites

  1. Whether you run this project locally or in Azure DevOps CI/CD pipelines, the code needs an Azure ML context for remote or offline runs. Create the Azure resources as documented here.
  2. Review the folder structure explained here.

Running locally

  1. Make a copy of .env.example, place it in the root of this sample, configure the variables, and rename the file to .env.

  2. Use the VSCode dev container, or install Anaconda or Miniconda and create a Conda environment by running local_install_requirements.sh.

  3. In VSCode, open the root folder of this sample and select the Conda environment created above as the Python interpreter.

  4. Publish and run the Azure ML pipelines. Note that if you change the Dockerfile, set the variable AML_REBUILD_ENVIRONMENT in the .env file to true so that Azure ML builds an updated image.

    • To run the unit tests, open a terminal, activate the Conda environment for this sample, navigate to the root folder of this project, and run:
    python -m pytest
    • To publish and run Azure ML pipelines, run:
    # publish the Azure ML pipeline
    python -m ml_service.pipelines.build_data_processing_os_cmd_pipeline
    # run the Azure ML pipeline
    python -m ml_service.pipelines.run_data_processing_pipeline --aml_pipeline_name "nonpython-data-preprocessing-pipeline" 
    • To debug, run:
    python -m debugpy --listen 5678 --wait-for-client ml_service/pipelines/build_data_processing_os_cmd_pipeline.py

    In VSCode, create a launch configuration to attach to the debugger, and press F5:

    "configurations": [
      {
        "name": "Python: Attach",
        "cwd": "${workspaceFolder}/samples/non-python-preprocess",
        "type": "python",
        "request": "attach",
        "connect": {
          "host": "localhost",
          "port": 5678
        }
      }
    ]

CI/CD in Azure DevOps

  1. Create an Azure DevOps variable group nonpython-preprocess-aml-vg that contains the following variables:
  • AML_COMPUTE_CLUSTER_NAME: Azure ML Compute cluster used for training
  • RESOURCE_GROUP: Azure Resource Group where the Azure ML Workspace is located
  • WORKSPACE_NAME: Azure ML Workspace name
  • WORKSPACE_SVC_CONNECTION: Service Connection to the Azure ML Workspace

Note that you can also overwrite the variables defined in variables-template.yml with the ones defined in this variable group. Variables defined in the variable group take precedence over those in variables-template.yml because of the order in which they are defined in the Azure DevOps pipelines.

  2. Create the build agent

The build agent needs to run linting and unit tests, and to call the Azure ML SDK to publish the Azure ML pipelines that process the data. To create a Docker image for the build agent, create and run a build pipeline with 00-build-agent-pipeline.yml.

  3. Create other pipelines

Create the remaining CI/CD pipelines defined in the devops_pipelines folder. Verify or adjust their triggers as needed. By default, they are configured to trigger on pull requests or merges to main.