diff --git a/doc/tutorials/sagemaker/sme_deploy_model.ipynb b/doc/tutorials/sagemaker/sme_deploy_model.ipynb new file mode 100644 index 00000000..e6546d87 --- /dev/null +++ b/doc/tutorials/sagemaker/sme_deploy_model.ipynb @@ -0,0 +1,331 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "65088350-a826-4103-b57f-26377bc967b8", + "metadata": {}, + "source": [ + "# Model deployment\n", + "\n", + "In this notebook we will deploy the binary classification model created in the [previous notebook](sme_train_model.ipynb) to a real-time AWS SageMaker endpoint. We will then use the model to make predictions on a test dataset. Please refer to the SageMaker Extension User Guide for a detailed description of this process.\n", + "\n", + "Important! Please make sure you perform the last step - deletion of the endpoint. Leaving the endpoint in the cloud will incur continuous charges by AWS.\n", + "\n", + "We will be running SQL queries using JupySQL SQL Magic.\n", + "\n", + "## Prerequisites\n", + "\n", + "Prior to using this notebook the following steps need to be completed:\n", + "1. [Configure the sandbox](../sendbox_config.ipynb).\n", + "2. [Initialize the SageMaker Extension](sme_init.ipynb).\n", + "3. [Load the MAGIC Gamma Telescope data](../data/data_telescope.ipynb).\n", + "4. 
[Train the model using SageMaker Autopilot](sme_train_model.ipynb).\n", + "\n", + "## Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "383998b4-bf20-4dac-b1d3-e412be58cc9e", + "metadata": {}, + "outputs": [], + "source": [ + "from collections import UserDict\n", + "\n", + "class Secrets(UserDict):\n", + " \"\"\"This class mimics the Secret Store we will start using soon.\"\"\"\n", + "\n", + " def save(self, key: str, value: str) -> \"Secrets\":\n", + " self[key] = value\n", + " return self\n", + "\n", + "def get_value_as_attribute(self, key):\n", + " val = self.get(key)\n", + " if val is None:\n", + " raise AttributeError(f'{key} value is not defined')\n", + " return val\n", + "\n", + "Secrets.__getattr__ = get_value_as_attribute\n", + "\n", + "# For now just hardcode the configuration.\n", + "sb_config = Secrets({ \n", + " 'EXTERNAL_HOST_NAME': '192.168.124.93',\n", + " 'HOST_PORT': '8888',\n", + " 'USER': 'sys',\n", + " 'PASSWORD': 'exasol',\n", + " 'BUCKETFS_PORT': '6666',\n", + " 'BUCKETFS_USER': 'w',\n", + " 'BUCKETFS_PASSWORD': 'write',\n", + " 'BUCKETFS_USE_HTTPS': 'False',\n", + " 'BUCKETFS_SERVICE': 'bfsdefault',\n", + " 'BUCKETFS_BUCKET': 'default',\n", + " 'SCRIPT_LANGUAGE_NAME': 'PYTHON3_SME',\n", + " 'UDF_FLAVOR': 'python3-ds-EXASOL-6.0.0',\n", + " 'UDF_RELEASE': '20190116',\n", + " 'UDF_CLIENT': 'exaudfclient_py3',\n", + " 'SCHEMA': 'IDA',\n", + " 'AWS_KEY_ID': 'AKIASNN2LAKN3EYP2Y45',\n", + " 'AWS_ACCESS_KEY': 'ezgUx1qb1jaPZFyL4DyNXfdnd67a1r31zuZBRkvA',\n", + " 'AWS_REGION': 'eu-central-1',\n", + " 'AWS_ROLE': 'arn:aws:iam::166283903643:role/sagemaker-role',\n", + " 'AWS_BUCKET': 'ida-dataset-bucket',\n", + " 'AWS_CONN': 'MyAWSConn',\n", + " 'JOB_NAME': 'CLS20231102081841'\n", + "})\n", + "\n", + "EXTERNAL_HOST = f\"{sb_config.EXTERNAL_HOST_NAME}:{sb_config.HOST_PORT}\"\n", + "WEBSOCKET_URL = f\"exa+websocket://{sb_config.USER}:{sb_config.PASSWORD}\" \\\n", + " 
f\"@{EXTERNAL_HOST}/{sb_config.SCHEMA}?SSLCertificate=SSL_VERIFY_NONE\"\n", + "\n", + "S3_BUCKET_URI=f\"s3://{sb_config.AWS_BUCKET}\"" + ] + }, + { + "cell_type": "markdown", + "id": "bdce84ae-7702-4f3e-b196-a7affcc01182", + "metadata": {}, + "source": [ + "Let's bring up JupySQL and connect to the database via SQLAlchemy. Please refer to the documentation in the sqlalchemy-exasol for details on how to connect to the database using Exasol SQLAlchemy driver." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "53a337ec-4f54-4d60-bb13-cb0c1221221c", + "metadata": {}, + "outputs": [], + "source": [ + "from sqlalchemy import create_engine\n", + "\n", + "engine = create_engine(WEBSOCKET_URL)\n", + "\n", + "%load_ext sql\n", + "%sql engine" + ] + }, + { + "cell_type": "markdown", + "id": "4399dbb0-d2f0-471a-96d6-a02cdd7744ff", + "metadata": {}, + "source": [ + "## Deploy model to a SageMaker endpoint\n", + "\n", + "The script below deploys the best candidate model of the trained Autopilot job to an endpoint with specified name. The deployment SQL command additionally generates the prediction UDF script with the same name. This UDF can be used for making predictions in an SQL statement.\n", + "\n", + "\n", + "
Model deployment
\n", + "\n", + "Let's define some variables." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f69e00ed-2cb2-4b7c-8eb0-a1a79f95e552", + "metadata": {}, + "outputs": [], + "source": [ + "# Endpoint name, also the name of the generated UDF script.\n", + "ENDPOINT_NAME = \"APSPredictor\"\n", + "\n", + "# The EC2 instance type of the endpoint to deploy the Autopilot model to.\n", + "INSTANCE_TYPE = \"ml.m5.large\"\n", + "\n", + "# The initial number of instances to run the endpoint on.\n", + "INSTANCE_COUNT = 1\n", + "\n", + "# Name of the table with the test data\n", + "TEST_TABLE_NAME = \"TELESCOPE_TEST\"\n", + "\n", + "# Name of the column in the test table which is the prediction target.\n", + "TARGET_COLUMN = \"CLASS\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "028a6784-26bc-4028-9009-32725c716e06", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "EXECUTE SCRIPT {{sb_config.SCHEMA}}.\"SME_DEPLOY_SAGEMAKER_AUTOPILOT_ENDPOINT\"(\n", + " '{{sb_config.JOB_NAME}}', \n", + " '{{ENDPOINT_NAME}}', \n", + " '{{sb_config.SCHEMA}}',\n", + " '{{INSTANCE_TYPE}}', \n", + " {{INSTANCE_COUNT}}, \n", + " '{{sb_config.AWS_CONN}}', \n", + " '{{sb_config.AWS_REGION}}'\n", + ");" + ] + }, + { + "cell_type": "markdown", + "id": "42578a73-8049-4751-9d6f-eca60579f6e1", + "metadata": {}, + "source": [ + "Let's check if the script has been created. We should be able to see an entry with the same name as the endpoint in the list of UDF scripts." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9bde0b81-eb3c-4cde-8cdc-4384d976386d", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "SELECT SCRIPT_NAME, SCRIPT_TYPE \n", + "FROM SYS.EXA_ALL_SCRIPTS\n", + "WHERE SCRIPT_SCHEMA='{{sb_config.SCHEMA}}' AND SCRIPT_TYPE = 'UDF'" + ] + }, + { + "cell_type": "markdown", + "id": "26afbc93-e6c1-45f0-84e9-3f036ca9eabb", + "metadata": {}, + "source": [ + "## Make predictions via SageMaker endpoint\n", + "\n", + "Let's use the generated UDF script for making predictions on the test data.\n", + "\n", + "First we need to get a list of features to be passed to the UDF." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b939a986-3ddb-4e0a-a465-ec5c3167ced6", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql column_names <<\n", + "SELECT COLUMN_NAME\n", + "FROM SYS.EXA_ALL_COLUMNS\n", + "WHERE COLUMN_SCHEMA = '{{sb_config.SCHEMA}}' AND COLUMN_TABLE='{{TEST_TABLE_NAME}}' AND COLUMN_NAME <> UPPER('{{TARGET_COLUMN}}');" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "639757ff-ecc0-46ad-8e96-a936af5770a2", + "metadata": {}, + "outputs": [], + "source": [ + "column_names = ', '.join(f'[{name[0]}]' for name in column_names)" + ] + }, + { + "cell_type": "markdown", + "id": "966b1c84-6078-4923-ab79-4e214690261a", + "metadata": {}, + "source": [ + "Let's predict classes for the first 10 rows of the test table, just to see what the output of the UDF looks like. Remember that the first column in the input is reserved for the sample ID. Here we can just set it to zero." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "61ba6f6c-6244-46a9-b3ee-4196c2b7b9ca", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "SELECT \"{{sb_config.SCHEMA}}\".\"{{ENDPOINT_NAME}}\"(0, {{column_names}})\n", + "FROM \"{{sb_config.SCHEMA}}\".\"{{TEST_TABLE_NAME}}\"\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "id": "8370e769-b10b-4375-9b3f-ec0d13b372f1", + "metadata": {}, + "source": [ + "Now we will compute the confusion matrix for the test data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "422ece50-3706-4bcb-a2be-1736dbacd67f", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "WITH TEST_DATA AS\n", + "(\n", + " -- We take data from the test table and add the row number calling it SAMPLE_ID.\n", + " SELECT ROW_NUMBER() OVER () AS SAMPLE_ID, {{column_names}}, [{{TARGET_COLUMN}}] FROM \"{{sb_config.SCHEMA}}\".\"{{TEST_TABLE_NAME}}\"\n", + "),\n", + "MODEL_OUTPUT AS\n", + "(\n", + " -- Make predictions. We pass the SAMPLE_ID, which should be returned unchanged.\n", + " SELECT \"{{sb_config.SCHEMA}}\".\"{{ENDPOINT_NAME}}\"(SAMPLE_ID, {{column_names}})\n", + " FROM TEST_DATA\n", + ")\n", + "-- Finally, compute the confusion matrix.\n", + "SELECT predictions, [{{TARGET_COLUMN}}], COUNT(*) as count\n", + "FROM MODEL_OUTPUT INNER JOIN TEST_DATA ON MODEL_OUTPUT.SAMPLE_ID = TEST_DATA.SAMPLE_ID\n", + "GROUP BY 1, 2;" + ] + }, + { + "cell_type": "markdown", + "id": "ab428be0-c90c-440c-ab78-05816998e222", + "metadata": {}, + "source": [ + "## Delete endpoint\n", + "\n", + "It is important to delete the endpoint once we have finished working with it, to avoid unnecessary charges. The following script does that." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a12eebc0-f74c-4556-9be0-264d6d225abe", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "EXECUTE SCRIPT SME_DELETE_SAGEMAKER_AUTOPILOT_ENDPOINT(\n", + " '{{ENDPOINT_NAME}}', \n", + " '{{sb_config.AWS_CONN}}', \n", + " '{{sb_config.AWS_REGION}}'\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2cf063c3-7afa-4853-9e61-915fcfc17d21", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this set of notebooks, we went through the steps required to train, deploy and use models based on the SageMaker Autopilot with the help of the Exasol SageMaker-Extension. The advantages the SageMaker-Extension provides include simple and fast uploading of training data into S3 buckets and getting predictions with SQL queries." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/doc/tutorials/sagemaker/sme_deployment.png b/doc/tutorials/sagemaker/sme_deployment.png new file mode 100644 index 00000000..dd1ff897 Binary files /dev/null and b/doc/tutorials/sagemaker/sme_deployment.png differ diff --git a/doc/tutorials/sagemaker/sme_init.ipynb b/doc/tutorials/sagemaker/sme_init.ipynb new file mode 100644 index 00000000..dbd72945 --- /dev/null +++ b/doc/tutorials/sagemaker/sme_init.ipynb @@ -0,0 +1,303 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "9d14c2ca-1191-4a01-b3fc-e246bc4d0504", + "metadata": {}, + "source": [ + "# SageMaker Extension initialization\n", + "\n", + "Here we will perform all the necessary steps to get the SageMaker Extension functionality up and running. 
Please refer to the SageMaker Extension User Guide for details on the required initialization steps. The extension module should have already been installed during the installation of this product, therefore the first step mentioned in the guide can be skipped.\n", + "\n", + "We will be running SQL queries using JupySQL SQL Magic and `pyexasol` module.\n", + "\n", + "## Prerequisites\n", + "\n", + "Prior to using this notebook one needs to complete the follow steps:\n", + "1. [Configure the sandbox](../sendbox_config.ipynb).\n", + "\n", + "## Set up" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec6e5c66-941d-401a-9542-e1231fb34a93", + "metadata": {}, + "outputs": [], + "source": [ + "from collections import UserDict\n", + "\n", + "class Secrets(UserDict):\n", + " \"\"\"This class mimics the Secret Store we will start using soon.\"\"\"\n", + "\n", + " def save(self, key: str, value: str) -> \"Secrets\":\n", + " self[key] = value\n", + " return self\n", + "\n", + "def get_value_as_attribute(self, key):\n", + " val = self.get(key)\n", + " if val is None:\n", + " raise AttributeError(f'{key} value is not defined')\n", + " return val\n", + "\n", + "Secrets.__getattr__ = get_value_as_attribute\n", + "\n", + "# For now just hardcode the configuration.\n", + "sb_config = Secrets({ \n", + " 'EXTERNAL_HOST_NAME': '192.168.124.93',\n", + " 'HOST_PORT': '8888',\n", + " 'USER': 'sys',\n", + " 'PASSWORD': 'exasol',\n", + " 'BUCKETFS_PORT': '6666',\n", + " 'BUCKETFS_USER': 'w',\n", + " 'BUCKETFS_PASSWORD': 'write',\n", + " 'BUCKETFS_USE_HTTPS': 'False',\n", + " 'BUCKETFS_SERVICE': 'bfsdefault',\n", + " 'BUCKETFS_BUCKET': 'default',\n", + " 'SCRIPT_LANGUAGE_NAME': 'PYTHON3_SME',\n", + " 'UDF_FLAVOR': 'python3-ds-EXASOL-6.0.0',\n", + " 'UDF_RELEASE': '20190116',\n", + " 'UDF_CLIENT': 'exaudfclient_py3',\n", + " 'SCHEMA': 'IDA'\n", + "})\n", + "\n", + "EXTERNAL_HOST = f\"{sb_config.EXTERNAL_HOST_NAME}:{sb_config.HOST_PORT}\"\n", + "SCRIPT_LANGUAGES = 
f\"{sb_config.SCRIPT_LANGUAGE_NAME}=localzmq+protobuf:///{sb_config.BUCKETFS_SERVICE}/\" \\\n", + " f\"{sb_config.BUCKETFS_BUCKET}/{sb_config.UDF_FLAVOR}?lang=python#buckets/{sb_config.BUCKETFS_SERVICE}/\" \\\n", + " f\"{sb_config.BUCKETFS_BUCKET}/{sb_config.UDF_FLAVOR}/exaudf/{sb_config.UDF_CLIENT}\";\n", + "WEBSOCKET_URL = f\"exa+websocket://{sb_config.USER}:{sb_config.PASSWORD}\" \\\n", + " f\"@{EXTERNAL_HOST}/{sb_config.SCHEMA}?SSLCertificate=SSL_VERIFY_NONE\"" + ] + }, + { + "cell_type": "markdown", + "id": "71245ff5-8788-463c-97f5-8e20f33fd909", + "metadata": {}, + "source": [ + "We will add some new variables specific to the SageMaker Extension." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5dda5f1-317f-4b51-9204-d31f7b757be5", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "\n", + "# AWS access credentials\n", + "sb_config.save('AWS_KEY_ID', 'AKIASNN2LAKN3EYP2Y45')\n", + "sb_config.save('AWS_ACCESS_KEY', 'ezgUx1qb1jaPZFyL4DyNXfdnd67a1r31zuZBRkvA')\n", + "sb_config.save('AWS_REGION', 'eu-central-1')\n", + "sb_config.save('AWS_ROLE', 'arn:aws:iam::166283903643:role/sagemaker-role')\n", + "\n", + "# S3 bucket, which must exist\n", + "sb_config.save('AWS_BUCKET', 'ida-dataset-bucket')\n", + " \n", + "# Name of the AWS connection to be created in the database\n", + "sb_config.save('AWS_CONN', 'MyAWSConn')" + ] + }, + { + "cell_type": "markdown", + "id": "7381fa10-5117-474e-93a8-c3b108d0a55f", + "metadata": {}, + "source": [ + "Let's bring up JupySQL and connect to the database via SQLAlchemy. Please refer to the documentation in the sqlalchemy-exasol for details on how to connect to the database using Exasol SQLAlchemy driver." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "727a39dd-bf06-416e-acf5-90c0047328e9", + "metadata": {}, + "outputs": [], + "source": [ + "from sqlalchemy import create_engine\n", + "\n", + "engine = create_engine(WEBSOCKET_URL)\n", + "\n", + "%load_ext sql\n", + "%sql engine" + ] + }, + { + "cell_type": "markdown", + "id": "cc96d904-e54a-4abd-bcdd-fc55b96e97ca", + "metadata": {}, + "source": [ + "## Upload and activate the Script-Language-Container (SLC)\n", + "\n", + "We will start with loading the Script Language Container (SLC) specially built for the SageMaker Extension. The latest release of both the Extension and its SLC can be found here. We will use an http(s) client for that." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5c53041c-710e-4e6a-9ed8-89d376d40d33", + "metadata": {}, + "outputs": [], + "source": [ + "import tempfile\n", + "from stopwatch import Stopwatch\n", + "\n", + "# Get a temporary file name for the SLC.\n", + "_, tmp_file = tempfile.mkstemp(suffix='.tar.gz')\n", + "\n", + "# Download SLC.\n", + "stopwatch = Stopwatch()\n", + "download_command = f'curl -L -o {tmp_file} https://github.com/exasol/sagemaker-extension/releases/download/0.5.0/' \\\n", + " f'exasol_sagemaker_extension_container-release-CYEVORMGO3X5JZJZTXFLS23FZYKIKDG7MVNUSSJK6FUST5WRPZUQ.tar.gz'\n", + "! {download_command}\n", + "print(f\"Downloading the SLC took: {stopwatch}\")\n", + "\n", + "# Upload SLC into the BucketFS\n", + "stopwatch = Stopwatch()\n", + "bfs_url_prefix = \"https://\" if sb_config.BUCKETFS_USE_HTTPS.lower() == 'true' else \"http://\"\n", + "bfs_host = f'{sb_config.EXTERNAL_HOST_NAME}:{sb_config.BUCKETFS_PORT}'\n", + "upload_command = f'curl {bfs_url_prefix}{sb_config.BUCKETFS_USER}:{sb_config.BUCKETFS_PASSWORD}' \\\n", + " f'@{bfs_host}/{sb_config.BUCKETFS_BUCKET}/{sb_config.UDF_FLAVOR}.tar.gz --upload-file {tmp_file}'\n", + "! 
{upload_command}\n", + "print(f\"Uploading the SLC took: {stopwatch}\")\n", + "\n", + "# Delete SLC file on the local drive.\n", + "! rm {tmp_file}" + ] + }, + { + "cell_type": "markdown", + "id": "eab7332a-a78a-4411-b567-e9734556e35b", + "metadata": {}, + "source": [ + "We need to activate the uploaded SLC by updating the system parameter `SCRIPT_LANGUAGES`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cb68aa44-e9ad-4609-b347-158901632e2f", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "ALTER SYSTEM SET SCRIPT_LANGUAGES='{{SCRIPT_LANGUAGES}}';\n", + "ALTER SESSION SET SCRIPT_LANGUAGES='{{SCRIPT_LANGUAGES}}';" + ] + }, + { + "cell_type": "markdown", + "id": "fbfc9c58-924c-4007-8c67-6af7165f6354", + "metadata": {}, + "source": [ + "## Create objects in the database.\n", + "### Scripts\n", + "Once the SLC is installed we can upload all the required scripts into the database. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e4d40cc4-d3b7-4abc-94d7-0b81e3a57f07", + "metadata": {}, + "outputs": [], + "source": [ + "deploy_command = f\"\"\"\n", + "python -m exasol_sagemaker_extension.deployment.deploy_cli \\\n", + " --host {sb_config.EXTERNAL_HOST_NAME} \\\n", + " --port {sb_config.HOST_PORT} \\\n", + " --user {sb_config.USER} \\\n", + " --pass {sb_config.PASSWORD} \\\n", + " --schema {sb_config.SCHEMA}\n", + "\"\"\"\n", + "\n", + "print(deploy_command)\n", + "!{deploy_command}" + ] + }, + { + "cell_type": "markdown", + "id": "c0eb9568-dd20-4806-b02b-11b3f3065cf6", + "metadata": {}, + "source": [ + "Let's verify that the scripts have been created. We should see 4 new UDF scripts and 4 new Lua scripts." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d13d3754-1015-405a-81e3-76d6a61bc7a4", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "SELECT SCRIPT_NAME, SCRIPT_TYPE FROM SYS.EXA_ALL_SCRIPTS WHERE SCRIPT_SCHEMA='{{sb_config.SCHEMA}}'" + ] + }, + { + "cell_type": "markdown", + "id": "8ae7837a-c167-4f70-ad84-7e391bc927f4", + "metadata": {}, + "source": [ + "### AWS connection\n", + "\n", + "The SageMaker Extension needs to connect to AWS SageMaker and our AWS S3 bucket. For that, it needs AWS credentials with Sagemaker Execution permissions. The required credentials are AWS Access Key (Please check how to create an access key).\n", + "\n", + "In order for the SageMaker-Extension to use the Access Key we need to create an Exasol CONNECTION object which securely stores the keys. For more information, please check Exasol documentation on how to create a connection." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ae6c14ae-ea9d-4be7-ae46-89a1d75607ae", + "metadata": {}, + "outputs": [], + "source": [ + "import pyexasol\n", + "\n", + "sql = f\"\"\"\n", + "CREATE OR REPLACE CONNECTION [{sb_config.AWS_CONN}]\n", + " TO 'https://{sb_config.AWS_BUCKET}.s3.{sb_config.AWS_REGION}.amazonaws.com/'\n", + " USER {{AWS_KEY_ID!s}}\n", + " IDENTIFIED BY {{AWS_ACCESS_KEY!s}}\n", + "\"\"\"\n", + "query_params = {\n", + " \"AWS_KEY_ID\": sb_config.AWS_KEY_ID, \n", + " \"AWS_ACCESS_KEY\": sb_config.AWS_ACCESS_KEY\n", + "}\n", + "with pyexasol.connect(dsn=EXTERNAL_HOST, user=sb_config.USER, password=sb_config.PASSWORD, compression=True) as conn:\n", + " conn.execute(query=sql, query_params=query_params)" + ] + }, + { + "cell_type": "markdown", + "id": "326d3ff1-d6ae-419d-926b-a7c129b76dc5", + "metadata": {}, + "source": [ + "Now we are ready to start training a model. We will do this in the [following](sme_train_model.ipynb) notebook." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/doc/tutorials/sagemaker/sme_introduction.ipynb b/doc/tutorials/sagemaker/sme_introduction.ipynb new file mode 100644 index 00000000..96747a97 --- /dev/null +++ b/doc/tutorials/sagemaker/sme_introduction.ipynb @@ -0,0 +1,74 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "3d4712f3-3f71-487a-a8a4-5765efbb4cc7", + "metadata": {}, + "source": [ + "# Introduction\n", + "\n", + "The Exasol SageMaker Extension allows developing an end-to-end machine learning project on data stored in Exasol using the AWS SageMaker Autopilot service.\n", + "\n", + "### AWS SageMaker Autopilot Service\n", + "\n", + "AWS SageMaker is an AWS public cloud service in which users can build and deploy machine learning models. SageMaker provides a number of levels of abstraction for machine learning model development. At one of its highest levels of abstraction, SageMaker enables users to use an automated machine learning (AutoML) service, called Autopilot in AWS, that automates the process of applying machine learning to real-world problems.\n", + "\n", + "Autopilot covers the complete pipeline of developing an end-to-end machine learning project, from raw data to a deployable model. It is able to automatically build, train and tune a number of machine learning models by inspecting the input data set. 
In this way, the following tasks, which are repeatedly applied by ML experts in machine learning projects, are automated:\n", + "* Pre-process and clean the data.\n", + "* Perform feature engineering, selecting the most appropriate features.\n", + "* Determine the most appropriate ML algorithm.\n", + "* Tune and optimize model hyper-parameters.\n", + "* Post-process machine learning models.\n", + "\n", + "The Exasol SageMaker Extension builds on these capabilities of AWS Autopilot and enables users to easily create effective and efficient machine learning models without expert knowledge.\n", + "\n", + "### Exasol SageMaker Extension\n", + "\n", + "The Exasol SageMaker Extension provides a Python library together with Exasol Scripts and UDFs that train machine learning models on data stored in Exasol using the AWS SageMaker Autopilot service.\n", + "\n", + "The extension exports a given Exasol table into AWS S3, and then triggers machine learning training using the AWS Autopilot service. It provides a script for polling the training status. One way to make predictions with a trained Autopilot model is to deploy the model to a real-time AWS endpoint. The extension provides Lua scripts for creating and deleting a real-time endpoint, and it creates a model-specific UDF script for making real-time predictions. Here is a schematic picture of the solution.\n", + "\n", + "\n", + "
Solution overview
" + ] + }, + { + "cell_type": "markdown", + "id": "1d5c821c-5ef5-469c-a1fc-cb845390a8c3", + "metadata": {}, + "source": [ + "The extension requires a number of initialization steps.\n", + "Please proceed to the [extension initialization](sme_init.ipynb) notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4ef82228-2a0d-45fd-9ab1-3ba3e580839f", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/doc/tutorials/sagemaker/sme_overview.png b/doc/tutorials/sagemaker/sme_overview.png new file mode 100644 index 00000000..39ea5ee7 Binary files /dev/null and b/doc/tutorials/sagemaker/sme_overview.png differ diff --git a/doc/tutorials/sagemaker/sme_train_model.ipynb b/doc/tutorials/sagemaker/sme_train_model.ipynb new file mode 100644 index 00000000..37ecb931 --- /dev/null +++ b/doc/tutorials/sagemaker/sme_train_model.ipynb @@ -0,0 +1,399 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "4c42f7dd-9bbe-4c68-a678-d233a7bb30e8", + "metadata": {}, + "source": [ + "# Model training\n", + "\n", + "In this notebook we are going to train a binary classification model using AWS SageMaker ML Autopilot. The SageMaker Extension provides a script that starts this process. It uploads the training data into the selected S3 bucket, then creates and starts the Autopilot job. 
Please refer to the Extension User Guide for detailed description of the service.\n", + "\n", + "We will be running SQL queries using JupySQL SQL Magic.\n", + "\n", + "## Prerequisites\n", + "\n", + "Prior to using this notebook the following steps need to be completed:\n", + "1. [Configure the sandbox](../sendbox_config.ipynb).\n", + "2. [Initialize the SageMaker Extension](sme_init.ipynb).\n", + "3. [Load the MAGIC Gamma Telescope data](../data/data_telescope.ipynb).\n", + "\n", + "## Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "44b219df-e105-487f-b1e5-a6ba8ba0b283", + "metadata": {}, + "outputs": [], + "source": [ + "#TODO: start using the secret store.\n", + "\n", + "from collections import UserDict\n", + "\n", + "class Secrets(UserDict):\n", + " \"\"\"This class mimics the Secret Store we will start using soon.\"\"\"\n", + "\n", + " def save(self, key: str, value: str) -> \"Secrets\":\n", + " self[key] = value\n", + " return self\n", + "\n", + "def get_value_as_attribute(self, key):\n", + " val = self.get(key)\n", + " if val is None:\n", + " raise AttributeError(f'{key} value is not defined')\n", + " return val\n", + "\n", + "Secrets.__getattr__ = get_value_as_attribute\n", + "\n", + "# For now just hardcode the configuration.\n", + "sb_config = Secrets({ \n", + " 'EXTERNAL_HOST_NAME': '192.168.124.93',\n", + " 'HOST_PORT': '8888',\n", + " 'USER': 'sys',\n", + " 'PASSWORD': 'exasol',\n", + " 'BUCKETFS_PORT': '6666',\n", + " 'BUCKETFS_USER': 'w',\n", + " 'BUCKETFS_PASSWORD': 'write',\n", + " 'BUCKETFS_USE_HTTPS': 'False',\n", + " 'BUCKETFS_SERVICE': 'bfsdefault',\n", + " 'BUCKETFS_BUCKET': 'default',\n", + " 'SCRIPT_LANGUAGE_NAME': 'PYTHON3_SME',\n", + " 'UDF_FLAVOR': 'python3-ds-EXASOL-6.0.0',\n", + " 'UDF_RELEASE': '20190116',\n", + " 'UDF_CLIENT': 'exaudfclient_py3',\n", + " 'SCHEMA': 'IDA',\n", + " 'AWS_KEY_ID': 'AKIASNN2LAKN3EYP2Y45',\n", + " 'AWS_ACCESS_KEY': 'ezgUx1qb1jaPZFyL4DyNXfdnd67a1r31zuZBRkvA',\n", + " 
'AWS_REGION': 'eu-central-1',\n", + " 'AWS_ROLE': 'arn:aws:iam::166283903643:role/sagemaker-role',\n", + " 'AWS_BUCKET': 'ida-dataset-bucket',\n", + " 'AWS_CONN': 'MyAWSConn'\n", + "})\n", + "\n", + "EXTERNAL_HOST = f\"{sb_config.EXTERNAL_HOST_NAME}:{sb_config.HOST_PORT}\"\n", + "WEBSOCKET_URL = f\"exa+websocket://{sb_config.USER}:{sb_config.PASSWORD}\" \\\n", + " f\"@{EXTERNAL_HOST}/{sb_config.SCHEMA}?SSLCertificate=SSL_VERIFY_NONE\"\n", + "\n", + "S3_BUCKET_URI=f\"s3://{sb_config.AWS_BUCKET}\"" + ] + }, + { + "cell_type": "markdown", + "id": "81a6e4ce-f05e-4810-bb68-356d63042ad4", + "metadata": {}, + "source": [ + "We need a new unique job name. We will make it up from the timestamp." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e18e8285-9794-45cb-bfaa-d8e1f6fe45b3", + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import datetime\n", + "sb_config.save('JOB_NAME', 'CLS' + datetime.now().strftime('%Y%m%d%H%M%S'))\n", + "\n", + "# Here is the job name we are going to use in this and the following notebooks.\n", + "sb_config.JOB_NAME" + ] + }, + { + "cell_type": "markdown", + "id": "f5e952b4-8724-46f9-a15a-7de2f306883f", + "metadata": {}, + "source": [ + "Let's bring up JupySQL and connect to the database via SQLAlchemy. Please refer to the documentation of sqlalchemy-exasol for details on how to connect to the database using Exasol SQLAlchemy driver." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ae1c336a-6461-4d32-a448-160fc72baedb", + "metadata": {}, + "outputs": [], + "source": [ + "from sqlalchemy import create_engine\n", + "\n", + "engine = create_engine(WEBSOCKET_URL)\n", + "\n", + "%load_ext sql\n", + "%sql engine" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "e0753272-319f-4745-bb05-74922a4d2379", + "metadata": {}, + "source": [ + "## Start training\n", + "\n", + "Let's define a few variables for our experiment.\n", + "\n", + "Note that the path for input data should be unique for each experiment. Alternatively, all data files should be cleared after the experiment is finished. Currently this has to be done manually. The Autopilot will be using all data files found in this directory. If it contains stale files from previous experiments, then at best the training pipeline will fail; at worst, a wrong model will be built." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65fb26cd-17a9-4faa-bf47-b3ec63e0be1d", + "metadata": {}, + "outputs": [], + "source": [ + "# Path in the S3 bucket where the input data will be uploaded.\n", + "S3_OUTPUT_PATH = \"ida_dataset_path\"\n", + "\n", + "# Input table name.\n", + "INPUT_TABLE_NAME = \"TELESCOPE_TRAIN\"\n", + "\n", + "# Name of the view extending input table (see below why it is necessary).\n", + "INPUT_VIEW_NAME = \"Z_\" + INPUT_TABLE_NAME\n", + "\n", + "# Name of the column in the input table which is the prediction target.\n", + "TARGET_COLUMN = \"CLASS\"\n", + "\n", + "# The maximum number of model candidates.\n", + "MAX_CANDIDATES = 2" + ] + }, + { + "cell_type": "markdown", + "id": "9f2d7ae3-6ab9-4872-9f7f-a0b10fd91f5b", + "metadata": {}, + "source": [ + "### Prepare data\n", + "\n", + "When we use our model for making batch predictions we will need to identify samples in the batch. 
This is because the order of labeled samples in the output may not match the order of unlabeled samples in the input. For that purpose we will extend features by adding an artificial column that will be a placeholder for a sample ID. During model training we will set this column to a constant value. This should make it non-influential for the prediction.\n", + "\n", + "Future versions of the SageMaker Extension are expected to be doing this step for us.\n", + "\n", + "First, we need to get a list of features." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a79c1344-7c55-4c70-bc3c-20da40c25646", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql column_names <<\n", + "SELECT COLUMN_NAME\n", + "FROM SYS.EXA_ALL_COLUMNS\n", + "WHERE COLUMN_SCHEMA = '{{sb_config.SCHEMA}}' AND COLUMN_TABLE='{{INPUT_TABLE_NAME}}'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aec03337-68df-44e7-89ea-d59f76ce0886", + "metadata": {}, + "outputs": [], + "source": [ + "column_names = ', '.join(f'[{name[0]}]' for name in column_names)" + ] + }, + { + "cell_type": "markdown", + "id": "8aac423e-ba2a-45a5-93aa-6ec4e78d4e61", + "metadata": {}, + "source": [ + "Now let's create a view extending the input table." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41838507-4b7b-4312-8f1e-8303e75e3d62", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "CREATE OR REPLACE VIEW {{sb_config.SCHEMA}}.\"{{INPUT_VIEW_NAME}}\" AS\n", + "SELECT CAST(0 AS INT) AS SAMPLE_ID, {{column_names}} FROM {{INPUT_TABLE_NAME}}" + ] + }, + { + "cell_type": "markdown", + "id": "c0f6199a-47c1-4321-906d-dd5d88956bf3", + "metadata": {}, + "source": [ + "### Create Autopilot job\n", + "\n", + "The script below exports the data to AWS S3 bucket. This export operation is highly efficient, as it is performed in parallel. 
After that it calls Amazon SageMaker Autopilot, which automatically performs end-to-end machine learning development, to build a model. The script doesn't wait until the training is completed, which may take a while. The next script will allow us to monitor the progress of the Autopilot training pipeline.\n", + "\n", + "\n", + "
Model training with Autopilot
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25609a4d-d0b9-4e3c-abd8-3543fe49b045", + "metadata": {}, + "outputs": [], + "source": [ + "%config SqlMagic.named_parameters=True" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1766a29b-7b51-44ec-9a72-0024366a4518", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "EXECUTE SCRIPT \"{{sb_config.SCHEMA}}\".\"SME_TRAIN_WITH_SAGEMAKER_AUTOPILOT\"(\n", + "'{\n", + " \"job_name\" : \"{{sb_config.JOB_NAME}}\",\n", + " \"aws_credentials_connection_name\" : \"{{sb_config.AWS_CONN}}\",\n", + " \"aws_region\" : \"{{sb_config.AWS_REGION}}\",\n", + " \"iam_sagemaker_role\" : \"{{sb_config.AWS_ROLE}}\",\n", + " \"s3_bucket_uri\" : \"{{S3_BUCKET_URI}}\",\n", + " \"s3_output_path\" : \"{{S3_OUTPUT_PATH}}\",\n", + " \"input_schema_name\" : \"{{sb_config.SCHEMA}}\",\n", + " \"input_table_or_view_name\" : \"{{INPUT_VIEW_NAME}}\",\n", + " \"target_attribute_name\" : \"{{TARGET_COLUMN}}\",\n", + " \"max_candidates\" : {{MAX_CANDIDATES}}\n", + "}')" + ] + }, + { + "cell_type": "markdown", + "id": "0b96dfed-bfd1-4a2e-9721-f05a67347308", + "metadata": {}, + "source": [ + "We don't need the input view anymore since the data has been uploaded into an S3 bucket. Let's delete it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8ac19655-5ac1-4f88-8ff1-d8df8fb5dc9f", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "DROP VIEW {{sb_config.SCHEMA}}.\"{{INPUT_VIEW_NAME}}\"" + ] + }, + { + "cell_type": "markdown", + "id": "7c82260f-64f5-441e-a5cb-f0b65eea4649", + "metadata": {}, + "source": [ + "## Poll training status\n", + "\n", + "As mentioned above, the model training runs asynchronously. We can monitor its progress by polling the Autopilot job status. Please call this script periodically until the status is reported as `Completed`. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e6b74a02-4e2d-4a96-be01-cf9658976472", + "metadata": {}, + "outputs": [], + "source": [ + "%%sql\n", + "EXECUTE SCRIPT {{sb_config.get(\"SCHEMA\")}}.\"SME_POLL_SAGEMAKER_AUTOPILOT_JOB_STATUS\"(\n", + " '{{sb_config.JOB_NAME}}',\n", + " '{{sb_config.AWS_CONN}}',\n", + " '{{sb_config.AWS_REGION}}'\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "50e772fc-6d58-4e51-9aa2-e4230ded8f08", + "metadata": {}, + "source": [ + "Once the job status becomes `Completed`, the model is ready to be deployed and used for prediction. This will be demonstrated in the [next notebook](sme_deploy_model.ipynb)." + ] + }, + { + "cell_type": "markdown", + "id": "708fc13e-51b9-4910-bcea-10d73e6bcfa0", + "metadata": {}, + "source": [ + "## Troubleshoot the job\n", + "\n", + "If the job fails, the code below may help with troubleshooting. It prints a detailed description of the job status, including the reason for failure." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "53fde837-3e50-4c9d-b094-aac987b04575", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from sagemaker import AutoML\n", + "\n", + "os.environ[\"AWS_DEFAULT_REGION\"] = sb_config.AWS_REGION\n", + "os.environ[\"AWS_ACCESS_KEY_ID\"] = sb_config.AWS_KEY_ID\n", + "os.environ[\"AWS_SECRET_ACCESS_KEY\"] = sb_config.AWS_ACCESS_KEY\n", + "\n", + "automl = AutoML.attach(auto_ml_job_name=sb_config.JOB_NAME)\n", + "automl.describe_auto_ml_job()" + ] + }, + { + "cell_type": "markdown", + "id": "66a0992f-f9eb-4003-8b29-46afc8bcff97", + "metadata": {}, + "source": [ + "Another hint is to check that the input data has been uploaded to the S3 bucket correctly. Generally, the data will be split into a number of batches. The following command will print a list of CSV files, one per batch. The name of each file is made up of the name of the input data view and the batch number. 
There should be no other files in the input data directory.\n", + "\n", + "The files can be inspected further by downloading them to a local machine with the `aws s3 cp` command.\n", + "\n", + "We assume that the required environment variables have been set when executing the previous cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1fc43fbb-3e32-4e89-b333-4bde25304e12", + "metadata": {}, + "outputs": [], + "source": [ + "aws_command = f'aws s3 ls s3://{sb_config.AWS_BUCKET}/{S3_OUTPUT_PATH} --recursive'\n", + "!{aws_command}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe278a56-151c-42e3-bc0c-24bf739ffc62", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/doc/tutorials/sagemaker/sme_training.png b/doc/tutorials/sagemaker/sme_training.png new file mode 100644 index 00000000..0cb48e99 Binary files /dev/null and b/doc/tutorials/sagemaker/sme_training.png differ