Update instruction (#2847)
bastrik authored Nov 21, 2023
1 parent 3ac70fb commit 35272c8
Showing 8 changed files with 70 additions and 104 deletions.
@@ -17,4 +17,5 @@ materialization_settings:
spark.driver.memory: 36g
spark.executor.cores: 4
spark.executor.instances: 2
spark.executor.memory: 36g
spark.sql.shuffle.partitions: 1
@@ -27,4 +27,5 @@ materialization_settings:
spark.driver.memory: 36g
spark.executor.cores: 4
spark.executor.instances: 2
spark.executor.memory: 36g
spark.sql.shuffle.partitions: 1
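Both YAML hunks above add the same setting. For context, a minimal sketch of the equivalent SDK call follows; `MaterializationSettings` and `MaterializationComputeResource` from `azure.ai.ml.entities` and the instance type are assumptions here, mirroring the Python change later in this commit rather than quoting it.

```python
# Sketch: SDK-side equivalent of the YAML materialization settings above.
# Assumes the azure-ai-ml v2 entities; adjust the instance type to your setup.
from azure.ai.ml.entities import (
    MaterializationComputeResource,
    MaterializationSettings,
)

materialization_settings = MaterializationSettings(
    offline_enabled=True,
    resource=MaterializationComputeResource(instance_type="standard_e8s_v3"),
    spark_configuration={
        "spark.driver.memory": "36g",
        "spark.executor.cores": 4,
        "spark.executor.instances": 2,
        "spark.executor.memory": "36g",
        # Small sample data: a single shuffle partition avoids producing
        # many tiny parquet files per day in the offline store.
        "spark.sql.shuffle.partitions": 1,
    },
    schedule=None,
)
```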
@@ -90,31 +90,28 @@
"\n",
"To prepare the notebook environment for development:\n",
"\n",
"1. Clone the [azureml-examples](https://github.com/azure/azureml-examples) repository to your local GitHub resources with this command:\n",
"1. In the Azure Machine Learning studio environment, select Notebooks on the left pane, and then select the Samples tab.\n",
"\n",
" `git clone --depth 1 https://github.com/Azure/azureml-examples`\n",
"1. Browse to the featurestore_sample directory (select **Samples** > **SDK v2** > **sdk** > **python** > **featurestore_sample**), and then select **Clone**.\n",
"\n",
" You can also download a zip file from the [azureml-examples](https://github.com/azure/azureml-examples) repository. At this page, first select the `code` dropdown, and then select `Download ZIP`. Then, unzip the contents into a folder on your local device.\n",
"\n",
"1. Upload the feature store samples directory to the project workspace\n",
"\n",
" 1. In the Azure Machine Learning workspace, open the Azure Machine Learning studio UI.\n",
" 1. Select **Notebooks** in left navigation panel.\n",
" 1. Select your user name in the directory listing.\n",
" 1. Select ellipses (**...**) and then select **Upload folder**.\n",
" 1. Select the feature store samples folder from the cloned directory path: `azureml-examples/sdk/python/featurestore-sample`.\n",
"1. The **Select target directory** panel opens. Select the **Users** directory and then select _your user name_, and then select **Clone**.\n",
"\n",
"1. Run the tutorial\n",
"\n",
" * Option 1: Create a new notebook, and execute the instructions in this document, step by step.\n",
" * Option 2: Open existing notebook `featurestore_sample/notebooks/sdk_and_cli/1. Develop a feature set and register with managed feature store.ipynb`. You may keep this document open and refer to it for more explanation and documentation links.\n",
"\n",
" 1. Select **Serverless Spark Compute** in the top navigation **Compute** dropdown. This operation might take one to two minutes. Wait for a status bar in the top to display **Configure session**.\n",
" 1. Select **Configure session** in the top status bar.\n",
" 1. Select **Python packages**.\n",
" 1. Select **Upload conda file**.\n",
" 1. Select file `azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml` located on your local device.\n",
" 1. (Optional) Increase the session time-out (idle time in minutes) to reduce the serverless spark cluster startup time.\n",
"1. To configure the notebook environment, you must upload the conda.yml file\n",
"\n",
" 1. Select **Notebooks** on the left pane, and then select the **Files** tab.\n",
" 1. Browse to the *env* directory (select **Users** > *your_user_name* > **featurestore_sample** > **project** > **env**), and then select the conda.yml file.\n",
" 1. Select **Download**\n",
" 1. Select **Serverless Spark Compute** in the top navigation **Compute** dropdown. This operation might take one to two minutes. Wait for a status bar in the top to display **Configure session**.\n",
" 1. Select **Configure session** in the top status bar.\n",
" 1. Select **Python packages**.\n",
" 1. Select **Upload conda file**.\n",
" 1. Select the `conda.yml` you downloaded on your local device.\n",
" 1. (Optional) Increase the session time-out (idle time in minutes) to reduce the serverless spark cluster startup time.\n",
"\n",
"__Important:__ Except for this step, you need to run all the other steps every time you have a new spark session/session time out.\n"
]
@@ -1108,7 +1105,13 @@
},
"source": [
"## Step 6: Enable offline materialization on transactions feature set\n",
"Once materialization is enabled on a feature set, you can perform backfill (this tutorial) or schedule recurrent materialization jobs (shown in a later tutorial)."
"Once materialization is enabled on a feature set, you can perform backfill (this tutorial) or schedule recurrent materialization jobs (shown in a later tutorial).\n",
"\n",
"#### Set spark.sql.shuffle.partitions in the yaml file according to the feature data size\n",
"\n",
"The spark configuration `spark.sql.shuffle.partitions` is an OPTIONAL parameter that can affect the number of parquet files generated (per day) when the feature set is materialized into the offline store. The default value of this parameter is 200. The best practice is to avoid generating many small parquet files. If offline feature retrieval turns out to become slow after the feature set is materialized, please go to the corresponding folder in the offline store to check whether it is the issue of having too many small parquet files (per day), and adjust the value of this parameter accordingly.\n",
"\n",
"*Note: The sample data used in this notebook is small. So this parameter is set to 1 in the featureset_asset_offline_enabled.yaml file.*"
]
},
{
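To apply the guidance from the cell above, you can count the Parquet files in a single day's folder of the offline store before tuning the setting. A minimal sketch, assuming the Spark session can reach the store; the `abfss://` path is an illustrative placeholder:

```python
# Sketch: check one day's offline-store folder for the many-small-files problem.
# The path is a placeholder; point it at a real day partition of your feature set.
day_path = (
    "abfss://<container>@<account>.dfs.core.windows.net"
    "/<feature_set>/year=2023/month=1/day=1"
)

# Use the Hadoop FileSystem API that ships with the active Spark session.
jvm = spark.sparkContext._jvm
hadoop_path = jvm.org.apache.hadoop.fs.Path(day_path)
fs = hadoop_path.getFileSystem(spark._jsc.hadoopConfiguration())

parquet_files = [
    status
    for status in fs.listStatus(hadoop_path)
    if status.getPath().getName().endswith(".parquet")
]
total_mb = sum(s.getLen() for s in parquet_files) / (1024 * 1024)
print(f"{len(parquet_files)} parquet files, {total_mb:.1f} MB total")
# Hundreds of files of only a few MB each suggest lowering
# spark.sql.shuffle.partitions for this feature set.
```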
@@ -566,7 +566,7 @@
" name=\"azureml_0a7417c8-409a-4536-a069-4ea23a08ebfe_output_data_data_with_prediction\",\n",
" version=\"1\",\n",
")\n",
"inf_output_df = spark.read.parquet(inf_data_output.path)\n",
"inf_output_df = spark.read.parquet(inf_data_output.path + \"data/*.parquet\")\n",
"display(inf_output_df.head(5))"
]
},
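The updated read appends `data/*.parquet` directly to the asset path, which relies on the path ending with a slash. A defensive variant (a sketch, not part of the commit) joins the pieces with `posixpath`, which leaves URI schemes such as `abfss://` intact:

```python
# Sketch: build the glob without depending on a trailing slash in the asset path.
import posixpath

output_glob = posixpath.join(inf_data_output.path, "data", "*.parquet")
inf_output_df = spark.read.parquet(output_glob)
display(inf_output_df.head(5))
```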
@@ -57,7 +57,7 @@
"Before following the steps in this article, make sure you have the following prerequisites:\n",
"\n",
"* An Azure Machine Learning workspace. If you don't have one, use the steps in the [Quickstart: Create workspace resources](https://learn.microsoft.com/en-us/azure/machine-learning/quickstart-create-resources?view=azureml-api-2) article to create one.\n",
"* To perform the steps in this article, your user account must be assigned the owner or contributor role to a resource group where the feature store will be created"
"* To perform the steps in this article, your user account must be assigned the owner role to a resource group where the feature store will be created"
]
},
{
@@ -79,31 +79,28 @@
"\n",
"To prepare the notebook environment for development:\n",
"\n",
"1. Clone the [azureml-examples](https://github.com/azure/azureml-examples) repository to your local GitHub resources with this command:\n",
"1. In the Azure Machine Learning studio environment, select Notebooks on the left pane, and then select the Samples tab.\n",
"\n",
" `git clone --depth 1 https://github.com/Azure/azureml-examples`\n",
"1. Browse to the featurestore_sample directory (select **Samples** > **SDK v2** > **sdk** > **python** > **featurestore_sample**), and then select **Clone**.\n",
"\n",
" You can also download a zip file from the [azureml-examples](https://github.com/azure/azureml-examples) repository. At this page, first select the `code` dropdown, and then select `Download ZIP`. Then, unzip the contents into a folder on your local device.\n",
"\n",
"1. Upload the feature store samples directory to the project workspace\n",
"\n",
" 1. In the Azure Machine Learning workspace, open the Azure Machine Learning studio UI.\n",
" 1. Select **Notebooks** in left navigation panel.\n",
" 1. Select your user name in the directory listing.\n",
" 1. Select ellipses (**...**) and then select **Upload folder**.\n",
" 1. Select the feature store samples folder from the cloned directory path: `azureml-examples/sdk/python/featurestore-sample`.\n",
"1. The **Select target directory** panel opens. Select the **Users** directory and then select _your user name_, and then select **Clone**.\n",
"\n",
"1. Run the tutorial\n",
"\n",
" * Option 1: Create a new notebook, and execute the instructions in this document, step by step.\n",
" * Option 2: Open existing notebook `featurestore_sample/notebooks/sdk_only/1. Develop a feature set and register with managed feature store.ipynb`. You may keep this document open and refer to it for more explanation and documentation links.\n",
"\n",
" 1. Select **Serverless Spark Compute** in the top navigation **Compute** dropdown. This operation might take one to two minutes. Wait for a status bar in the top to display **Configure session**.\n",
" 1. Select **Configure session** in the top status bar.\n",
" 1. Select **Python packages**.\n",
" 1. Select **Upload conda file**.\n",
" 1. Select file `azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml` located on your local device.\n",
" 1. (Optional) Increase the session time-out (idle time in minutes) to reduce the serverless spark cluster startup time.\n",
"1. To configure the notebook environment, you must upload the conda.yml file\n",
"\n",
" 1. Select **Notebooks** on the left pane, and then select the **Files** tab.\n",
" 1. Browse to the *env* directory (select **Users** > *your_user_name* > **featurestore_sample** > **project** > **env**), and then select the conda.yml file.\n",
" 1. Select **Download**\n",
" 1. Select **Serverless Spark Compute** in the top navigation **Compute** dropdown. This operation might take one to two minutes. Wait for a status bar in the top to display **Configure session**.\n",
" 1. Select **Configure session** in the top status bar.\n",
" 1. Select **Python packages**.\n",
" 1. Select **Upload conda file**.\n",
" 1. Select the `conda.yml` you downloaded on your local device.\n",
" 1. (Optional) Increase the session time-out (idle time in minutes) to reduce the serverless spark cluster startup time.\n",
"\n",
"__Important:__ Except for this step, you need to run all the other steps every time you have a new spark session/session time out.\n"
]
@@ -1047,7 +1044,13 @@
},
"source": [
"## Step 6: Enable offline materialization on transactions feature set\n",
"Once materialization is enabled on a feature set, you can perform backfill (this tutorial) or schedule recurrent materialization jobs (shown in a later tutorial)."
"Once materialization is enabled on a feature set, you can perform backfill (this tutorial) or schedule recurrent materialization jobs (shown in a later tutorial).\n",
"\n",
"#### Set spark.sql.shuffle.partitions in the yaml file according to the feature data size\n",
"\n",
"The spark configuration `spark.sql.shuffle.partitions` is an OPTIONAL parameter that can affect the number of parquet files generated (per day) when the feature set is materialized into the offline store. The default value of this parameter is 200. The best practice is to avoid generating many small parquet files. If offline feature retrieval turns out to become slow after the feature set is materialized, please go to the corresponding folder in the offline store to check whether it is the issue of having too many small parquet files (per day), and adjust the value of this parameter accordingly.\n",
"\n",
"*Note: The sample data used in this notebook is small. So this parameter is set to 1 in the code below.*"
]
},
{
@@ -1083,6 +1086,7 @@
" \"spark.executor.cores\": 4,\n",
" \"spark.executor.memory\": \"36g\",\n",
" \"spark.executor.instances\": 2,\n",
" \"spark.sql.shuffle.partitions\": 1,\n",
" },\n",
" schedule=None,\n",
")\n",
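Once materialization is enabled with these settings, the backfill that the cell above refers to can be submitted over a feature window. A minimal sketch; `fs_client`, the `begin_backfill` operation, and the dates follow the pattern of these tutorials and are assumptions here:

```python
# Sketch: submit a one-off backfill for the materialized feature set.
# fs_client is assumed to be an MLClient scoped to the feature store.
from datetime import datetime

poller = fs_client.feature_sets.begin_backfill(
    name="transactions",
    version="1",
    feature_window_start_time=datetime(2023, 1, 1),
    feature_window_end_time=datetime(2023, 2, 1),
)
# The returned metadata is expected to list the submitted materialization job(s).
print(poller.result().job_ids)
```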
@@ -541,7 +541,7 @@
" name=\"azureml_1c106662-aa5e-4354-b5f9-57c1b0fdb3a7_output_data_data_with_prediction\",\n",
" version=\"1\",\n",
")\n",
"inf_output_df = spark.read.parquet(inf_data_output.path)\n",
"inf_output_df = spark.read.parquet(inf_data_output.path + \"data/*.parquet\")\n",
"display(inf_output_df.head(5))"
]
},
@@ -36,39 +36,19 @@
}
},
"source": [
"## Prepare the notebook environment for development\n",
"Note: This tutorial uses Azure Machine Learning notebook with **Serverless Spark Compute**.\n",
"\n",
"1. Clone the examples repository to your local machine: To run the tutorial, first clone the [examples repository - (azureml-examples)](https://github.com/azure/azureml-examples) with this command:\n",
"\n",
" `git clone --depth 1 https://github.com/Azure/azureml-examples`\n",
"\n",
" You can also download a zip file from the [examples repository (azureml-examples)](https://github.com/azure/azureml-examples). At this page, first select the `code` dropdown, and then select `Download ZIP`. Then, unzip the contents into a folder on your local device.\n",
"\n",
"2. Running the tutorial:\n",
"* Option 1: Create a new notebook, and execute the instructions in this document step by step. \n",
"* Option 2: Open the existing notebook `featurestore_sample/notebooks/sdk_only/5. Enable online store and run online inference.ipynb`. You may keep this document open and refer to it for additional explanation and documentation links.\n",
"\n",
" 1. Select **Serverless Spark Compute** in the top navigation **Compute** dropdown. This operation might take one to two minutes. Wait for a status bar in the top to display **Configure session**.\n",
" 2. Select **Configure session** in the top status bar.\n",
" 3. Select **Python packages**.\n",
" 4. Select **Upload conda file**.\n",
" 5. Select file `azureml-examples/sdk/python/featurestore-sample/project/env/online.yml` located on your local device.\n",
" 6. (Optional) Increase the session time-out (idle time in minutes) to reduce the serverless spark cluster startup time."
"# Set up"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
},
"metadata": {},
"source": [
"# Set up"
"#### Configure Azure ML spark notebook\n",
"\n",
"1. In the \"Compute\" dropdown in the top nav, select \"Serverless Spark Compute\". \n",
"1. Click on \"configure session\" in top status bar -> click on \"Python packages\" -> click on \"upload conda file\" -> select the file azureml-examples/sdk/python/featurestore-sample/project/env/online.yml from your local machine; Also increase the session time out (idle time) if you want to avoid running the prerequisites frequently\n",
"\n",
"\n"
]
},
{
@@ -21,10 +21,7 @@
"source": [
"# Prerequisites\n",
"\n",
"> [!NOTE]\n",
"> This tutorial uses Azure Machine Learning notebook with **Serverless Spark Compute**.\n",
"\n",
"1. Please ensure you have executed the first tutorial notebook that includes creation of a feature store and feature set, followed by enabling materialization and performing backfill."
"1. Before proceeding, please ensure that you have already completed previous three tutorials of this series. We will be reusing feature store and some other resources created in the previous tutorials."
]
},
{
@@ -37,39 +34,19 @@
}
},
"source": [
"## Set up\n",
"\n",
"This tutorial uses the Python feature store core SDK (`azureml-featurestore`). The Python SDK is used for create, read, update, and delete (CRUD) operations, on feature stores, feature sets, and feature store entities.\n",
"\n",
"You don't need to explicitly install these resources for this tutorial, because in the set-up instructions shown here, the `conda.yaml` file covers them.\n",
"\n",
"To prepare the notebook environment for development:\n",
"\n",
"1. Clone the [azureml-examples](https://github.com/azure/azureml-examples) repository to your local GitHub resources with this command:\n",
"\n",
" `git clone --depth 1 https://github.com/Azure/azureml-examples`\n",
"\n",
" You can also download a zip file from the [azureml-examples](https://github.com/azure/azureml-examples) repository. At this page, first select the `code` dropdown, and then select `Download ZIP`. Then, unzip the contents into a folder on your local device.\n",
"\n",
"1. Upload the feature store samples directory to the project workspace\n",
"\n",
" 1. In the Azure Machine Learning workspace, open the Azure Machine Learning studio UI.\n",
" 1. Select **Notebooks** in left navigation panel.\n",
" 1. Select your user name in the directory listing.\n",
" 1. Select ellipses (**...**) and then select **Upload folder**.\n",
" 1. Select the feature store samples folder from the cloned directory path: `azureml-examples/sdk/python/featurestore-sample`.\n",
"\n",
"1. Run the tutorial\n",
"## Setup\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Configure Azure ML spark notebook\n",
"\n",
" * Option 1: Create a new notebook, and execute the instructions in this document, step by step.\n",
" * Option 2: Open existing notebook `featurestore_sample/notebooks/sdk_only/5. Develop a feature set with custom source.ipynb`. You may keep this document open and refer to it for more explanation and documentation links.\n",
"1. In the \"Compute\" dropdown in the top nav, select \"Serverless Spark Compute\". \n",
"1. Click on \"configure session\" in top status bar -> click on \"Python packages\" -> click on \"upload conda file\" -> select the file azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml from your local machine; Also increase the session time out (idle time) if you want to avoid running the prerequisites frequently\n",
"\n",
" 1. Select **Serverless Spark Compute** in the top navigation **Compute** dropdown. This operation might take one to two minutes. Wait for a status bar in the top to display **Configure session**.\n",
" 1. Select **Configure session** in the top status bar.\n",
" 1. Select **Python packages**.\n",
" 1. Select **Upload conda file**.\n",
" 1. Select file `azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml` located on your local device.\n",
" 1. (Optional) Increase the session time-out (idle time in minutes) to reduce the serverless spark cluster startup time."
"\n"
]
},
{
