exasol · Shmuma · Mar 18, 2024
diff --git a/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/cloud/01_import_data.ipynb b/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/cloud/01_import_data.ipynb
@@ -74,12 +74,11 @@
   {
    "cell_type": "markdown",
    "source": [
-    "# Importing small amount of data from parquet files\n",
+    "# Importing data from parquet files\n",
     "\n",
-    "For the beginning, we'll load small volume of data from publicly available [Youtube 8M dataset](https://registry.opendata.aws/yt8m/).\n",
+    "To begin, we'll load a small volume of data from the publicly available [Ookla Network Performance Maps](https://registry.opendata.aws/speedtest-global-performance/) dataset, which contains aggregated network performance measurements from the speedtest.net website.\n",
     "\n",
-    "In this example, we'll work with \"dataset vocabulary\" which is information about classes of videos. In total there are 3862 entries, which are stored in one single parquet file: \n",
-    "s3://aws-roda-ml-datalake/yt8m_ods/vocabulary/run-1644252350398-part-block-0-r-00000-snappy.parquet\n"
+    "In this example, we'll import only a subset of the dataset: mobile users for Q1 of 2019. In total, there are 3M rows stored in a parquet file in a public S3 bucket: s3://ookla-open-data/parquet/performance/type=mobile/year=2019/quarter=1/2019-01-01_performance_mobile_tiles.parquet\n"
    ],
    "metadata": {
     "collapsed": false
@@ -111,20 +110,18 @@
     "For the file above, I got the following schema information:\n",
     "\n",
     "```\n",
-    "message glue_schema {\n",
-    "  optional binary Index (STRING);\n",
-    "  optional binary TrainVideoCount (STRING);\n",
-    "  optional binary KnowledgeGraphId (STRING);\n",
-    "  optional binary Name (STRING);\n",
-    "  optional binary WikiUrl (STRING);\n",
-    "  optional binary Vertical1 (STRING);\n",
-    "  optional binary Vertical2 (STRING);\n",
-    "  optional binary Vertical3 (STRING);\n",
-    "  optional binary WikiDescription (STRING);\n",
+    "message schema {\n",
+    "  optional binary quadkey (STRING);\n",
+    "  optional binary tile (STRING);\n",
+    "  optional int64 avg_d_kbps;\n",
+    "  optional int64 avg_u_kbps;\n",
+    "  optional int64 avg_lat_ms;\n",
+    "  optional int64 tests;\n",
+    "  optional int64 devices;\n",
     "}\n",
     "```  \n",
     "\n",
-    "From this schema we see that all the columns in parquet file have string type and optional (nullable).\n",
+    "From this schema we see that the parquet file has two types of columns: strings and 64-bit integers, all of them optional (nullable).\n",
     "Let's create the table in our database for this data. The names of columns are not important, just the order and their types have to match with parquet file schema."
    ],
    "metadata": {
@@ -137,20 +134,18 @@
    "execution_count": 38,
    "outputs": [],
    "source": [
-    "TABLE_NAME = \"Y8M_CLASSES\"\n",
+    "TABLE_NAME = \"OOKLA_MAP\"\n",
     "\n",
     "sql = \"\"\"\n",
     "create or replace table {schema_name!i}.{table_name!i} \n",
     "(\n",
-    "    ClsIndex          VARCHAR2(1024),\n",
-    "    TrainVideoCount   VARCHAR2(1024),\n",
-    "    KnowledgeGraphId  VARCHAR2(1024),\n",
-    "    Name              VARCHAR2(1024),\n",
-    "    WikiUrl           VARCHAR2(1024),\n",
-    "    Vertical1         VARCHAR2(1024),\n",
-    "    Vertical2         VARCHAR2(1024),\n",
-    "    Vertical3         VARCHAR2(1024),\n",
-    "    WikiDescription   VARCHAR2(2048)\n",
+    "    quadkey     VARCHAR2(1024),\n",
+    "    tile        VARCHAR2(1024),\n",
+    "    avg_d_kbps  BIGINT,\n",
+    "    avg_u_kbps  BIGINT,\n",
+    "    avg_lat_ms  BIGINT,\n",
+    "    tests       BIGINT,\n",
+    "    devices     BIGINT\n",
     ")\n",
     "\"\"\"\n",
     "\n",
@@ -235,7 +230,7 @@
     "sql = \"\"\"\n",
     "IMPORT INTO {schema!i}.{table!i}\n",
     "FROM SCRIPT {schema!i}.IMPORT_PATH WITH\n",
-    "    BUCKET_PATH = 's3a://aws-roda-ml-datalake/yt8m_ods/vocabulary/*'\n",
+    "    BUCKET_PATH = 's3a://ookla-open-data/parquet/performance/type=mobile/year=2019/quarter=1/*'\n",
     "    DATA_FORMAT = 'PARQUET'\n",
     "    S3_ENDPOINT = 's3-us-west-2.amazonaws.com'\n",
     "    CONNECTION_NAME = 'S3_CONNECTION';\n",
@@ -256,7 +251,7 @@
   {
    "cell_type": "markdown",
    "source": [
-    "Let's check that data was imported"
+    "Let's check that the data was imported by the process above"
    ],
    "metadata": {
     "collapsed": false