diff --git a/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/cloud/01_import_data.ipynb b/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/cloud/01_import_data.ipynb index 5ef42385..da6de39b 100644 --- a/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/cloud/01_import_data.ipynb +++ b/exasol/ds/sandbox/runtime/ansible/roles/jupyter/files/notebook/cloud/01_import_data.ipynb @@ -74,12 +74,11 @@ { "cell_type": "markdown", "source": [ - "# Importing small amount of data from parquet files\n", + "# Importing data from parquet files\n", "\n", - "For the beginning, we'll load small volume of data from publicly available [Youtube 8M dataset](https://registry.opendata.aws/yt8m/).\n", + "To begin, we'll load a small volume of data from the publicly available [Ookla Network Performance Maps](https://registry.opendata.aws/speedtest-global-performance/) dataset, which contains aggregated network performance measurements from the speedtest.net website.\n", "\n", - "In this example, we'll work with \"dataset vocabulary\" which is information about classes of videos. In total there are 3862 entries, which are stored in one single parquet file: \n", - "s3://aws-roda-ml-datalake/yt8m_ods/vocabulary/run-1644252350398-part-block-0-r-00000-snappy.parquet\n" + "In this example, we'll import only a subset of the dataset: mobile users for Q1 of 2019. 
In total there are 3M rows stored in a parquet file in a public S3 bucket: s3://ookla-open-data/parquet/performance/type=mobile/year=2019/quarter=1/2019-01-01_performance_mobile_tiles.parquet\n" ], "metadata": { "collapsed": false @@ -111,20 +110,18 @@ "For the file above, I got the following schema information:\n", "\n", "```\n", - "message glue_schema {\n", - " optional binary Index (STRING);\n", - " optional binary TrainVideoCount (STRING);\n", - " optional binary KnowledgeGraphId (STRING);\n", - " optional binary Name (STRING);\n", - " optional binary WikiUrl (STRING);\n", - " optional binary Vertical1 (STRING);\n", - " optional binary Vertical2 (STRING);\n", - " optional binary Vertical3 (STRING);\n", - " optional binary WikiDescription (STRING);\n", + "message schema {\n", + " optional binary quadkey (STRING);\n", + " optional binary tile (STRING);\n", + " optional int64 avg_d_kbps;\n", + " optional int64 avg_u_kbps;\n", + " optional int64 avg_lat_ms;\n", + " optional int64 tests;\n", + " optional int64 devices;\n", "}\n", "``` \n", "\n", - "From this schema we see that all the columns in parquet file have string type and optional (nullable).\n", + "From this schema we see that the parquet file has two types of columns: strings and integers.\n", "Let's create the table in our database for this data. The names of columns are not important, just the order and their types have to match with parquet file schema."
], "metadata": { @@ -137,20 +134,18 @@ "execution_count": 38, "outputs": [], "source": [ - "TABLE_NAME = \"Y8M_CLASSES\"\n", + "TABLE_NAME = \"OOKLA_MAP\"\n", "\n", "sql = \"\"\"\n", "create or replace table {schema_name!i}.{table_name!i} \n", "(\n", - " ClsIndex VARCHAR2(1024),\n", - " TrainVideoCount VARCHAR2(1024),\n", - " KnowledgeGraphId VARCHAR2(1024),\n", - " Name VARCHAR2(1024),\n", - " WikiUrl VARCHAR2(1024),\n", - " Vertical1 VARCHAR2(1024),\n", - " Vertical2 VARCHAR2(1024),\n", - " Vertical3 VARCHAR2(1024),\n", - " WikiDescription VARCHAR2(2048)\n", + " quadkey VARCHAR2(1024),\n", + " tile VARCHAR2(1024),\n", + " avg_d_kbps BIGINT,\n", + " avg_u_kbps BIGINT,\n", + " avg_lat_ms BIGINT,\n", + " tests BIGINT,\n", + " devices BIGINT\n", ")\n", "\"\"\"\n", "\n", @@ -235,7 +230,7 @@ "sql = \"\"\"\n", "IMPORT INTO {schema!i}.{table!i}\n", "FROM SCRIPT {schema!i}.IMPORT_PATH WITH\n", - " BUCKET_PATH = 's3a://aws-roda-ml-datalake/yt8m_ods/vocabulary/*'\n", + " BUCKET_PATH = 's3a://ookla-open-data/parquet/performance/type=mobile/year=2019/quarter=1/*'\n", " DATA_FORMAT = 'PARQUET'\n", " S3_ENDPOINT = 's3-us-west-2.amazonaws.com'\n", " CONNECTION_NAME = 'S3_CONNECTION';\n",