Update cloud storage extension dataset #256

Merged
merged 2 commits into from
Mar 18, 2024
@@ -74,12 +74,11 @@
 {
 "cell_type": "markdown",
 "source": [
-"# Importing small amount of data from parquet files\n",
+"# Importing data from parquet files\n",
 "\n",
-"For the beginning, we'll load small volume of data from publicly available [Youtube 8M dataset](https://registry.opendata.aws/yt8m/).\n",
+"To begin, we'll load a small volume of data from the publicly available [Ookla Network Performance Maps](https://registry.opendata.aws/speedtest-global-performance/) dataset, which contains aggregated network performance measurements from the speedtest.net website.\n",
 "\n",
-"In this example, we'll work with \"dataset vocabulary\" which is information about classes of videos. In total there are 3862 entries, which are stored in one single parquet file: \n",
-"s3://aws-roda-ml-datalake/yt8m_ods/vocabulary/run-1644252350398-part-block-0-r-00000-snappy.parquet\n"
+"In this example, we'll import only a subset of the dataset: mobile users for Q1 of 2019. In total there are 3M rows, stored in a parquet file in a public S3 bucket: s3://ookla-open-data/parquet/performance/type=mobile/year=2019/quarter=1/2019-01-01_performance_mobile_tiles.parquet\n"
 ],
 "metadata": {
 "collapsed": false
@@ -111,20 +110,18 @@
 "For the file above, I got the following schema information:\n",
 "\n",
 "```\n",
-"message glue_schema {\n",
-" optional binary Index (STRING);\n",
-" optional binary TrainVideoCount (STRING);\n",
-" optional binary KnowledgeGraphId (STRING);\n",
-" optional binary Name (STRING);\n",
-" optional binary WikiUrl (STRING);\n",
-" optional binary Vertical1 (STRING);\n",
-" optional binary Vertical2 (STRING);\n",
-" optional binary Vertical3 (STRING);\n",
-" optional binary WikiDescription (STRING);\n",
+"message schema {\n",
+" optional binary quadkey (STRING);\n",
+" optional binary tile (STRING);\n",
+" optional int64 avg_d_kbps;\n",
+" optional int64 avg_u_kbps;\n",
+" optional int64 avg_lat_ms;\n",
+" optional int64 tests;\n",
+" optional int64 devices;\n",
 "}\n",
 "``` \n",
 "\n",
-"From this schema we see that all the columns in parquet file have string type and optional (nullable).\n",
+"From this schema we see that the parquet file has two types of columns: strings and integers.\n",
 "Let's create a table in our database for this data. The column names are not important; only the column order and types have to match the parquet file schema."
 ],
 "metadata": {
@@ -137,20 +134,18 @@
 "execution_count": 38,
 "outputs": [],
 "source": [
-"TABLE_NAME = \"Y8M_CLASSES\"\n",
+"TABLE_NAME = \"OOKLA_MAP\"\n",
 "\n",
 "sql = \"\"\"\n",
 "create or replace table {schema_name!i}.{table_name!i} \n",
 "(\n",
-" ClsIndex VARCHAR2(1024),\n",
-" TrainVideoCount VARCHAR2(1024),\n",
-" KnowledgeGraphId VARCHAR2(1024),\n",
-" Name VARCHAR2(1024),\n",
-" WikiUrl VARCHAR2(1024),\n",
-" Vertical1 VARCHAR2(1024),\n",
-" Vertical2 VARCHAR2(1024),\n",
-" Vertical3 VARCHAR2(1024),\n",
-" WikiDescription VARCHAR2(2048)\n",
+" quadkey VARCHAR2(1024),\n",
+" tile VARCHAR2(1024),\n",
+" avg_d_kbps BIGINT,\n",
+" avg_u_kbps BIGINT,\n",
+" avg_lat_ms BIGINT,\n",
+" tests BIGINT,\n",
+" devices BIGINT\n",
 ")\n",
 "\"\"\"\n",
 "\n",
@@ -235,7 +230,7 @@
 "sql = \"\"\"\n",
 "IMPORT INTO {schema!i}.{table!i}\n",
 "FROM SCRIPT {schema!i}.IMPORT_PATH WITH\n",
-" BUCKET_PATH = 's3a://aws-roda-ml-datalake/yt8m_ods/vocabulary/*'\n",
+" BUCKET_PATH = 's3://ookla-open-data/parquet/performance/type=mobile/year=2019/quarter=1/*'\n",
 " DATA_FORMAT = 'PARQUET'\n",
 " S3_ENDPOINT = 's3-us-west-2.amazonaws.com'\n",
 " CONNECTION_NAME = 'S3_CONNECTION';\n",
@@ -256,7 +251,7 @@
 {
 "cell_type": "markdown",
 "source": [
-"Let's check that data was imported"
+"Let's check that the data was imported by the process above"
 ],
 "metadata": {
 "collapsed": false
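
The notebook text in this change says that only the column order and types have to match the parquet schema. As a rough illustration of that rule, the sketch below derives the `create or replace table` column list from the schema dump shown in the diff. The helpers `parse_parquet_schema` and `build_create_table` and the `PARQUET_TO_SQL` mapping are hypothetical and not part of the notebook; the mapping simply mirrors the types the notebook picks by hand (`VARCHAR2(1024)` for strings, `BIGINT` for `int64`).

```python
import re

# Assumed mapping, mirroring the column types chosen by hand in the notebook.
# Keys are (physical type, logical type annotation or None).
PARQUET_TO_SQL = {
    ("binary", "STRING"): "VARCHAR2(1024)",
    ("int64", None): "BIGINT",
}

# Matches lines such as "optional binary quadkey (STRING);" or "optional int64 tests;"
FIELD_RE = re.compile(
    r"^\s*(?:optional|required)\s+(\w+)\s+(\w+)(?:\s+\((\w+)\))?;\s*$"
)


def parse_parquet_schema(schema_text):
    """Return [(column_name, physical_type, logical_type_or_None), ...] in file order."""
    fields = []
    for line in schema_text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            physical, name, logical = match.groups()
            fields.append((name, physical, logical))
    return fields


def build_create_table(table_name, schema_text):
    """Build a DDL string whose columns follow the parquet field order."""
    columns = []
    for name, physical, logical in parse_parquet_schema(schema_text):
        # Raises KeyError for types outside the assumed mapping.
        sql_type = PARQUET_TO_SQL[(physical, logical)]
        columns.append(f"    {name} {sql_type}")
    return f"create or replace table {table_name}\n(\n" + ",\n".join(columns) + "\n)"


# Schema dump copied from the notebook cell in this diff.
SCHEMA = """
message schema {
  optional binary quadkey (STRING);
  optional binary tile (STRING);
  optional int64 avg_d_kbps;
  optional int64 avg_u_kbps;
  optional int64 avg_lat_ms;
  optional int64 tests;
  optional int64 devices;
}
"""

ddl = build_create_table("OOKLA_MAP", SCHEMA)
print(ddl)
```

Because the column list is built in the order the fields appear in the schema dump, the generated table lines up positionally with the parquet file, which is what the import relies on according to the notebook text.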