diff --git a/README.md b/README.md
index d0fce07..14a8a30 100644
--- a/README.md
+++ b/README.md
@@ -1,46 +1,28 @@
 # iFood Data Architect Test
 
-> This is my solution to the [iFood Data Architect Test](https://github.com/ifood/ifood-data-architect-test)
+## Scope
+
+This is my solution to the [iFood Data Architect Test](https://github.com/ifood/ifood-data-architect-test).
+
+> Work in progress! Solution not ready yet...
+
+## How to Run
+
+### Requirements
+
+* `docker >= 19.03.9`
+* `docker-compose >= 1.25.0`
+
+### Step-by-step
+
+* Run docker-compose from the repository root:
+
+```bash
+docker-compose up --build
+```
+
+* Run the [notebooks](./spark-dev-env/docker-img-volume/notebooks) as you wish.
 
 ## Test Scope
 
-Process semi-structured data and build a datalake that provides efficient storage and performance. The datalake must be organized in the following 2 layers:
-* raw layer: Datasets must have the same schema as the source, but support fast structured data reading
-* trusted layer: datamarts as required by the analysis team
-
-![Datalake](./media/datalake.png)
-
-Use whatever language, storage and tools you feel comfortable to.
-
-Also, briefly elaborate on your solution, datalake architecture, nomenclature, partitioning, data model and validation method.
-
-Once completed, you may submit your solution to ifoodbrain_hiring@ifood.com.br with the subject: iFood DArch Case Solution / Candidate Name.
-
-## Requirements
-
-* Source files:
-  * Order: s3://ifood-data-architect-test-source/order.json.gz
-  * Order Statuses: s3://ifood-data-architect-test-source/status.json.gz
-  * Restaurant: s3://ifood-data-architect-test-source/restaurant.csv.gz
-  * Consumer: s3://ifood-data-architect-test-source/consumer.csv.gz
-* Raw Layer (same schema from the source):
-  * Order dataset.
-  * Order Statuses dataset.
-  * Restaurant dataset.
-  * Consumer dataset.
-* Trusted Layer:
-  * Order dataset - one line per order with all data from order, consumer, restaurant and the LAST status from order statuses dataset. To help analysis, it would be a nice to have: data partitioned on the restaurant LOCAL date.
-  * Order Items dataset - easy to read dataset with one-to-many relationship with Order dataset. Must contain all data from _order_ items column.
-  * Order statuses - Dataset containing one line per order with the timestamp for each registered event: CONCLUDED, REGISTERED, CANCELLED, PLACED.
-* For the trusted layer, anonymize any sensitive data.
-* At the end of each ETL, use any appropriated methods to validate your data.
-* Read performance, watch out for small files and skewed data.
-
-## Non functional requirements
-* Data volume increases each day. All ETLs must be built to be scalable.
-* Use any data storage you feel comfortable to.
-* Document your solution.
-
-## Hints
-* Databricks Community: https://community.cloud.databricks.com
-* all-spark-notebook docker: https://hub.docker.com/r/jupyter/all-spark-notebook/
+Please find the code challenge [here](./TestScope.md).
\ No newline at end of file
diff --git a/TestScope.md b/TestScope.md
new file mode 100644
index 0000000..b7231a1
--- /dev/null
+++ b/TestScope.md
@@ -0,0 +1,44 @@
+# iFood Data Architect Test
+
+## Test Scope
+
+Process semi-structured data and build a datalake that provides efficient storage and performance. The datalake must be organized in the following 2 layers:
+* raw layer: Datasets must have the same schema as the source, but support fast structured data reading
+* trusted layer: datamarts as required by the analysis team
+
+![Datalake](./media/datalake.png)
+
+Use whatever language, storage and tools you feel comfortable with.
+
+Also, briefly elaborate on your solution, datalake architecture, nomenclature, partitioning, data model and validation method.
+
+Once completed, you may submit your solution to ifoodbrain_hiring@ifood.com.br with the subject: iFood DArch Case Solution / Candidate Name.
+
+## Requirements
+
+* Source files:
+  * Order: s3://ifood-data-architect-test-source/order.json.gz
+  * Order Statuses: s3://ifood-data-architect-test-source/status.json.gz
+  * Restaurant: s3://ifood-data-architect-test-source/restaurant.csv.gz
+  * Consumer: s3://ifood-data-architect-test-source/consumer.csv.gz
+* Raw Layer (same schema as the source):
+  * Order dataset.
+  * Order Statuses dataset.
+  * Restaurant dataset.
+  * Consumer dataset.
+* Trusted Layer:
+  * Order dataset - one line per order with all data from order, consumer, restaurant and the LAST status from the order statuses dataset. To help analysis, it would be nice to have the data partitioned on the restaurant LOCAL date.
+  * Order Items dataset - easy-to-read dataset with a one-to-many relationship to the Order dataset. Must contain all data from the _order_ items column.
+  * Order Statuses - dataset containing one line per order with the timestamp for each registered event: CONCLUDED, REGISTERED, CANCELLED, PLACED.
+* For the trusted layer, anonymize any sensitive data.
+* At the end of each ETL, use any appropriate methods to validate your data.
+* For read performance, watch out for small files and skewed data.
+
+## Non-functional requirements
+* Data volume increases each day. All ETLs must be built to be scalable.
+* Use any data storage you feel comfortable with.
+* Document your solution.
+
+## Hints
+* Databricks Community: https://community.cloud.databricks.com
+* all-spark-notebook docker: https://hub.docker.com/r/jupyter/all-spark-notebook/
diff --git a/spark-dev-env/docker-img-volume/notebooks/02-ingest-raw-data.ipynb b/spark-dev-env/docker-img-volume/notebooks/02-ingest-raw-data.ipynb
new file mode 100644
index 0000000..1575d22
--- /dev/null
+++ b/spark-dev-env/docker-img-volume/notebooks/02-ingest-raw-data.ipynb
@@ -0,0 +1,52 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import boto3"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cl = boto3.client('s3')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cl.download_file('ifood-data-architect-test-source', 'order.json.gz', 'order.json.gz')  # bucket and key taken from the test's source file list"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
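The trusted-layer Order dataset described in the test scope (one row per order, enriched with consumer and restaurant data plus the LAST status, partitioned by the restaurant's LOCAL date) is the heart of the exercise, and the notebooks above do not implement it yet. Below is a minimal PySpark sketch of how that table could be assembled; the raw-layer paths under `s3://datalake/raw/`, the column names (`order_id`, `customer_id`, `merchant_id`, `merchant_timezone`, `order_created_at`, `customer_name`, ...), and the hashing used for anonymization are all illustrative assumptions, not the actual schemas.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("trusted-order").getOrCreate()

# Raw-layer datasets (paths and column names are assumptions for illustration).
orders = spark.read.parquet("s3://datalake/raw/order")
consumers = spark.read.parquet("s3://datalake/raw/consumer")
restaurants = spark.read.parquet("s3://datalake/raw/restaurant")
statuses = spark.read.parquet("s3://datalake/raw/status")

# Keep only the LAST status event per order.
latest = Window.partitionBy("order_id").orderBy(F.col("created_at").desc())
last_status = (
    statuses
    .withColumn("rn", F.row_number().over(latest))
    .where(F.col("rn") == 1)
    .select("order_id", F.col("value").alias("last_status"))
)

trusted_order = (
    orders
    .join(consumers, "customer_id", "left")
    .join(restaurants.withColumnRenamed("id", "merchant_id"), "merchant_id", "left")
    .join(last_status, "order_id", "left")
    # Restaurant LOCAL date: shift the order timestamp into the restaurant's timezone.
    .withColumn(
        "order_local_date",
        F.to_date(F.from_utc_timestamp("order_created_at", F.col("merchant_timezone"))),
    )
    # Simple anonymization example: replace a sensitive column with its hash.
    .withColumn("customer_name", F.sha2(F.col("customer_name"), 256))
)

(
    trusted_order
    .repartition("order_local_date")  # group rows per local date before writing to avoid many small files
    .write
    .mode("overwrite")
    .partitionBy("order_local_date")
    .parquet("s3://datalake/trusted/order")
)
```

Partitioning on the restaurant local date plus the explicit repartition before the write keeps the file count per partition low, which is one way to address the small-files warning in the test scope; a validation step (row counts against the raw layer, null checks on the join keys, uniqueness of `order_id`) could run right after the write to satisfy the per-ETL validation requirement.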