Update README.md
murilo committed Sep 5, 2020
1 parent 1757171 commit 5c5039e
Showing 3 changed files with 119 additions and 41 deletions.
64 changes: 23 additions & 41 deletions README.md
@@ -1,46 +1,28 @@
# iFood Data Architect Test

> This is my solution to the [iFood Data Architect Test](https://github.com/ifood/ifood-data-architect-test)
## Scope

This is my solution to the [iFood Data Architect Test](https://github.com/ifood/ifood-data-architect-test)

> Work in progress! Solution not ready yet...
## How to Run

### Requirements

* `docker >= 19.03.9`
* `docker-compose >= 1.25.0`

### Step-by-step

* Run docker-compose from the repository root:

```bash
docker-compose up --build
```

* Run [notebooks](./spark-dev-env/docker-img-volume/notebooks) as you wish.

## Test Scope

Process semi-structured data and build a datalake that provides efficient storage and performance. The datalake must be organized in the following 2 layers:
* raw layer: Datasets must have the same schema as the source, but support fast structured data reading
* trusted layer: datamarts as required by the analysis team

![Datalake](./media/datalake.png)

Use whatever language, storage and tools you feel comfortable with.

Also, briefly elaborate on your solution, datalake architecture, nomenclature, partitioning, data model and validation method.

Once completed, you may submit your solution to [email protected] with the subject: iFood DArch Case Solution / Candidate Name.

## Requirements

* Source files:
  * Order: s3://ifood-data-architect-test-source/order.json.gz
  * Order Statuses: s3://ifood-data-architect-test-source/status.json.gz
  * Restaurant: s3://ifood-data-architect-test-source/restaurant.csv.gz
  * Consumer: s3://ifood-data-architect-test-source/consumer.csv.gz
* Raw Layer (same schema as the source):
  * Order dataset.
  * Order Statuses dataset.
  * Restaurant dataset.
  * Consumer dataset.
* Trusted Layer:
  * Order dataset - one line per order with all data from order, consumer, restaurant and the LAST status from the order statuses dataset. To help analysis, it would be nice to have the data partitioned on the restaurant LOCAL date.
  * Order Items dataset - easy-to-read dataset with a one-to-many relationship with the Order dataset. Must contain all data from the _order_ items column.
  * Order statuses - Dataset containing one line per order with the timestamp for each registered event: CONCLUDED, REGISTERED, CANCELLED, PLACED.
* For the trusted layer, anonymize any sensitive data.
* At the end of each ETL, use any appropriate method to validate your data.
* Read performance: watch out for small files and skewed data.

## Non-functional requirements
* Data volume increases each day. All ETLs must be built to be scalable.
* Use any data storage you feel comfortable with.
* Document your solution.

## Hints
* Databricks Community: https://community.cloud.databricks.com
* all-spark-notebook docker: https://hub.docker.com/r/jupyter/all-spark-notebook/
Please find the code challenge [here](./TestScope.md).
44 changes: 44 additions & 0 deletions TestScope.md
@@ -0,0 +1,44 @@
# iFood Data Architect Test

## Test Scope

Process semi-structured data and build a datalake that provides efficient storage and performance. The datalake must be organized in the following 2 layers:
* raw layer: Datasets must have the same schema as the source, but support fast structured data reading
* trusted layer: datamarts as required by the analysis team

![Datalake](./media/datalake.png)

Use whatever language, storage and tools you feel comfortable with.

Also, briefly elaborate on your solution, datalake architecture, nomenclature, partitioning, data model and validation method.

Once completed, you may submit your solution to [email protected] with the subject: iFood DArch Case Solution / Candidate Name.

## Requirements

* Source files:
  * Order: s3://ifood-data-architect-test-source/order.json.gz
  * Order Statuses: s3://ifood-data-architect-test-source/status.json.gz
  * Restaurant: s3://ifood-data-architect-test-source/restaurant.csv.gz
  * Consumer: s3://ifood-data-architect-test-source/consumer.csv.gz
* Raw Layer (same schema as the source):
  * Order dataset.
  * Order Statuses dataset.
  * Restaurant dataset.
  * Consumer dataset.
* Trusted Layer:
  * Order dataset - one line per order with all data from order, consumer, restaurant and the LAST status from the order statuses dataset. To help analysis, it would be nice to have the data partitioned on the restaurant LOCAL date.
  * Order Items dataset - easy-to-read dataset with a one-to-many relationship with the Order dataset. Must contain all data from the _order_ items column.
  * Order statuses - Dataset containing one line per order with the timestamp for each registered event: CONCLUDED, REGISTERED, CANCELLED, PLACED.
* For the trusted layer, anonymize any sensitive data.
* At the end of each ETL, use any appropriate method to validate your data.
* Read performance: watch out for small files and skewed data.
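
The two status-related requirements above (the LAST status per order, and one line per order with a timestamp for each event) can be illustrated with a minimal pure-Python sketch. The field names `order_id`, `value` and `created_at` and the sample events are assumptions made for illustration; they are not taken from the real source files, and a scalable ETL would express the same logic in a distributed engine rather than in plain Python.

```python
from datetime import datetime

# Hypothetical status events. The field names (order_id, value, created_at)
# are assumptions for illustration, not the real source schema.
events = [
    {"order_id": "o1", "value": "REGISTERED", "created_at": "2019-01-01T10:00:00"},
    {"order_id": "o1", "value": "PLACED", "created_at": "2019-01-01T10:01:00"},
    {"order_id": "o1", "value": "CONCLUDED", "created_at": "2019-01-01T11:00:00"},
    {"order_id": "o2", "value": "REGISTERED", "created_at": "2019-01-02T09:00:00"},
    {"order_id": "o2", "value": "CANCELLED", "created_at": "2019-01-02T09:05:00"},
]

# The four events named in the requirements.
STATUSES = ("REGISTERED", "PLACED", "CONCLUDED", "CANCELLED")


def pivot_statuses(events):
    """Return one row per order with one timestamp column per status."""
    rows = {}
    for event in events:
        row = rows.setdefault(
            event["order_id"],
            {"order_id": event["order_id"], **{s.lower(): None for s in STATUSES}},
        )
        column = event["value"].lower()
        if column not in row:
            continue  # ignore statuses outside the four listed events
        ts = datetime.fromisoformat(event["created_at"])
        # If a status is reported more than once, keep the latest timestamp.
        if row[column] is None or ts > row[column]:
            row[column] = ts
    return rows


def last_status(events):
    """Return the most recent status value for each order."""
    latest = {}
    for event in events:
        ts = datetime.fromisoformat(event["created_at"])
        current = latest.get(event["order_id"])
        if current is None or ts > current[0]:
            latest[event["order_id"]] = (ts, event["value"])
    return {order_id: value for order_id, (_, value) in latest.items()}


pivoted = pivot_statuses(events)
latest = last_status(events)
```

Here `pivoted` corresponds to the trusted "Order statuses" dataset and `latest` to the status that would be joined onto the trusted "Order" dataset.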

## Non-functional requirements
* Data volume increases each day. All ETLs must be built to be scalable.
* Use any data storage you feel comfortable with.
* Document your solution.

## Hints
* Databricks Community: https://community.cloud.databricks.com
* all-spark-notebook docker: https://hub.docker.com/r/jupyter/all-spark-notebook/
52 changes: 52 additions & 0 deletions spark-dev-env/docker-img-volume/notebooks/02-ingest-raw-data.ipynb
@@ -0,0 +1,52 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import boto3"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"cl = boto3.client('s3')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
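"cl.download_file('ifood-data-architect-test-source', 'order.json.gz', 'order.json.gz')  # bucket and key from TestScope.md; the local filename is an assumption"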
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
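
The notebook creates a boto3 S3 client and calls `download_file`, which takes a bucket, a key and a local filename. As a rough sketch of how the four `s3://` source URLs listed in TestScope.md could be turned into those arguments, the helper below uses only the standard library. The function names `split_s3_url` and `download_plan` and the `raw` target directory are hypothetical, and the actual download call is left as a comment because it needs boto3 and network access.

```python
from urllib.parse import urlparse

# Source URLs as listed in TestScope.md.
SOURCES = [
    "s3://ifood-data-architect-test-source/order.json.gz",
    "s3://ifood-data-architect-test-source/status.json.gz",
    "s3://ifood-data-architect-test-source/restaurant.csv.gz",
    "s3://ifood-data-architect-test-source/consumer.csv.gz",
]


def split_s3_url(url):
    """Split an s3:// URL into a (bucket, key) pair."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_plan(urls, target_dir="raw"):
    """Build (bucket, key, local_path) tuples.

    target_dir is a hypothetical local folder, not something the
    repository defines.
    """
    plan = []
    for url in urls:
        bucket, key = split_s3_url(url)
        plan.append((bucket, key, f"{target_dir}/{key}"))
    return plan


plan = download_plan(SOURCES)

# Each tuple lines up with the positional arguments of the S3 client's
# download_file(Bucket, Key, Filename), for example:
#   for bucket, key, filename in plan:
#       cl.download_file(bucket, key, filename)
```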
