Commit fb93e05 by murilo, committed Sep 4, 2020 (1 parent: 091b971). Showing 1 changed file with 46 additions and 1 deletion.
# iFood Data Architect Test

> This is my solution to the [iFood Data Architect Test](https://github.com/ifood/ifood-data-architect-test)
## Test Scope

Process semi-structured data and build a datalake that provides efficient storage and performance. The datalake must be organized in the following 2 layers:
* raw layer: Datasets must have the same schema as the source, but support fast structured data reading
* trusted layer: datamarts as required by the analysis team

![Datalake](./datalake.png)

Use whatever language, storage and tools you feel comfortable with.

Also, briefly elaborate on your solution, datalake architecture, nomenclature, partitioning, data model and validation method.

Once completed, you may submit your solution to [email protected] with the subject: iFood DArch Case Solution / Candidate Name.

## Requirements

* Source files:
  * Order: s3://ifood-data-architect-test-source/order.json.gz
  * Order Statuses: s3://ifood-data-architect-test-source/status.json.gz
  * Restaurant: s3://ifood-data-architect-test-source/restaurant.csv.gz
  * Consumer: s3://ifood-data-architect-test-source/consumer.csv.gz
* Raw Layer (same schema as the source):
  * Order dataset.
  * Order Statuses dataset.
  * Restaurant dataset.
  * Consumer dataset.
* Trusted Layer:
  * Order dataset - one line per order with all data from order, consumer, restaurant and the LAST status from the order statuses dataset. To help analysis, it would be nice to have the data partitioned by the restaurant's LOCAL date.
  * Order Items dataset - easy-to-read dataset with a one-to-many relationship with the Order dataset. Must contain all data from the _order_ items column.
  * Order Statuses dataset - one line per order with the timestamp for each registered event: CONCLUDED, REGISTERED, CANCELLED, PLACED.
* For the trusted layer, anonymize any sensitive data.
* At the end of each ETL, use any appropriate method to validate your data.
* Read performance: watch out for small files and skewed data.
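A minimal sketch of the Order Statuses transformation described above, written in plain Python rather than any particular engine. It collapses status events into one row per order with a timestamp column per event, and also derives the last status needed by the Order dataset. The field names `order_id`, `value` and `created_at`, the `last_status` column and the sample rows are assumptions for illustration. Timestamps are assumed to be ISO 8601 strings in a single time zone, so string comparison follows time order.

```python
from collections import defaultdict

# Event names come from the requirement above; the input field names
# (order_id, value, created_at) are assumed, not confirmed by the source.
EVENTS = ("REGISTERED", "PLACED", "CONCLUDED", "CANCELLED")


def pivot_statuses(events):
    """Return one row per order with a timestamp column per event type."""
    rows = defaultdict(lambda: dict.fromkeys(EVENTS))
    latest = {}
    for ev in events:
        oid, name, ts = ev["order_id"], ev["value"], ev["created_at"]
        row = rows[oid]
        # Keep the most recent timestamp if the same event repeats.
        if name in row and (row[name] is None or ts > row[name]):
            row[name] = ts
        # Track the latest event overall to derive the LAST status.
        if oid not in latest or ts > latest[oid][0]:
            latest[oid] = (ts, name)
    return [
        {"order_id": oid, **row, "last_status": latest[oid][1]}
        for oid, row in sorted(rows.items())
    ]


# Made-up sample events, for illustration only.
sample = [
    {"order_id": "a1", "value": "REGISTERED", "created_at": "2019-01-17T22:50:06Z"},
    {"order_id": "a1", "value": "PLACED", "created_at": "2019-01-17T22:50:07Z"},
    {"order_id": "a1", "value": "CONCLUDED", "created_at": "2019-01-18T00:50:10Z"},
    {"order_id": "b2", "value": "REGISTERED", "created_at": "2019-01-17T23:00:00Z"},
    {"order_id": "b2", "value": "CANCELLED", "created_at": "2019-01-17T23:05:00Z"},
]
for row in pivot_statuses(sample):
    print(row)
```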
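One possible way to handle the sensitive-data requirement is to replace identifying values with a keyed hash, which hides the original value while keeping it usable as a join key. Strictly speaking this is pseudonymization rather than full anonymization. The sketch below uses Python's `hmac` and `hashlib` modules; the column names, the sample record and the key are hypothetical placeholders.

```python
import hashlib
import hmac

# Hypothetical column names; which fields count as sensitive depends on the data.
SENSITIVE_FIELDS = ("customer_name", "cpf", "customer_phone_number")


def pseudonymize(value, secret_key):
    """Return a keyed SHA-256 digest so equal inputs map to equal tokens."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record, secret_key, fields=SENSITIVE_FIELDS):
    """Copy a record, replacing non-null sensitive fields with their digests."""
    out = dict(record)
    for field in fields:
        if out.get(field) is not None:
            out[field] = pseudonymize(str(out[field]), secret_key)
    return out


# Placeholder key and made-up record; a real key should come from a secret store.
key = b"replace-with-a-managed-secret"
row = {"order_id": "a1", "customer_name": "Jane Doe", "cpf": "00000000000"}
masked = anonymize_record(row, key)
```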
## Non-functional requirements
* Data volume increases each day. All ETLs must be built to be scalable.
* Use any data storage you feel comfortable with.
* Document your solution.
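To keep reads fast as daily volume grows, the number of files written per partition can be derived from the partition's size instead of being hard-coded. A rough sketch, assuming a hypothetical target of 128 MiB per file and made-up partition sizes:

```python
# Assumed target file size; the right value depends on the storage and engine.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def output_file_count(partition_bytes, target_bytes=TARGET_FILE_BYTES):
    """Return a file count that keeps the average file at or below the target."""
    if partition_bytes <= 0:
        return 0
    # Ceiling division with integers avoids floating-point rounding.
    return (partition_bytes + target_bytes - 1) // target_bytes


def plan_files(sizes_by_partition, target_bytes=TARGET_FILE_BYTES):
    """Map each partition to its file count; larger partitions get more files."""
    return {
        partition: output_file_count(size, target_bytes)
        for partition, size in sizes_by_partition.items()
    }


# Made-up partition sizes in bytes, keyed by local date.
sizes = {"2019-01-17": 900 * 1024 * 1024, "2019-01-18": 40 * 1024 * 1024}
plan = plan_files(sizes)
```

Sizing files per partition addresses both concerns named in the requirements: small partitions are not split into many tiny files, and a skewed partition is spread over several files of similar size.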

## Hints
* Databricks Community: https://community.cloud.databricks.com
* all-spark-notebook docker: https://hub.docker.com/r/jupyter/all-spark-notebook/