GitHub - Rolazo/globant-project

Globant's Data Engineering Coding Challenge

Local Rest API + Database + Cloud Infrastracture

Creation of a local REST API that can:

Receive historical data from CSV files
Upload these files to the new DB
Be able to insert batch transactions (1 up to 1000 rows) with one request

Section: SQL

Based on the data provied for the databases, The stakeholders ask for some specific metrics they need.

Create the following queries with the following information:

First Query: FirstNumber of employees hired for each job and department in 2021 divided by quarter. The table must be ordered alphabetically by department and job.

Second Query: List of ids, name and number of employees hired of each department that hired more employees than the mean of employees hired in 2021 for all the departments, ordered by the number of employees hired (descending).

Section: Cloud, Testing & Containers

Host Arquitecture in any public cloud
Add Automated test to the API
Containerize your application

CSV files structures

hired_emplotees.csv schema

id	name	datetime	department_id	job_id
INTEGER	STRING	STRING (ISO format)	INTEGER	INTEGER

Example hired_emplotees.csv

id	name	datetime	department_id	job_id
1	John Doe	2021-01-01T00:00:00Z	1	1
2	Jane Smith	2021-02-15T09:30:00Z	2	3
3	Michael Johnson	2021-03-27T16:45:00Z	3	2

departments.csv schema

id	department
INTEGER	STRING

example departments.csv

id	department
1	Supply Chain
2	Maintenance
3	Staff

jobs.csv schema

id	job
INTEGER	STRING

example jobs.csv

id	job
1	Recruiter
2	Manager
3	Analyst

How to reproduce the infrastructure?

Start by pulling the repository Then, create a virtual environment using python 3.8 and install the requirements following the next steps (you need to have installed Python 3.8):

You need to have installed sqlite3 3.41.2 latest stable

sudo apt install sqlite3

(Optional) For ease visualization tool you can install SQLite Browser

sudo apt install sqlitebrowser

Since this project has used GCP as public cloud provider, install Cloud SDK into your local machine. You can log into your GCP project from Cloud SDK

curl https://sdk.cloud.google.com | bash

After all the packages necesarry installed, go to the root repository

How to insert raw data to my database from CSV files?

Before adding raw data into your database, be aware of running at least one time your main.py file, since it has classes to create your database schema using SQLAlchemy Library, otherwise it wont work

To import data from CSV files into your pre-defined tables in SQLite using DB Browser for SQLite, you can follow these steps:

Open DB Browser for SQLite and open your database file.

Click on the "Execute SQL" tab to execute SQL queries.

Create temporary tables to match your pre-defined table structure. For example, you can create temporary tables for hired_employees, departments, and jobs as follows:

CREATE TABLE temp_hired_employees (
  id INTEGER,
  name TEXT,
  hire_datetime TEXT,
  department_id INTEGER,
  job_id INTEGER
);

CREATE TABLE temp_departments (
  id INTEGER,
  department TEXT
);

CREATE TABLE temp_jobs (
  id INTEGER,
  job TEXT
);

Import the CSV data into the temporary tables using the "Import" feature in DB Browser for SQLite:

Go to the "File" menu and select "Import" > "Table from CSV file." Choose the CSV file containing the data for each table and select the corresponding temporary table. Map the CSV columns to the table columns appropriately. Click "OK" to import the data into the temporary tables.

You can create raw data here --> https://extendsclass.com/csv-generator.html

Insert the data from the temporary tables into your pre-defined tables:

Execute the following SQL queries:

INSERT INTO hired_employees (id, name, hire_datetime, department_id, job_id)
SELECT id, name, hire_datetime, department_id, job_id
FROM temp_hired_employees;

INSERT INTO departments (id, department)
SELECT id, department
FROM temp_departments;

INSERT INTO jobs (id, job)
SELECT id, job
FROM temp_jobs;

Finally drop the temp tables created

DROP TABLE temp_hired_employees;
DROP TABLE temp_departments;
DROP TABLE temp_jobs;

How to run the infrastructure?

Option 1:

After creating the virtual environment, install the requerirements

pip3 install -r requirements.txt

Then run:

python main.py

Option 2:

Build and run the docker container:

docker build . -t my-flask-app

docker run -p 5000:5000 my-flask-app

How to deploy and how is deployed?

Currenlty the model is deploted in a cluster DockerHub cluster.

There are currenlty 3 endpoints

"/" Its the home of the website, where the user is going to be capable of upload data or select the specific query
"/upload" A new interface when the user can upload data, if the data is not matched with the schema it will be notified
"/query" It is the result of the query selected (either query1 or query2) Data must be present in the database otherwise it will be empty

Next Steps

To have the database running in GCP you must do the following:

You should dump the sqlite database, since theres no sqlite cloud instance in GCP. After dumping your database you cloud create a Cloud SQL instance and insert the dumping file to your SQL instance created.
Upload your docker image into Google Container Registry Change with the problem that you have created in GCP

docker tag my-flask-app gcr.io/<PROJECT-ID>/my-flask-app

Deplot the flask application into Google App Engine using Google Cloud SDK (gcloud) In the root directory of your Flask application, create an app.yaml file with the following content:

runtime: python3.8
entrypoint: gunicorn -b :$PORT app:app

env_variables:
  # Add any necessary environment variables for your Flask application
  FLASK_APP: "main
  DB_CONNECTION: "sqlite:///unix_socket/<path_to_instance>/<database_name>.db?charset=utf8"

And last run this command:

gcloud app deploy

Well done!, You have a full API Rest aplication + database running on GCP with AppEngine :D!

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
__pycache__		__pycache__
instance		instance
static/css		static/css
templates		templates
venv		venv
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
config.py		config.py
main.py		main.py
requirements.txt		requirements.txt
test_main.py		test_main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Globant's Data Engineering Coding Challenge

Local Rest API + Database + Cloud Infrastracture

Section: SQL

Section: Cloud, Testing & Containers

CSV files structures

How to reproduce the infrastructure?

How to insert raw data to my database from CSV files?

How to run the infrastructure?

Option 1:

Option 2:

How to deploy and how is deployed?

Next Steps

About

Releases

Packages

Languages

License

Rolazo/globant-project

Folders and files

Latest commit

History

Repository files navigation

Globant's Data Engineering Coding Challenge

Local Rest API + Database + Cloud Infrastracture

Section: SQL

Section: Cloud, Testing & Containers

CSV files structures

How to reproduce the infrastructure?

How to insert raw data to my database from CSV files?

How to run the infrastructure?

Option 1:

Option 2:

How to deploy and how is deployed?

Next Steps

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages