Orchestration of data processing tasks to power the reporting TPOT ETP API
- Clone the project and cd into the folder:
git clone https://github.com/workforce-data-initiative/tpot-airflow.git && cd tpot-airflow
To try it out quickly using Docker, run:
docker-compose up
and explore the UI at localhost:8080.
Then start the scheduler in that same container:
docker-compose exec web airflow scheduler
- Install the requirements (preferably in a virtual environment):
pip install -r requirements.txt
Note that the project uses Python 3.6.2 in development.
- Prepare the home for airflow:
export AIRFLOW_HOME=$(pwd)
Then follow steps 1 to 3 below. Running
sh setup.sh
performs steps 1, 2 and 3 in a single script. Afterwards, go to localhost:8080.
- Step 1. Initialize the meta database by running:
airflow initdb
- Step 2. Set up airflow:
python config/remove_airflow_examples.py
airflow resetdb -y
export APP=TPOT [or some other name] (Optional)
python config/customize_dashboard.dev.py (Optional)
Running the customize_dashboard.dev.py script customizes the dashboard to read TPOT - Airflow instead of Airflow.
- Step 3. Start the airflow webserver and explore the UI at localhost:8080:
airflow webserver
Note that you have optional arguments:
-p=8080, --port=8080 to specify which port to run the server on
-w=4, --workers=4 to specify the number of workers to run the webserver on
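The manual steps above can be replayed from one small script. The following is only a sketch, not the project's setup.sh (whose contents are not shown here): the run helper, the DRY_RUN switch, and the PORT and WORKERS variables are hypothetical names introduced for illustration. It is a dry run by default, so it prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the manual setup steps as one script. This is NOT the project's
# setup.sh; it only replays the commands listed in this README.

# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them (requires airflow to be installed).
DRY_RUN="${DRY_RUN:-1}"
PORT="${PORT:-8080}"       # webserver port (assumed default)
WORKERS="${WORKERS:-4}"    # number of webserver workers (assumed default)

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it
# and stop the script if it fails.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@" || { echo "failed: $*" >&2; exit 1; }
    fi
}

# Prepare the home for airflow (defaults to the current directory).
export AIRFLOW_HOME="${AIRFLOW_HOME:-$(pwd)}"

run airflow initdb                                         # step 1
run python config/remove_airflow_examples.py               # step 2
run airflow resetdb -y
run airflow webserver --port "$PORT" --workers "$WORKERS"  # step 3
```

In dry-run mode each command is echoed with a leading "+ ". Running it with DRY_RUN=0 from the project folder would execute the same commands in order, ending with the webserver in the foreground.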
Build the development Docker image, then run the Heroku script:
docker build -t tpot-airflow -f Dockerfile.dev .
sh heroku.sh
- Set up an EC2 instance in AWS (ensure that you download the .pem file).
- Authorise inbound traffic for this instance by adding a rule to the security group to accept traffic on port 8080 (explained here).
- Connect to the instance via ssh (explained here). Run the following:
sudo yum install git
git clone https://github.com/workforce-data-initiative/tpot-airflow.git
cd tpot-airflow
sh aws_setup.sh
sh docker_setup.sh
logout
Then ssh into the instance again to pick up the new docker group permissions, and run:
tmux
docker-compose up -d
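Because the stack is started in detached mode, it can be useful to confirm that the UI answers on port 8080 before moving on. Below is a minimal sketch, assuming curl is installed on the instance. The wait_for_ui helper and its arguments are hypothetical and not part of the repository.

```shell
#!/bin/sh
# wait_for_ui URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL with curl until it answers without an HTTP error,
# or gives up after ATTEMPTS tries. Hypothetical helper for illustration.
wait_for_ui() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # --fail makes curl return non-zero on HTTP errors (status >= 400).
        if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
            echo "UI is up at $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "gave up waiting for $url" >&2
    return 1
}

# Example (run on the instance after docker-compose up -d):
#   wait_for_ui http://localhost:8080 30 2
```

The function returns 0 as soon as the page responds and 1 if it never does, so it can be chained with && in front of later commands.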
It is advised that changes to the codebase are made on GitHub. Pull any updates onto the instance by running:
git pull origin master
(or the relevant branch instead of master)
To ssh into an already running instance, ask for the .pem file and run:
ssh -i "<>.pem" ec2-user@<Public DNS>
For example: ssh -i "airflow.pem" ec2-user@<Public DNS>
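ssh refuses private keys whose file permissions are too open, so it usually helps to restrict the .pem file before connecting. The sketch below wraps both steps. The connect_ec2 helper is a hypothetical name, and the key file and Public DNS are placeholders you would supply yourself.

```shell
#!/bin/sh
# connect_ec2 KEY_FILE PUBLIC_DNS
# Restricts the key's permissions, then opens an ssh session as ec2-user.
# Hypothetical helper for illustration; not part of the repository.
connect_ec2() {
    key_file="$1"
    public_dns="$2"
    if [ -z "$key_file" ] || [ -z "$public_dns" ]; then
        echo "usage: connect_ec2 KEY_FILE PUBLIC_DNS" >&2
        return 2
    fi
    if [ ! -f "$key_file" ]; then
        echo "key file not found: $key_file" >&2
        return 1
    fi
    # Make the key readable by its owner only.
    chmod 400 "$key_file"
    ssh -i "$key_file" "ec2-user@$public_dns"
}

# Example:
#   connect_ec2 airflow.pem "<Public DNS>"
```

The helper returns 2 when arguments are missing and 1 when the key file does not exist, so a wrong path is reported before ssh is attempted.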
You'll need to ssh into the instance to set up keys that are intentionally not included in the codebase.
Please confirm that the issue has not already been raised, then open an issue for support.
Please contribute using GitHub Flow. Create a branch, add commits, and open a pull request.