This open source project provides a solid foundation for building an end-to-end data pipeline using only open source technology, with no licensed solutions. You can use it for your own learning and incorporate it into your data science workflows.
Our motivation is to empower you to create big data workflows from beginning to end.
- Install python3 on your workstation
- Install Docker on your workstation
Note: if you are running Docker Desktop, allocate at least 4 GB of memory and 4 CPUs.
- Right-click the Docker Desktop icon
- Select Preferences
- Select Resources
- Set CPUs = 4
- Set Memory to at least 4 GB
- Press the Apply & Restart button to apply the changes.
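The prerequisites above can be checked from a terminal before continuing. The following is a minimal sketch, not part of the project: the thresholds `MIN_CPUS` and `MIN_MEM_GIB` and the helper names `bytes_to_gib` and `meets_minimum` are illustrative assumptions. It relies on `docker info --format` to read the CPU count and total memory visible to the Docker daemon. The reported memory can be slightly lower than the value configured in Docker Desktop, so the sketch rounds to the nearest GiB.

```shell
#!/usr/bin/env sh
# Preflight sketch: thresholds mirror the Docker Desktop steps above (assumed values).
MIN_CPUS=4
MIN_MEM_GIB=4

# Round a byte count to the nearest whole GiB.
bytes_to_gib() {
  gib=$(( 1024 * 1024 * 1024 ))
  echo $(( ($1 + gib / 2) / gib ))
}

# Succeed when the CPU count ($1) and memory in bytes ($2) meet the minimums.
meets_minimum() {
  mm_gib=$(bytes_to_gib "$2")
  [ "$1" -ge "$MIN_CPUS" ] && [ "$mm_gib" -ge "$MIN_MEM_GIB" ]
}

# Report whether the required tools are on the PATH.
for tool in python3 docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done

# Ask the Docker daemon how many CPUs and how much memory it can see.
if info=$(docker info --format '{{.NCPU}} {{.MemTotal}}' 2>/dev/null); then
  set -- $info
  cpus=${1:-0}
  mem_bytes=${2:-0}
  if meets_minimum "$cpus" "$mem_bytes"; then
    echo "Docker resources look sufficient: $cpus CPUs, $(bytes_to_gib "$mem_bytes") GiB"
  else
    echo "Docker resources look low: $cpus CPUs, $(bytes_to_gib "$mem_bytes") GiB"
  fi
else
  echo "Could not query Docker; is the daemon running?"
fi
```

If the script reports low resources, revisit the Resources settings and restart Docker Desktop before starting the containers.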
- Set up and install Docker
- Download the Kafka connectors into a jars directory
mkdir jars
cd jars/
curl -L -O https://cassandra-kafka-elasticsearch-open-source.s3-us-west-1.amazonaws.com/kafka-connect-rest-plugin-1.0.3-shaded.jar
curl -L -O https://cassandra-kafka-elasticsearch-open-source.s3-us-west-1.amazonaws.com/kafka-connect-transform-add-headers-1.0.3-shaded.jar
curl -L -O https://cassandra-kafka-elasticsearch-open-source.s3-us-west-1.amazonaws.com/kafka-connect-transform-from-json-plugin-1.0.3-shaded.jar
curl -L -O https://cassandra-kafka-elasticsearch-open-source.s3-us-west-1.amazonaws.com/kafka-connect-transform-velocity-eval-1.0.3-shaded.jar
curl -L -O https://cassandra-kafka-elasticsearch-open-source.s3-us-west-1.amazonaws.com/kafka-connect-elastic6-1.2.3-2.1.0-all.jar
curl -L -O https://cassandra-kafka-elasticsearch-open-source.s3-us-west-1.amazonaws.com/kafka-connect-cassandra-1.2.3-2.1.0-all.jar
cd ..
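The six downloads above can also be scripted so that a re-run skips files that are already present. This is a sketch under stated assumptions: the helper names `jar_url` and `download_jars` and the `RUN_DOWNLOAD` switch are hypothetical and not part of the project. The base URL and file names are copied from the commands above.

```shell
#!/usr/bin/env sh
# Base URL and jar names are taken from the curl commands above.
BASE_URL="https://cassandra-kafka-elasticsearch-open-source.s3-us-west-1.amazonaws.com"
JARS="
kafka-connect-rest-plugin-1.0.3-shaded.jar
kafka-connect-transform-add-headers-1.0.3-shaded.jar
kafka-connect-transform-from-json-plugin-1.0.3-shaded.jar
kafka-connect-transform-velocity-eval-1.0.3-shaded.jar
kafka-connect-elastic6-1.2.3-2.1.0-all.jar
kafka-connect-cassandra-1.2.3-2.1.0-all.jar
"

# Print the full download URL for a jar file name.
jar_url() {
  echo "${BASE_URL}/$1"
}

# Download every jar into ./jars, skipping files that already exist and are non-empty.
download_jars() {
  mkdir -p jars
  for jar in $JARS; do
    if [ -s "jars/$jar" ]; then
      echo "skip: $jar"
    else
      # -f fails on HTTP errors, -L follows redirects, -o sets the output path.
      curl -fL -o "jars/$jar" "$(jar_url "$jar")" || return 1
    fi
  done
}

# Hypothetical switch: only download when RUN_DOWNLOAD=1 is set.
if [ "${RUN_DOWNLOAD:-0}" = "1" ]; then
  download_jars
fi
```

Assuming the script is saved as `download_jars.sh` in the project root (a hypothetical file name), running `RUN_DOWNLOAD=1 sh download_jars.sh` would fetch the connectors into `jars/`.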
- Start the containers, recreating them and renewing their anonymous volumes
docker-compose up --force-recreate -V
- Open COVID 19 Lab to run a local COVID-19 data science workbench