- Download all required files from the URL below:
https://drive.google.com/drive/folders/1rBauyUVCRTbnKXgkMGh4l9MdIOVj8CQc?usp=sharing
- Install the Java .exe file.
note: set the Java installation path to the "C:" drive
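After installing, you can verify Java from a new terminal; it should report a 1.8 version matching your install:
java -version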
- Extract the Spark archive to the C: drive.
- Extract the Kafka archive to the C: drive.
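The Kafka steps later in this guide assume a broker is running. A typical way to start one from the extracted Kafka directory on Windows (paths are the distribution defaults) is:
bin\windows\zookeeper-server-start.bat config\zookeeper.properties
bin\windows\kafka-server-start.bat config\server.properties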
- Add the following environment variables:
| ENVIRONMENT VARIABLE NAME | VALUE |
| --- | --- |
| HADOOP_HOME | C:\winutils |
| JAVA_HOME | C:\Java\jdk1.8.0_202 |
| SPARK_HOME | C:\spark-3.0.3-bin-hadoop2.7 |
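Once set, a quick way to confirm the variables are visible from Python (a minimal check, nothing project-specific) is:

```python
import os

# Each variable configured above should print the path from the table.
for name in ("HADOOP_HOME", "JAVA_HOME", "SPARK_HOME"):
    print(name, "=", os.environ.get(name, "<not set>"))
```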
- Select the Path variable under environment variables and add the values below:
%SPARK_HOME%\bin
%HADOOP_HOME%\bin
%JAVA_HOME%\bin
C:\Java\jre1.8.0_281\bin
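To confirm the Path entries took effect, open a new terminal and check that Spark is reachable:
spark-submit --version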
- Open a conda terminal and execute the command below:
conda create -n <env_name> python=3.8 -y
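Activate the environment before installing anything into it:
conda activate <env_name>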
- Select the <env_name> environment created in the previous step as the project interpreter in PyCharm.
- Install all the necessary Python libraries specified in the requirements.txt file using the command below:
pip install -r requirements.txt
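Assuming requirements.txt includes PySpark (check the file; this is an assumption), you can verify the install with:
python -c "import pyspark; print(pyspark.__version__)"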
- To upload your code to a GitHub repo:
git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin <github_repo_link>
git push -u origin main
- Run the training pipeline stages in order (a sketch of the final stage follows this list):
python training\stage_00_data_loader.py
python training\stage_01_data_validator.py
python training\stage_02_data_transformer.py
python training\stage_03_data_exporter.py
spark-submit training\stage_04_model_trainer.py
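As a rough illustration of what a spark-submit training stage typically does (a minimal hypothetical sketch; the file paths, column names, and model are assumptions, not the repo's actual stage_04_model_trainer.py):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Hypothetical paths and columns for illustration; the real stage may differ.
spark = SparkSession.builder.appName("model_trainer").getOrCreate()

df = spark.read.csv("artifacts/transformed_data.csv", header=True, inferSchema=True)

# Assemble feature columns into a single vector, then fit a regression model.
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(df)
model.write().overwrite().save("artifacts/trained_model")

spark.stop()
```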
- Run the prediction pipeline stages in order (see the sketch after this list):
python prediction\stage_00_data_loader.py
python prediction\stage_01_data_validator.py
python prediction\stage_02_data_transformer.py
python prediction\stage_03_data_exporter.py
spark-submit prediction\stage_04_model_predictor.py
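Likewise, a spark-submit prediction stage usually loads the saved model and scores new data (again a hypothetical sketch, not the repo's stage_04_model_predictor.py):

```python
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

# Hypothetical paths for illustration only.
spark = SparkSession.builder.appName("model_predictor").getOrCreate()

df = spark.read.csv("artifacts/prediction_input.csv", header=True, inferSchema=True)

# Load the pipeline saved by the training stage and add a "prediction" column.
model = PipelineModel.load("artifacts/trained_model")
model.transform(df).write.mode("overwrite").csv("artifacts/predictions", header=True)

spark.stop()
```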
- Stream the exported CSV data to a Kafka topic (a producer sketch follows):
spark-submit csv_to_kafka.py
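A rough sketch of what a CSV-to-Kafka producer could look like using Spark's Kafka sink (the input path, broker address, and topic name are assumptions; if the real script uses this sink, it would also need the --packages flag shown in the next command):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

# Hypothetical input path, broker address, and topic; csv_to_kafka.py may differ.
spark = SparkSession.builder.appName("csv_to_kafka").getOrCreate()

df = spark.read.csv("artifacts/predictions", header=True, inferSchema=True)

# Kafka expects a string or binary "value" column, so serialize each row to JSON.
(df.select(to_json(struct(*df.columns)).alias("value"))
   .write.format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")
   .option("topic", "predictions")
   .save())

spark.stop()
```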
- Consume the records from Kafka with Spark Structured Streaming (a consumer sketch follows):
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 spark_consumer_from_kafka.py
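For reference, a minimal Structured Streaming consumer that matches the --packages coordinate above might look like this (broker address and topic are assumptions; align them with the producer):

```python
from pyspark.sql import SparkSession

# Hypothetical broker and topic for illustration.
spark = SparkSession.builder.appName("spark_consumer_from_kafka").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "predictions")
          .option("startingOffsets", "earliest")
          .load())

# Kafka values arrive as binary; cast to string before displaying on the console.
query = (stream.selectExpr("CAST(value AS STRING) AS value")
         .writeStream.format("console")
         .outputMode("append")
         .start())

query.awaitTermination()
```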
Credits: Avnish Yadav