Gathers data science and machine learning problem-solving examples using PySpark, Hadoop, Hive Metastore, Delta Lake, PySpark ML, and PySpark NLP.
- Simple PySpark SQL, 1.sql.ipynb.
  - Run SQL queries on a PySpark DataFrame; see the first sketch after this list.
- Simple DataFrame download from HDFS, 2.simple-hdfs.ipynb.
  - Create a PySpark DataFrame from files stored in HDFS.
- Simple PySpark SQL with Hive Metastore, 3.simple-hive.ipynb.
  - Use PySpark SQL backed by a Hive Metastore.
- Simple Delta Lake, 4.simple-delta-lake.ipynb.
  - Read and write tables in the Delta Lake format.
- Delete, Update, and Upsert using Delta, 5.delta-delete-update-upsert.ipynb.
  - Simple DELETE, UPDATE, and UPSERT operations using Delta Lake; see the MERGE sketch after this list.
- Structured streaming using Delta, 6.delta-structured-streaming-data.ipynb.
  - Simple structured streaming with upsert using Delta streaming.
- Kafka structured streaming using Delta, 7.kafka-structured-streaming-delta.ipynb.
  - Kafka structured streaming from PostgreSQL CDC using Debezium, with upsert via Delta streaming.
- PySpark ML text classification, 8.text-classification.ipynb.
  - Text classification using multinomial logistic regression in PySpark ML; see the pipeline sketch after this list.
- PySpark ML word vector, 9.word-vector.ipynb.
  - Train word vectors in PySpark ML.
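The first few notebooks share one core flow: wrap data in a DataFrame, register it as a temporary view, and query it with SQL. A minimal sketch, with invented data and column names (the HDFS URI in the comment is also an assumption):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("simple-sql").getOrCreate()

# Wrap in-memory rows in a DataFrame and expose it to SQL as a temp view.
df = spark.createDataFrame([("anna", 31), ("budi", 25)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

# Reading from HDFS works the same way; this URI assumes the hdfs container
# maps the NameNode to localhost:9000:
# df = spark.read.csv("hdfs://localhost:9000/data/people.csv", header=True)
```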
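The Delta Lake notebooks center on `MERGE`, which covers update and insert in one statement. A hedged sketch, assuming the delta-spark pip package and a hypothetical table path and key column:

```python
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Enable the Delta SQL extension and catalog on the session.
builder = (
    SparkSession.builder.appName("delta-upsert")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

updates = spark.createDataFrame([(1, "updated"), (3, "new")], ["id", "value"])
target = DeltaTable.forPath(spark, "/tmp/delta/employee")  # hypothetical path

# MERGE: update rows whose id already exists, insert the rest.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```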
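For the PySpark ML notebooks, a typical text-classification pipeline chains a tokenizer, a term-frequency featurizer, and a classifier. A sketch with toy data; the actual notebook may use different features and a real dataset:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-clf").getOrCreate()

train = spark.createDataFrame(
    [("spark is great", 1.0), ("i hate bugs", 0.0)],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(family="multinomial", maxIter=10),
])
model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show()
```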
- Run HDFS and PostgreSQL for Hive Metastore,

```bash
docker container rm -f hdfs postgres
docker-compose -f misc.yaml up --build
```
- Create the Hive Metastore schema in PostgreSQL,

```bash
PGPASSWORD=postgres docker exec -it postgres psql -U postgres -d postgres -c "$(cat hive-schema-3.1.0.postgres.sql)"
```
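Once the schema is loaded, a SparkSession can point at this PostgreSQL-backed metastore. A hedged sketch: the JDBC URL, credentials, and host name are assumptions based on the services above (use localhost and the mapped port if running outside the docker network), and the PostgreSQL JDBC driver must be on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive")
    # Point the embedded Hive client at the PostgreSQL metastore database.
    .config(
        "spark.hadoop.javax.jdo.option.ConnectionURL",
        "jdbc:postgresql://postgres:5432/postgres",
    )
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "postgres")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "postgres")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()
```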
- Run Kafka and Debezium for PostgreSQL CDC,

```bash
docker-compose -f kafka.yaml up
docker exec postgresql bash -c \
'PGPASSWORD=postgres psql -d postgres -U postgres -c "$(cat /bitnami/postgresql/conf/table.sql)"'
```
- Register the PostgreSQL CDC connector,

```bash
curl --location --request POST http://localhost:8083/connectors/ \
--header 'Content-Type: application/json' \
--data-raw '{
  "name": "employee-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "plugin.name": "pgoutput",
    "database.hostname": "postgresql",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "postgres",
    "database.server.name": "employee",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "true",
    "table.whitelist": "public.employee,public.salary",
    "tombstones.on.delete": "false",
    "decimal.handling.mode": "double"
  }
}'
```
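Once the connector is registered, Debezium publishes change events to topics named `{database.server.name}.{schema}.{table}`, so the employee table lands on `employee.public.employee`. A hedged sketch of consuming that feed with structured streaming; the bootstrap address is an assumption about kafka.yaml, and the spark-sql-kafka-0-10 package must be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc").getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed mapping
    .option("subscribe", "employee.public.employee")
    .option("startingOffsets", "earliest")
    .load()
)

# Debezium events arrive as JSON in the Kafka value; parsing the payload and
# MERGE-ing it into Delta is what 7.kafka-structured-streaming-delta.ipynb does.
query = (
    stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream.format("console")
    .start()
)
query.awaitTermination()
```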
- Build the image,

```bash
docker build -t spark spark
```
- Run docker-compose,

```bash
docker-compose up -d
```

Feel free to scale up the workers,

```bash
docker-compose scale worker=2
```
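To confirm the scaled cluster picks up work, point a SparkSession at the standalone master. A sketch assuming the default master port 7077 is mapped to localhost; check docker-compose.yaml for the actual mapping:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("spark://localhost:7077")
    .appName("hello-cluster")
    .getOrCreate()
)

# A trivial job; its tasks should show up on the workers in the Spark UI.
print(spark.range(10).count())
```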