Practice-Pyspark

Gathers data science and machine learning problem-solving notebooks using PySpark, Hadoop, Hive Metastore, Delta Lake, PySpark ML and PySpark NLP.

Notebooks

  1. Simple PySpark SQL, 1.sql.ipynb.
  • Simple PySpark SQL (a minimal sketch follows this list).
  2. Simple DataFrame download from HDFS, 2.simple-hdfs.ipynb.
  • Create a PySpark DataFrame from HDFS.
  3. Simple PySpark SQL with Hive Metastore, 3.simple-hive.ipynb.
  • Use PySpark SQL with a Hive Metastore.
  4. Simple Delta Lake, 4.simple-delta-lake.ipynb.
  • Simple Delta Lake.
  5. Delete, Update and Upsert using Delta, 5.delta-delete-update-upsert.ipynb.
  • Simple Delete, Update and Upsert using Delta Lake.
  6. Structured streaming using Delta, 6.delta-structured-streaming-data.ipynb.
  • Simple structured streaming with Upsert using Delta streaming.
  7. Kafka structured streaming using Delta, 7.kafka-structured-streaming-delta.ipynb.
  • Kafka structured streaming from PostgreSQL CDC using Debezium, with Upsert using Delta streaming.
  8. PySpark ML text classification, 8.text-classification.ipynb.
  • Text classification using logistic regression and multinomial models in PySpark ML.
  9. PySpark ML word vector, 9.word-vector.ipynb.
  • Word vectors in PySpark ML.
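
As a taste of the first notebook, PySpark SQL mostly amounts to registering a DataFrame as a temporary view and querying it with SQL. A minimal sketch; the rows and column names below are made up for illustration, not taken from 1.sql.ipynb:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("simple-sql").getOrCreate()

# Made-up rows, only to show the temp-view + SQL pattern.
df = spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])
df.createOrReplaceTempView("people")

spark.sql("SELECT name, age FROM people WHERE age > 26").show()
```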

How-to

  1. Run HDFS and PostgreSQL for Hive Metastore,
docker container rm -f hdfs postgres
docker-compose -f misc.yaml up --build
  2. Create the Hive metastore schema in PostgreSQL,
PGPASSWORD=postgres docker exec -it postgres psql -U postgres -d postgres -c "$(cat hive-schema-3.1.0.postgres.sql)"
  3. Run Kafka and Debezium for PostgreSQL CDC,
docker-compose -f kafka.yaml up
docker exec postgresql bash -c \
'PGPASSWORD=postgres psql -d postgres -U postgres -c "$(cat /bitnami/postgresql/conf/table.sql)"'
  4. Register the PostgreSQL CDC connector (a PySpark consumer sketch follows this list),
curl --location --request POST http://localhost:8083/connectors/ \
--header 'Content-Type: application/json' \
--data-raw '{
  "name": "employee-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "plugin.name": "pgoutput",
    "database.hostname": "postgresql",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname" : "postgres",
    "database.server.name": "employee",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "true",
    "table.whitelist": "public.employee,public.salary",
    "tombstones.on.delete": "false",
    "decimal.handling.mode": "double"
  }
}'
  5. Build the Spark image,
docker build -t spark spark
  6. Run docker compose to start the Spark cluster,
docker-compose up -d
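
Once steps 3 and 4 are up, the CDC events can be consumed from PySpark with structured streaming and merged into a Delta table, which is what 6.delta-structured-streaming-data.ipynb and 7.kafka-structured-streaming-delta.ipynb cover. A rough sketch of that loop; the broker address, simplified row schema and Delta path are assumptions, and the real Debezium message wraps each row in a schema/payload envelope that the notebook unpacks first:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Assumes the Spark image built above ships the Kafka and Delta jars.
spark = (
    SparkSession.builder.appName("cdc-upsert")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Debezium names topics <database.server.name>.<schema>.<table>,
# so the employee table lands on employee.public.employee.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed exposed port
    .option("subscribe", "employee.public.employee")
    .option("startingOffsets", "earliest")
    .load()
)

# Heavily simplified row schema, for illustration only.
row_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
rows = (
    raw.selectExpr("CAST(value AS STRING) AS value")
    .select(F.from_json("value", row_schema).alias("row"))
    .select("row.*")
)

def upsert(batch_df, _batch_id):
    # Hypothetical Delta table path; the table must already exist there.
    target = DeltaTable.forPath(spark, "/data/delta/employee")
    (
        target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

rows.writeStream.foreachBatch(upsert).start().awaitTermination()
```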

Feel free to scale up the workers,

docker-compose scale worker=2
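
To check that jobs actually use the extra workers, point a SparkSession at the standalone master from the compose stack. spark://localhost:7077 is an assumption about which port the compose file exposes for the master; adjust it to match docker-compose.yaml:

```python
from pyspark.sql import SparkSession

# Master URL is an assumption about the compose file's port mapping.
spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .appName("cluster-check")
    .getOrCreate()
)

# A tiny job that fans out over whatever workers have registered.
print(spark.sparkContext.parallelize(range(1000), 8).sum())
```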