The most common first step in data processing applications is to take data from some source and get it into a format suitable for reporting and other forms of analytics. In a database, you would load a flat file and create indexes. In Spark, your first step is usually to clean and convert data from a text format into Parquet. Parquet is an optimized binary format that supports efficient reads, making it ideal for reporting and analytics.
Before you begin:
- Ensure your tenant is configured for Data Flow according to the set up administration instructions.
- Know your object store namespace.
- Know the OCID of a compartment where you want to load your data and create applications.
- (Optional but strongly recommended) Install Spark locally so you can test your code before deploying.
- Upload a sample CSV file of your choice to object store.
- Upload csv_to_parquet.py to object store.
- Create a Python Data Flow Application pointing to csv_to_parquet.py (refer to the Data Flow documentation for the steps to create an application). A minimal sketch of what the script does follows this list.
  - The Spark application requires two arguments: --input-path and --output-path. These must be OCI HDFS URIs pointing to your source CSV file and target output path. Put these in the Arguments field of the Data Flow Application.
  - Example Arguments field: --input-path oci://sample@namespace/input.csv --output-path oci://sample@namespace/output.parquet
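For reference, csv_to_parquet.py needs to do little more than read the CSV and write it back out as Parquet. The sketch below is a minimal illustration rather than the exact script; it assumes the two arguments described above are parsed with argparse, and it lets Spark infer the schema and column headers from the file.

import argparse
from pyspark.sql import SparkSession

def main():
    # Parse the two arguments supplied by the Data Flow Application.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-path", required=True, help="OCI HDFS URI of the source CSV file")
    parser.add_argument("--output-path", required=True, help="OCI HDFS URI for the Parquet output")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("CSV to Parquet").getOrCreate()

    # Read the CSV (header row, inferred schema), then write it out as Parquet.
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(args.input_path)
    )
    df.write.mode("overwrite").parquet(args.output_path)

    spark.stop()

if __name__ == "__main__":
    main()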
Set all these variables based on your OCI tenancy.
COMPARTMENT_ID=ocid1.compartment.oc1..<your_compartment_id>
NAMESPACE=my_object_storage_namespace
BUCKET=my_bucket
INPUT_PATH=oci://$BUCKET@$NAMESPACE/my_csv_file.csv
OUTPUT_PATH=oci://$BUCKET@$NAMESPACE/output_parquet
Run these commands to upload all files.
oci os bucket create --name $BUCKET --compartment-id $COMPARTMENT_ID
oci os object put --bucket-name $BUCKET --file my_csv_file.csv
oci os object put --bucket-name $BUCKET --file csv_to_parquet.py
Launch the Spark application to convert CSV to Parquet.
oci data-flow run submit \
--compartment-id $COMPARTMENT_ID \
--display-name "PySpark Convert CSV to Parquet" \
--execute "oci://$BUCKET@$NAMESPACE/csv_to_parquet.py --input-path $INPUT_PATH --output-path $OUTPUT_PATH"
Make note of the "id" field this command returns. When the run is finished, view its output using:
oci data-flow run get-log --run-id <run_id> --name spark_application_stdout.log.gz --file -
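If you installed Spark locally, you can also sanity-check the result by reading the Parquet output back. The snippet below is a minimal sketch that assumes your local Spark can resolve oci:// URIs (for example via the OCI HDFS connector); otherwise, download the output files with the OCI CLI first and point the path at the local copy.

from pyspark.sql import SparkSession

# Assumed path; substitute your own bucket and namespace.
output_path = "oci://my_bucket@my_object_storage_namespace/output_parquet"

spark = SparkSession.builder.appName("Inspect Parquet output").getOrCreate()
df = spark.read.parquet(output_path)
df.printSchema()   # confirm the inferred schema survived the conversion
df.show(10)        # spot-check the first few rows
spark.stop()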