Home
The purpose of this project is to provide tools for reading and processing data files uploaded from mainframes.

Data that comes from mainframe databases such as IMS and DB2 is presented in binary form, laid out according to a COBOL copybook (schema). A copybook is a schema definition of binary data written in the COBOL programming language. String fields are encoded in EBCDIC, and the data itself is stored in a binary format. Copybooks specify most of the binary data format and layout, but there are many caveats.
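For illustration, a minimal copybook might look like the sketch below. This is a hypothetical layout (the field names and picture clauses are made up): a `PIC X` field holds EBCDIC text, a `COMP` field holds a binary integer, and a `COMP-3` field holds a packed decimal number.

```cobol
       01  COMPANY-RECORD.
           05  COMPANY-ID       PIC 9(4)      COMP.
           05  COMPANY-NAME     PIC X(20).
           05  ACCOUNT-BALANCE  PIC S9(7)V99  COMP-3.
```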
The goal of the project is to provide as generic a COBOL reader for Spark as possible.
To use the COBOL files reader, just add this Maven dependency:
```xml
<dependency>
    <groupId>za.co.absa.cobrix</groupId>
    <artifactId>spark-cobol</artifactId>
    <version>0.1.6</version>
</dependency>
```
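If you build with sbt instead of Maven, the equivalent dependency line would be the following (a sketch assuming the same artifact coordinates as the Maven snippet above; newer releases may publish Scala-version-suffixed artifacts):

```scala
libraryDependencies += "za.co.absa.cobrix" % "spark-cobol" % "0.1.6"
```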
Here is a simple example Spark application that loads a COBOL copybook and a data file, and then prints the dataframe's schema and contents:
```scala
package za.co.absa.spark.app

import org.apache.spark.sql.{SaveMode, SparkSession}

object SmallSparkApplication {

  def main(args: Array[String]): Unit = {
    // Initialize a Spark session
    val spark = SparkSession
      .builder()
      .appName("Spark Cobol Application example 1")
      .master("local[*]")
      .getOrCreate()

    // COBOL binary files located in 'data/test1_data' are read
    // using the 'data/test1_copybook.cob' copybook as their schema.
    // The result is an ordinary Spark dataframe.
    val df = spark
      .read
      .format("cobol")
      .option("copybook", "data/test1_copybook.cob")
      .load("data/test1_data")

    // Print the Spark schema of the dataframe
    df.printSchema()

    // Fetch and show the first 20 records of the dataframe
    df.show()

    // Fetch the first 2 records and print them in JSON format
    println("The first 2 records in JSON format:")
    df.toJSON.take(2).foreach(println)

    // Save (convert) the results to Parquet format
    df.write.mode(SaveMode.Overwrite)
      .parquet("data/output")
  }
}
```
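To verify the conversion, the Parquet output can be read back with standard Spark APIs. This is a minimal sketch that continues the example above, using the same `spark` session and the 'data/output' path written by the application:

```scala
// Read the Parquet files written by the example above
val parquetDf = spark.read.parquet("data/output")

// The schema and data should match the original COBOL dataframe
parquetDf.printSchema()
parquetDf.show()
```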
TODO