Ruslan Yushchenko edited this page Aug 1, 2018 · 2 revisions

Cobrix - COBOL Data Source for Spark

The purpose of the project is to enable reading and processing of data files exported from mainframes.

Data that comes from mainframe databases such as IMS and DB2 is presented in a binary form laid out according to a COBOL copybook (schema). A copybook is a schema definition of binary data described in the COBOL programming language. The string encoding of the data is EBCDIC, and the data itself is stored in a binary format. Copybooks specify most of the binary data format and layout, but there are a lot of caveats.
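To illustrate two of the points above, here is a minimal sketch (not part of Cobrix itself) of decoding an EBCDIC text field and a COMP-3 packed-decimal field by hand. It assumes the JVM ships the IBM037 (US EBCDIC) charset, which most JDKs do; the object and method names are hypothetical:

```scala
import java.nio.charset.Charset

object CopybookFieldsDemo {
  // Decode an EBCDIC-encoded PIC X(n) field into a JVM String.
  // Assumes the IBM037 charset is available in the running JVM.
  def decodeEbcdic(bytes: Array[Byte]): String =
    new String(bytes, Charset.forName("IBM037"))

  // Decode a COMP-3 (packed decimal) field: two digits per byte,
  // with the sign stored in the low nibble of the last byte
  // (0xD means negative; 0xC and 0xF mean positive).
  def decodePackedDecimal(bytes: Array[Byte]): Long = {
    val nibbles = bytes.flatMap(b => Array((b >> 4) & 0x0F, b & 0x0F))
    val digits  = nibbles.dropRight(1)
    val sign    = if (nibbles.last == 0x0D) -1L else 1L
    sign * digits.foldLeft(0L)((acc, d) => acc * 10 + d)
  }

  def main(args: Array[String]): Unit = {
    // "HELLO" encoded in EBCDIC
    println(decodeEbcdic(Array(0xC8, 0xC5, 0xD3, 0xD3, 0xD6).map(_.toByte)))
    // 12345 as a positive packed decimal (0x12 0x34 0x5C)
    println(decodePackedDecimal(Array(0x12, 0x34, 0x5C).map(_.toByte)))
  }
}
```

Cobrix performs this kind of decoding automatically, driven by the field types declared in the copybook.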

The goal of the project is to make the COBOL reader for Spark as generic as possible.

Setup

To use the COBOL file reader, just add this Maven dependency:

<dependency>
   <groupId>za.co.absa.cobrix</groupId>
   <artifactId>spark-cobol</artifactId>
   <version>0.1.6</version>
</dependency>
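If the project is built with sbt instead of Maven, the same coordinates can be expressed as follows (a sketch; note that the artifactId above carries no Scala-version suffix, so a single `%` is used rather than `%%`):

```scala
libraryDependencies += "za.co.absa.cobrix" % "spark-cobol" % "0.1.6"
```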

Example

Here is a simple example Spark application that loads a COBOL copybook and a data file, and then prints its schema and contents:

package za.co.absa.spark.app

import org.apache.spark.sql.{SaveMode, SparkSession}

object SmallSparkApplication {

  def main(args: Array[String]): Unit = {

    // Initializing a Spark session
    val spark = SparkSession
      .builder()
      .appName("Spark Cobol Application example 1")
      .master("local[*]")
      .getOrCreate()

    // Here, COBOL binary files located in 'data/test1_data' are read
    // using the 'data/test1_copybook.cob' copybook as their schema.
    // The result is a usual Spark dataframe.
    val df = spark
      .read
      .format("cobol")
      .option("copybook", "data/test1_copybook.cob")
      .load("data/test1_data")

    // Print Spark schema of the dataframe
    df.printSchema
    // Fetch and show first 20 records of the dataframe
    df.show

    // Fetch the first 2 records and print them in JSON format
    println("The first 2 records in JSON format:")
    df.toJSON.take(2).foreach(println)

    // Save (convert) the results to Parquet format
    df.write.mode(SaveMode.Overwrite)
      .parquet("data/output")
  }
}

Acknowledgment

TODO
