Skip to content

Examples: With Row Number & Undefined Rows

Quang Hoang Xuan edited this page Aug 21, 2021 · 4 revisions

Purpose: Load Excel files with row number and keeping undefined rows

With Row Number

Let use data file issue_285_bryce21.xlsx for this example

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

ROOT="/<please-change>/to/your-path"
schema = StructType([
    StructField("RowID", IntegerType(), True),
    StructField("1", StringType(), True),
    StructField("2", StringType(), True),
    StructField("3", StringType(), True)  
])
df = spark.read.format("excel") \
    .schema(schema) \
    .option("header", False) \
    .option("columnNameOfRowNumber", "RowID") \
    .load(f"{ROOT}/src/test/resources/spreadsheets/issue_285_bryce21.xlsx")
df.show()

Got

+-----+---------+----+----+
|RowID|        1|   2|   3|
+-----+---------+----+----+
|    0|File info|null|null|
|    1|     Info|Info|Info|
|    3| Metadata|null|null|
|    5|     null|   1|   2|
|    6|        A|   1|   2|
|    7|        B|   5|   6|
|    8|        C|   9|  10|
|   11| Metadata|null|null|
|   13|     null|   1|   2|
|   14|        A|   1|   2|
|   15|        B|   4|   5|
|   16|        C|   7|   8|
+-----+---------+----+----+

Notice: there is an extra RowID column, and value 2,4,9,10 are missing.

Keeping Undefined Rows

df = spark.read.format("excel") \
    .schema(schema) \
    .option("header", False) \
    .option("keepUndefinedRows", True) \
    .option("columnNameOfRowNumber", "RowID") \
    .load(f"{ROOT}/src/test/resources/spreadsheets/issue_285_bryce21.xlsx")
df.show()

Got

+-----+---------+----+----+
|RowID|        1|   2|   3|
+-----+---------+----+----+
|    0|File info|null|null|
|    1|     Info|Info|Info|
| null|     null|null|null|
|    3| Metadata|null|null|
| null|     null|null|null|
|    5|     null|   1|   2|
|    6|        A|   1|   2|
|    7|        B|   5|   6|
|    8|        C|   9|  10|
| null|     null|null|null|
| null|     null|null|null|
|   11| Metadata|null|null|
| null|     null|null|null|
|   13|     null|   1|   2|
|   14|        A|   1|   2|
|   15|        B|   4|   5|
|   16|        C|   7|   8|
+-----+---------+----+----+

There are rows for undefined Excel rows now. For undefined Excel row, its row-number is also null.

References