A library for querying Excel files with Apache Spark, for Spark SQL and DataFrames.
Due to personal and professional constraints, the development of this library has been rather slow. If you find value in this library, please consider stepping up as a co-maintainer by leaving a comment here. Help is very welcome e.g. in the following areas:
- Additional features
- Code improvements and reviews
- Bug analysis and fixing
- Documentation improvements
- Build / test infrastructure
This library requires Spark 2.0+.
List of spark versions, those are automatically tested:
spark: ["2.4.1", "2.4.7", "2.4.8", "3.0.1", "3.0.3", "3.1.1", "3.1.2", "3.2.1"]
For more detail, please refer to project CI: ci.yml
You can link against this library in your program at the following coordinates:
groupId: com.crealytics
artifactId: spark-excel_2.12
version: <spark-version>_0.17.0
groupId: com.crealytics
artifactId: spark-excel_2.11
version: <spark-version>_0.13.7
This package can be added to Spark using the --packages
command line option. For example, to include it when starting the spark shell:
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:<spark-version>_0.17.0
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.11:<spark-version>_0.13.7
- This package allows querying Excel spreadsheets as Spark DataFrames.
- From spark-excel 0.14.0 (August 24, 2021), there are two implementation of spark-excel
- Original Spark-Excel with Spark data source API 1.0
- Spark-Excel V2 with data source API V2.0+, which supports loading from multiple files, corrupted record handling and some improvement on handling data types. See below for further details
To use V2 implementation, just change your .format from .format("com.crealytics.spark.excel")
to .format("excel")
.
See below for some details
See the changelog for latest features, fixes etc.
Spark 2.0+:
import org.apache.spark.sql._
val spark: SparkSession = ???
val df = spark.read
.format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
.option("dataAddress", "'My Sheet'!B3:C35") // Optional, default: "A1"
.option("header", "true") // Required
.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
.option("setErrorCellsToFallbackValues", "true") // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
.option("usePlainNumberFormat", "false") // Optional, default: false, If true, format the cells without rounding and scientific notations
.option("inferSchema", "false") // Optional, default: false
.option("addColorColumns", "true") // Optional, default: false
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
.option("workbookPassword", "pass") // Optional, default None. Requires unlimited strength JCE for older JVMs
.schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.load("Worktime.xlsx")
For convenience, there is an implicit that wraps the DataFrameReader
returned by spark.read
and provides a .excel
method which accepts all possible options and provides default values:
import org.apache.spark.sql._
import com.crealytics.spark.excel._
val spark: SparkSession = ???
val df = spark.read.excel(
header = true, // Required
dataAddress = "'My Sheet'!B3:C35", // Optional, default: "A1"
treatEmptyValuesAsNulls = false, // Optional, default: true
setErrorCellsToFallbackValues = false, // Optional, default: false, where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column's data type.
usePlainNumberFormat = false, // Optional, default: false. If true, format the cells without rounding and scientific notations
inferSchema = false, // Optional, default: false
addColorColumns = true, // Optional, default: false
timestampFormat = "MM-dd-yyyy HH:mm:ss", // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
maxRowsInMemory = 20, // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)
excerptSize = 10, // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
workbookPassword = "pass" // Optional, default None. Requires unlimited strength JCE for older JVMs
).schema(myCustomSchema) // Optional, default: Either inferred schema, or all columns are Strings
.load("Worktime.xlsx")
If the sheet name is unavailable, it is possible to pass in an index:
val df = spark.read.excel(
header = true,
dataAddress = "0!B3:C35"
).load("Worktime.xlsx")
or to read in the names dynamically:
import com.crealytics.spark.excel.WorkbookReader
val sheetNames = WorkbookReader( Map("path" -> "Worktime.xlsx")
, spark.sparkContext.hadoopConfiguration
).sheetNames
val df = spark.read.excel(
header = true,
dataAddress = sheetNames(0)
)
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val peopleSchema = StructType(Array(
StructField("Name", StringType, nullable = false),
StructField("Age", DoubleType, nullable = false),
StructField("Occupation", StringType, nullable = false),
StructField("Date of birth", StringType, nullable = false)))
val spark: SparkSession = ???
val df = spark.read
.format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
.option("dataAddress", "'Info'!A1")
.option("header", "true")
.schema(peopleSchema)
.load("People.xlsx")
import org.apache.spark.sql._
val df: DataFrame = ???
df.write
.format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
.option("dataAddress", "'My Sheet'!B3:C35")
.option("header", "true")
.option("dateFormat", "yy-mmm-d") // Optional, default: yy-m-d h:mm
.option("timestampFormat", "mm-dd-yyyy hh:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss.000
.mode("append") // Optional, default: overwrite.
.save("Worktime2.xlsx")
As you can see in the examples above,
the location of data to read or write can be specified with the dataAddress
option.
Currently the following address styles are supported:
B3
: Start cell of the data. Reading will return all rows below and all columns to the right. Writing will start here and use as many columns and rows as required.B3:F35
: Cell range of data. Reading will return only rows and columns in the specified range. Writing will start in the first cell (B3
in this example) and use only the specified columns and rows. If there are more rows or columns in the DataFrame to write, they will be truncated. Make sure this is what you want.'My Sheet'!B3:F35
: Same as above, but with a specific sheet.MyTable[#All]
: Table of data. Reading will return all rows and columns in this table. Writing will only write within the current range of the table. No growing of the table will be performed. PRs to change this are welcome.
The V2 API offers you several improvements when it comes to file and folder handling. and works in a very similar way than data sources like csv and parquet.
To use V2 implementation, just change your .format from .format("com.crealytics.spark.excel")
to .format("excel")
The big difference is the fact that you provide a path to read / write data from/to and not an individual single file only:
dataFrame.write
.format("excel")
.save("some/path")
spark.read
.format("excel")
// ... insert excel read specific options you need
.load("some/path")
Because folders are supported you can read/write from/to a "partitioned" folder structure, just the same way as csv or parquet. Note that writing partitioned structures is only available for spark >=3.0.1
dataFrame.write
.partitionBy("col1")
.format("excel")
.save("some/path")
Need some more examples? Check out the test cases or have a look at our wiki
This library is built with SBT.
To build a JAR file simply run sbt assembly
from the project root.
To build for a specific spark version, for example spark-2.4.1, run sbt -Dspark.testVersion=2.4.1 assembly
,
also from the project root.
The build configuration includes support for Scala 2.12 and 2.11.