-
Notifications
You must be signed in to change notification settings - Fork 149
Examples: Load Multiple Files
Quang Hoang Xuan edited this page Aug 29, 2021
·
10 revisions
Purpose: Load multiple Excel files into single data frame
Spark Excel supports loading multiple excel files with glob pattern as well as Key=Value structured folder. For example, in test's resource folder, there is an example ca_dataset:
src/test/resources/spreadsheets/ca_dataset/
└── 2019
├── Quarter=1
│ └── ca_03.xlsx
├── Quarter=2
│ ├── ca_04.xlsx
│ ├── ca_05.xlsx
│ └── ca_06.xlsx
├── Quarter=3
│ ├── ca_07.xlsx
│ ├── ca_08.xlsx
│ └── ca_09.xlsx
└── Quarter=4
├── ca_10.xlsx
├── ca_11.xlsx
└── ca_12.xlsx
5 directories, 10 files
Note: Names and IDs in this ca_dataset are not real.
This dataset can be loaded with glob pattern
ROOT="/<please-change>/to/your-path"
df = spark.read.format("excel") \
.option("header", True) \
.option("inferSchema", True) \
.load(f"{ROOT}/src/test/resources/spreadsheets/ca_dataset/2019/*/*.xlsx")
With glob pattern, we can load subset of data by using a matching pattern
df = spark.read.format("excel") \
.option("header", True) \
.option("inferSchema", True) \
.load(f"{ROOT}/src/test/resources/spreadsheets/ca_dataset/2019/Quarter=4/*.xlsx")
Or, relies on known structure of Key=Value
df = spark.read.format("excel") \
.option("header", True) \
.option("inferSchema", True) \
.load(f"{ROOT}/src/test/resources/spreadsheets/ca_dataset/2019")
df.printSchema()
df.groupBy("Quarter").count().show()
Got
root
|-- Day: integer (nullable = true)
|-- Month: integer (nullable = true)
|-- Customer ID: string (nullable = true)
|-- Customer Name: string (nullable = true)
|-- Standard Package: integer (nullable = true)
|-- Extra Option 1: integer (nullable = true)
|-- Extra Option 2: integer (nullable = true)
|-- Extra Option 3: integer (nullable = true)
|-- Staff: string (nullable = true)
|-- Quarter: integer (nullable = true)
+-------+-----+
|Quarter|count|
+-------+-----+
| 1| 7|
| 3| 186|
| 4| 224|
| 2| 161|
+-------+-----+
With Quarter
column and its values are populated from directory structure.
- #74 Allow reading multiple files specified as a list OR by a pattern
- #97 Reading multiple files
- #185 Cannot read/load folder in spark
- #230 Can you support reading directories?
- #371 Unable to Partition while saving as Excel file
- #365 Add option for pathGlobFilter and partitioning
- #396 Not able to read multiple xlsx files together