Am I using the newest version of the library?
I have made sure that I'm using the latest version of the library.
Is there an existing issue for this?
I have searched the existing issues.
Current Behavior
We have an Excel file whose rows vary in length (excess values and missing values), so using PERMISSIVE mode and simply ignoring excess values while setting missing values to null is exactly what we want.
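For context, the read looks roughly along these lines (bucket, schema, and column names here are placeholders, not our actual job):

```scala
// Illustrative only: placeholder schema and path, not our actual job.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("excel-read").getOrCreate()

// Fixed schema with 4 columns; the sheet sometimes has more or fewer populated cells per row.
val schema = StructType(Seq("c1", "c2", "c3", "c4").map(StructField(_, StringType)))

val df = spark.read
  .format("excel")               // spark-excel v2 data source
  .option("header", "true")
  .option("mode", "PERMISSIVE")  // drop excess cells, fill missing ones with null
  .schema(schema)
  .load("s3://somebucket/somefile")
```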
On AWS EMR (6.14.0, Spark 3.4.1) we see inconsistent, seemingly random behavior. On some clusters it works; on others it fails with:
```
24/06/14 09:36:15 WARN TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6) (ip-10-107-10-248.eu-central-1.compute.internal executor 2): org.apache.spark.SparkException: Encountered error while reading file s3://somebucket/somefile Details:
	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:878)
	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:80)
	[...]
Caused by: java.lang.ClassCastException: scala.Some cannot be cast to [Ljava.lang.Object;
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:74)
	at com.crealytics.spark.excel.v2.ExcelParser$.$anonfun$parseIterator$2(ExcelParser.scala:432)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
	at org.apache.spark.sql.execution.datasources.v2.PartitionReaderFromIterator.next(PartitionReaderFromIterator.scala:26)
	at org.apache.spark.sql.execution.datasources.v2.PartitionReaderWithPartitionValues.next(PartitionReaderWithPartitionValues.scala:48)
	at org.apache.spark.sql.execution.datasources.v2.PartitionedFileReader.next(FilePartitionReaderFactory.scala:58)
	at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:65)
	... 49 more
```
Surprisingly, it does not matter whether mode is FAILFAST or PERMISSIVE.
The cluster configuration is always the same; the clusters only differ in hardware specs and number of workers.
On my Windows machine I have no problem when using PERMISSIVE, and with FAILFAST I get a helpful error message:
```
Caused by: org.apache.spark.SparkException: [MALFORMED_RECORD_IN_PARSING] Malformed records are detected in record parsing: [Merkmale,null,null,null]. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
```
Note: I am not using the official spark-excel package, because the fat-jar package does not work on EMR for various reasons. The source code is the same; only the bundled libraries and their versions differ.
Expected Behavior
I would expect AWS EMR to give me results comparable to a local installation. I have no real idea why we see such different behavior; maybe the code takes a different path because the work is distributed across multiple machines.
I root-caused the whole thing down to v2.ExcelParser. When I handle the bad-record case myself (instead of throwing the NonFatal exception) and assume we are running in PERMISSIVE mode, it works on EMR. So basically something like the sketch below.
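For illustration, here is a minimal, self-contained sketch of the kind of handling I mean. It is not the actual v2.ExcelParser code; the object, method names, and the per-cell conversion are made up for the example:

```scala
import scala.util.control.NonFatal

// Sketch only: "handle the bad record ourselves and behave as PERMISSIVE would"
// instead of letting the exception escape to Spark's FailureSafeParser.
object PermissiveRowSketch {

  /** Drop excess cells and pad missing ones with None (i.e. null) instead of
    * treating the length mismatch as a bad record. */
  def normalizeRow(cells: Seq[Option[String]], schemaWidth: Int): Seq[Option[String]] =
    cells.take(schemaWidth).padTo(schemaWidth, None)

  /** Parse one raw row; any per-cell failure becomes null rather than failing the task. */
  def parsePermissive(raw: Seq[String], schemaWidth: Int): Seq[Option[String]] =
    normalizeRow(
      raw.map { cell =>
        try Some(cell.trim) // stand-in for the real per-cell conversion
        catch { case NonFatal(_) => None }
      },
      schemaWidth
    )

  def main(args: Array[String]): Unit = {
    println(parsePermissive(Seq("Merkmale"), 4))              // short row: padded with None
    println(parsePermissive(Seq("a", "b", "c", "d", "e"), 4)) // long row: excess cells dropped
  }
}
```

The real change would of course live inside ExcelParser and operate on InternalRow, but the idea is the same: never let the bad-record path throw when we already know what PERMISSIVE should produce.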
I would like to port something along these lines back to spark-excel so that my source code does not differ and I do not have to worry about it (I still have to do my own packaging, but that is not a big deal). @nightscape Would you generally support such a change? If so, I would start working on a PR.
Any other/further thoughts on this issue?
Thanks
Christian
Steps To Reproduce
No response
Environment
AWS EMR 6.14.0 (Spark 3.4.1), clusters of varying size; local testing on a Windows machine.
Anything else?
No response