
Loading Excel with PERMISSIVE on EMR fails while it works locally (on Windows) #864

Closed

christianknoepfle opened this issue Jun 14, 2024 · 3 comments

Comments

@christianknoepfle
Contributor

Am I using the newest version of the library?

  • I have made sure that I'm using the latest version of the library.

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We have an Excel file whose rows vary in the number of populated cells (some rows have excess values, some have missing ones). Using PERMISSIVE mode, which ignores excess values and sets missing values to null, works fine for us.

On AWS EMR (6.14.0, Spark 3.4.1) we see inconsistent / seemingly random behavior. On some clusters it works, on others it fails with:

24/06/14 09:36:15 WARN TaskSetManager: Lost task 6.0 in stage 0.0 (TID 6) (ip-10-107-10-248.eu-central-1.compute.internal executor 2): org.apache.spark.SparkException: Encountered error while reading file s3://somebucket/somefile Details:
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotReadFilesError(QueryExecutionErrors.scala:878)
    at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:80)
    [...]
Caused by: java.lang.ClassCastException: scala.Some cannot be cast to [Ljava.lang.Object;
    at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:74)
    at com.crealytics.spark.excel.v2.ExcelParser$.$anonfun$parseIterator$2(ExcelParser.scala:432)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at org.apache.spark.sql.execution.datasources.v2.PartitionReaderFromIterator.next(PartitionReaderFromIterator.scala:26)
    at org.apache.spark.sql.execution.datasources.v2.PartitionReaderWithPartitionValues.next(PartitionReaderWithPartitionValues.scala:48)
    at org.apache.spark.sql.execution.datasources.v2.PartitionedFileReader.next(FilePartitionReaderFactory.scala:58)
    at org.apache.spark.sql.execution.datasources.v2.FilePartitionReader.next(FilePartitionReader.scala:65)
    ... 49 more
Surprisingly, it does not matter whether the mode is FAILFAST or PERMISSIVE.

The cluster config is always the same; the clusters differ only in hardware specs and number of workers.

On my Windows machine I have no problem when using PERMISSIVE, and with FAILFAST I get a helpful error message:

Caused by: org.apache.spark.SparkException: [MALFORMED_RECORD_IN_PARSING] Malformed records are detected in record parsing: [Merkmale,null,null,null]. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.

Note: I am not using the official spark-excel package, because the fat-jar packaging does not work on EMR for various reasons. The source code is the same; only the bundled libraries and their versions differ.

Expected Behavior

I would expect AWS EMR to give me results comparable to a local installation. I have no real idea why we see such different behavior; perhaps distribution across multiple machines causes the code to take a different path.

I root-caused the whole thing down to v2.ExcelParser. When I handle the bad-record case myself (instead of rethrowing the NonFatal exception) and assume we are running in PERMISSIVE mode, it works on EMR. So basically something like this:

[screenshot: patched v2.ExcelParser snippet]
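For readers without the screenshot, here is a minimal, self-contained sketch of the idea (names like `parseRow` and the `ParseMode` objects are illustrative, not the actual spark-excel code): the parser handles the bad record itself when running in PERMISSIVE mode, instead of rethrowing and leaving recovery to Spark's FailureSafeParser.

```scala
import scala.util.control.NonFatal

// Illustrative stand-ins for Spark's parse modes.
sealed trait ParseMode
case object Permissive extends ParseMode
case object FailFast extends ParseMode

// Shape a raw row of cells to the schema width `width`:
// drop excess cells, pad missing trailing cells with null.
// On any non-fatal parse error, PERMISSIVE yields an all-null row
// instead of rethrowing; FAILFAST rethrows.
def parseRow(cells: Seq[String], width: Int, mode: ParseMode): Seq[Any] =
  try {
    cells.take(width).padTo(width, null: Any)
  } catch {
    case NonFatal(e) =>
      mode match {
        case Permissive => Seq.fill[Any](width)(null) // bad record -> null row
        case FailFast   => throw e
      }
  }

// Excess values are ignored, missing values become null:
println(parseRow(Seq("a", "b", "c", "d"), 3, Permissive)) // List(a, b, c)
println(parseRow(Seq("a"), 3, Permissive))                // List(a, null, null)
```

This only sketches the structure of the workaround; the real fix would live inside ExcelParser's record-handling path.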

I would like to port something along these lines back to spark-excel so that my source code does not diverge and I do not have to worry about it (I still have to do my own packaging, but that is not such a big deal). @nightscape, would you generally support such a change? If so, I would start working on a PR.

Any other/further thoughts on this issue?

Thanks

Christian

Steps To Reproduce

No response

Environment

- Spark version:
- Spark-Excel version:
- OS:
- Cluster environment:

Anything else?

No response

@christianknoepfle
Contributor Author

Ah, I found #808, but IMO AWS EMR uses plain Spark. Also, I was not on the latest code :( I will check again and let you know the results.

@christianknoepfle
Contributor Author

It was my fault. Sorry for bothering you.

@nightscape
Owner

No worries @christianknoepfle 😃
