[GLUTEN-6067][VL] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to ColumnarWriteFilesExec #6403

Merged
merged 3 commits into from
Jul 19, 2024

@baibaichen baibaichen commented Jul 11, 2024

What changes were proposed in this pull request?

(Fixes: #6067)

This PR refactors the Velox-side code: it renames VeloxColumnarWriteFilesExec to GlutenColumnarWriteFilesExec and moves it to gluten-core, so that the ClickHouse backend can use the same SparkPlan in a follow-up PR.

With Spark 3.4 support, Velox gained a whole-stage native write pipeline, which is better than the old implementation. The ClickHouse backend also adopts this implementation.

Major change 1

The only major difference between Velox and ClickHouse is how native metrics are parsed. To handle this, I introduce a new trait called BackendWrite, which has only one member for now. Once the native write pipeline has completed, we obtain an instance via BackendsApiManager.getSparkPlanExecApiInstance.createBackendWrite. Please see VeloxBackendWrite for details.

trait BackendWrite {
  def collectNativeWriteFilesMetrics(batch: ColumnarBatch): Option[WriteTaskResult]
}
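
As a rough illustration of how a backend might plug into this trait, here is a self-contained sketch. It is not the actual VeloxBackendWrite code: the ColumnarBatch and WriteTaskResult case classes below are simplified, hypothetical stand-ins for the Spark types, and DemoBackendWrite is an invented name.

```scala
// Simplified, hypothetical stand-ins for Spark's ColumnarBatch and WriteTaskResult.
final case class ColumnarBatch(numRows: Int, files: Seq[String], rowsWritten: Long)
final case class WriteTaskResult(files: Seq[String], numRows: Long)

trait BackendWrite {
  def collectNativeWriteFilesMetrics(batch: ColumnarBatch): Option[WriteTaskResult]
}

// Hypothetical backend: returns None when the native writer reported no metrics rows,
// otherwise converts the metrics batch into a task-level result.
object DemoBackendWrite extends BackendWrite {
  override def collectNativeWriteFilesMetrics(batch: ColumnarBatch): Option[WriteTaskResult] =
    if (batch.numRows == 0) None
    else Some(WriteTaskResult(batch.files, batch.rowsWritten))
}

object BackendWriteDemo {
  def main(args: Array[String]): Unit = {
    val metrics = ColumnarBatch(numRows = 1, files = Seq("part-00000-demo.parquet"), rowsWritten = 42L)
    println(DemoBackendWrite.collectNativeWriteFilesMetrics(metrics))
  }
}
```

Returning an Option lets the shared exec node treat "the backend produced no metrics" as a distinct case without knowing how each backend encodes its metrics batch.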

Minor change 2

The other minor difference is that the ClickHouse backend doesn't generate file names. To compute the file name per task, it uses HadoopMapReduceCommitProtocol::getFilename and then injects the result into the backend. This is OK because Velox doesn't support maxRecordsPerFile (see #4329), and the ClickHouse backend follows the same restriction, which means one task produces only one file and no further injections are needed.
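
The "one file name per task, injected once" idea can be sketched as follows. This is only an illustration under stated assumptions: fileNameFor imitates a Hadoop-style part-file naming pattern rather than calling the real getFilename, and WriteTarget and injectOnce are hypothetical names, not Gluten APIs.

```scala
object TaskFileName {
  // Assumed naming pattern for illustration; the real name comes from the commit protocol.
  def fileNameFor(taskId: Int, jobId: String, ext: String): String =
    f"part-$taskId%05d-$jobId$ext"

  // Hypothetical stand-in for what gets handed to the native writer.
  final case class WriteTarget(stagingDir: String, fileName: String)

  // Because one task writes exactly one file (no maxRecordsPerFile splitting),
  // the name is computed and injected a single time per task.
  def injectOnce(stagingDir: String, taskId: Int, jobId: String, ext: String): WriteTarget =
    WriteTarget(stagingDir, fileNameFor(taskId, jobId, ext))
}
```

If maxRecordsPerFile were supported, a task could roll over to additional files and would need a fresh name for each one, which is why the single injection relies on that limitation.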

Improvement

I also pass the file format to the backend.

How was this patch tested?

Using existing UTs.

#6067

Run Gluten Clickhouse CI

@baibaichen baibaichen changed the title [GLUTEN-6067][CH] [Part 3] [WIP] [GLUTEN-6067][CH] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to GlutenColumnarWriteFilesExec Jul 18, 2024
@baibaichen baibaichen changed the title [GLUTEN-6067][CH] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to GlutenColumnarWriteFilesExec [GLUTEN-6067][VL] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to GlutenColumnarWriteFilesExec Jul 18, 2024

/**
* This RDD is used to make sure we have injected staging write path before initializing the native
* plan, and support Spark file commit protocol.
*/
-class VeloxColumnarWriteFilesRDD(
+class GlutenColumnarWriteFilesRDD(

After moving VeloxColumnarWriteFilesExec from backend-velox to gluten-core, can we update the class names by renaming GlutenColumnarWriteFilesExec to ColumnarWriteFilesExec and GlutenColumnarWriteFilesRDD to ColumnarWriteFilesRDD?

…nd move it to gluten-core

1. Return GlutenColumnarWriteFilesExec at SparkPlanExecApi
2. Move SparkWriteFilesCommitProtocol to gluten-core
3. SparkWriteFilesCommitProtocol supports getFilename from the internal committer
4. Remove supportTransformWriteFiles from BackendSettingsApi
5. injectWriteFilesTempPath with fileName
…tenColumnarWriteFilesRDD to ColumnarWriteFilesRDD

@baibaichen baibaichen changed the title [GLUTEN-6067][VL] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to GlutenColumnarWriteFilesExec [GLUTEN-6067][VL] [Part 3-1] Refactor: Rename VeloxColumnarWriteFilesExec to ColumnarWriteFilesExec Jul 19, 2024
@baibaichen baibaichen merged commit 206e4be into apache:main Jul 19, 2024
42 checks passed
@baibaichen baibaichen deleted the feature/native_write branch July 19, 2024 08:46
Development

Successfully merging this pull request may close these issues.

[CH] Support CH backend with Spark 3.5.x
3 participants