docs: BigQuery data plane description (#182)
* docs: BigQuery data plane description
* chore: DEPENDENCIES
* docs: added explanation about BigQuery named parameters and about the use of a single Part in the sink
* chore: DEPENDENCIES
Showing 2 changed files with 63 additions and 13 deletions.
# BigQuery Data Plane

This document describes the first implementation of the BigQuery data plane.

The BigQuery data plane is implemented as a pipeline by the `BigQueryDataSource` and `BigQueryDataSink` classes.
The data source, identified by a `BigQueryDataAddress`, expects an asset defined by a query statement that, when executed on a BigQuery table, returns the data to be transferred.
The data sink, also identified by a `BigQueryDataAddress`, receives the data from the source and transfers it unchanged to a BigQuery destination table.

## Data source

The data source supports queries with [named parameters](https://cloud.google.com/bigquery/docs/parameterized-queries), which are similar to SQL query parameters. To set the values of the named parameters, the `BigQueryRequestParams` generated by `BigQueryRequestParamsProvider.provideSourceParams` (invoked from `BigQueryDataSourceFactory.createSource`) also includes the sink address passed when the transfer process is started by the consumer. The sink address must provide the values of the parameters used in the query in the form:
```
"dataDestination": {
    "type": "BigQueryData",
    "project": "consumerProject",
    "dataset": "destinationDataset",
    "table": "destinationTable",
    "@TYPE_parameterName": "parameterValue"
}
```
The TYPE is separated from the parameter name found in the query by an underscore, and can take any of the values listed in [StandardSQLTypeName](https://cloud.google.com/java/docs/reference/google-cloud-bigquery/latest/com.google.cloud.bigquery.StandardSQLTypeName).
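
For example, assuming a source asset whose query is `SELECT name, age FROM users WHERE age >= @minimumAge` (the query, table and parameter names here are purely illustrative), the consumer would pass the parameter value in the sink address as:
```
"dataDestination": {
    "type": "BigQueryData",
    "project": "consumerProject",
    "dataset": "destinationDataset",
    "table": "destinationTable",
    "@INT64_minimumAge": "18"
}
```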

When `BigQueryDataSource.openPartStream` is invoked, the data source creates the job to execute the query, a `PipedOutputStream` and a connected `PipedInputStream`:
- the output stream is passed to a newly created thread and used to write the fetched rows as JSON entries: the rows are grouped in pages (paginated results) and sent as JSON arrays;
- the input stream is passed to a single `BigQueryPart` that is immediately returned.

The `openPartStream` method returns immediately, while the created thread fetches the results and streams them as JSON data to the sink; a minimal sketch of this setup follows below.
The use of a single Part is driven by simplicity: using multiple parts would require getting the total number of returned rows / pages, creating the parts, each with its own output stream, and connecting each output to an input before returning.
On top of that, the thread would need to maintain the list of output stream objects and use each one a page at a time.
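
The following is a minimal, self-contained sketch of this piped-stream pattern, not the actual `BigQueryDataSource` code; the `pagesAsJson` input stands in for the pages fetched from BigQuery:
```
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class PipedStreamSketch {

    // Returns the input stream that a single Part could wrap, while a
    // background thread writes one JSON array per result page to the
    // connected output stream and closes it when done.
    public static PipedInputStream openStream(List<String> pagesAsJson) throws IOException {
        var output = new PipedOutputStream();
        var input = new PipedInputStream(output);
        new Thread(() -> {
            try (output) {
                for (var page : pagesAsJson) {
                    output.write(page.getBytes(StandardCharsets.UTF_8));
                }
            } catch (IOException e) {
                // the real data source records the error on the part (see below)
            }
        }).start();
        return input;
    }
}
```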

## Data source error handling

The data source handles exceptions both in the main body of the `openPartStream` method and within the started thread:
- if an exception occurs in `openPartStream`, the method returns `StreamResult.error`;
- if an exception occurs while fetching the rows in the thread, the output stream is closed, and the `BigQueryPart` object handed over to the sink is given the exception via `BigQueryPart.setException` (see the sketch after this list).
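
The second case could be shaped as follows; this is a hypothetical sketch in which `RowWriter` stands in for the real fetch-and-write loop and `onError` for `BigQueryPart.setException`:
```
import java.io.PipedOutputStream;
import java.util.function.Consumer;

public class FetchThreadSketch {

    // Stand-in for the real row-fetching loop, which may fail mid-transfer.
    @FunctionalInterface
    public interface RowWriter {
        void write(PipedOutputStream output) throws Exception;
    }

    // On failure, the try-with-resources closes the pipe (unblocking the sink)
    // before the catch runs; the exception is then reported through onError,
    // the stand-in for BigQueryPart.setException.
    public static Thread startFetchThread(PipedOutputStream output,
                                          RowWriter rowWriter,
                                          Consumer<Exception> onError) {
        var thread = new Thread(() -> {
            try (output) {
                rowWriter.write(output);
            } catch (Exception exception) {
                onError.accept(exception);
            }
        });
        thread.start();
        return thread;
    }
}
```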

## Data sink

The data sink receives the single `BigQueryPart` returned by the source and starts reading the JSON entries representing the result rows, serialized by the thread started by the source itself. Once a result page is parsed, a `JSONArray` object is created from it and passed to a `JsonStreamWriter` object, using the [BigQuery Storage Write API](https://cloud.google.com/bigquery/docs/write-api-streaming).
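
A minimal sketch of appending one parsed page with `JsonStreamWriter`, assuming the default write stream of the destination table (this follows the documented client usage, not the actual sink code):
```
import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableName;
import org.json.JSONArray;

public class AppendPageSketch {

    // Appends one page of rows, already parsed into a JSONArray, to the
    // default write stream of the destination table.
    public static void appendPage(BigQueryWriteClient client, String project,
                                  String dataset, String table, JSONArray page)
            throws Exception {
        var parentTable = TableName.of(project, dataset, table).toString();
        try (JsonStreamWriter writer = JsonStreamWriter.newBuilder(parentTable, client).build()) {
            ApiFuture<AppendRowsResponse> future = writer.append(page);
            future.get(); // wait for the append to be acknowledged
        }
    }
}
```
In a real transfer a single writer would be reused across all pages; one writer per append is shown here only for brevity.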

The reading continues until the thread created by the source closes the corresponding output stream. When the pipe is closed, the sink checks the `BigQueryPart` to verify whether the stream was closed due to an error, by retrieving the stored exception with `BigQueryPart.getException`.

## Data sink error handling

If an error occurs while appending the data to the destination table, or an exception is found in the transferred `BigQueryPart`, `BigQueryDataSink.transferParts` returns `StreamResult.failure`, as sketched below.
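
A rough, hypothetical sketch of this sink-side flow; `Part`, `PageAppender` and `parseNextPage` are illustrative stand-ins, and the real method returns a `StreamResult` rather than a boolean:
```
import java.io.IOException;
import java.io.InputStream;
import org.json.JSONArray;

public class SinkFlowSketch {

    // Stand-in for BigQueryPart: an input stream plus the exception
    // possibly recorded by the source's fetching thread.
    public interface Part {
        InputStream openStream();
        Exception getException();
    }

    // Stand-in for the append step shown in the previous sketch.
    @FunctionalInterface
    public interface PageAppender {
        void append(JSONArray page) throws Exception;
    }

    // Reads pages until the source closes the pipe, appends each page, then
    // checks whether the pipe was closed because of a source-side error.
    // Returns false where the real transferParts returns StreamResult.failure.
    public static boolean transfer(Part part, PageAppender appender) {
        try (InputStream input = part.openStream()) {
            JSONArray page;
            while ((page = parseNextPage(input)) != null) {
                appender.append(page); // may fail while writing to BigQuery
            }
        } catch (Exception readOrAppendError) {
            return false;
        }
        return part.getException() == null;
    }

    // Hypothetical parser: would return the next JSON array from the stream,
    // or null once the pipe has been closed by the source (details omitted).
    private static JSONArray parseNextPage(InputStream input) throws IOException {
        return null;
    }
}
```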