Release 2021-01-07 (GCS 2.2.0, BQ 1.2.0) · GoogleCloudDataproc/hadoop-connectors

Changelog

Cloud Storage connector:

Delete deprecated methods.
Update all dependencies to latest versions.

Add support for Cloud Storage objects CSEK encryption:

fs.gs.encryption.algorithm (not set by default)
fs.gs.encryption.key (not set by default)
fs.gs.encryption.key.hash (not set by default)
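
As an illustration of how the three CSEK values relate to each other, the following Python sketch derives them from a raw key. It assumes the usual Cloud Storage CSEK conventions (AES256 as the algorithm, a Base64-encoded 256-bit key, and a Base64-encoded SHA-256 hash of the raw key). The helper name `csek_properties` is hypothetical and not part of the connector.

```python
import base64
import hashlib
import os


def csek_properties(raw_key: bytes) -> dict:
    """Build the three fs.gs.encryption.* settings from a raw 32-byte AES key."""
    if len(raw_key) != 32:
        raise ValueError("CSEK requires a 256-bit (32-byte) key")
    key_hash = hashlib.sha256(raw_key).digest()
    return {
        # Assumption: AES256 is the algorithm value expected for CSEK.
        "fs.gs.encryption.algorithm": "AES256",
        "fs.gs.encryption.key": base64.b64encode(raw_key).decode("ascii"),
        "fs.gs.encryption.key.hash": base64.b64encode(key_hash).decode("ascii"),
    }


if __name__ == "__main__":
    # Generate a fresh random key and print the resulting settings.
    for name, value in csek_properties(os.urandom(32)).items():
        print(f"{name}={value}")
```

The key must be stored safely by the caller: objects written with it cannot be read without it.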

Add a property to override storage service path:

fs.gs.storage.service.path (default: `storage/v1/`)

Added a new output stream type which can be used by setting:
```
fs.gs.outputstream.type=FLUSHABLE_COMPOSITE
```
The FLUSHABLE_COMPOSITE output stream type behaves similarly to the SYNCABLE_COMPOSITE type, except that it also supports hflush(), which uses the same implementation as hsync() in the SYNCABLE_COMPOSITE output stream type.
Added a new output stream parameter
```
fs.gs.outputstream.sync.min.interval.ms (default: 0)
```
to configure the minimum time interval (in milliseconds) between consecutive syncs. This avoids being rate limited by GCS. The default is 0, meaning no wait between syncs. When rate limited, hsync() blocks until a permit is available, whereas hflush() does nothing and returns.
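
The difference between the blocking and the non-blocking behavior described above can be sketched in Python as follows. This is a simplified model, not the connector's implementation: the class `MinIntervalLimiter` and its methods are hypothetical, and the clock and sleep functions are injectable only to make the sketch easy to test.

```python
import time


class MinIntervalLimiter:
    """Permits at most one sync per `min_interval_ms` (hypothetical model)."""

    def __init__(self, min_interval_ms, clock=time.monotonic, sleep=time.sleep):
        self._interval = min_interval_ms / 1000.0
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()

    def acquire(self):
        """hsync()-like: block until the minimum interval has elapsed."""
        wait = self._next_allowed - self._clock()
        if wait > 0:
            self._sleep(wait)
        self._next_allowed = self._clock() + self._interval

    def try_acquire(self):
        """hflush()-like: return False immediately instead of waiting."""
        now = self._clock()
        if now < self._next_allowed:
            return False
        self._next_allowed = now + self._interval
        return True


if __name__ == "__main__":
    limiter = MinIntervalLimiter(min_interval_ms=200)
    print(limiter.try_acquire())  # first call is permitted
    print(limiter.try_acquire())  # called again right away, so it is skipped
    limiter.acquire()             # waits for the rest of the interval
```

With an interval of 0 (the default), both variants always proceed without waiting.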
Added a new parameter to configure output stream pipe type:
```
fs.gs.outputstream.pipe.type (default: IO_STREAM_PIPE)
```
Valid values are NIO_CHANNEL_PIPE and IO_STREAM_PIPE.

When this property is set to NIO_CHANNEL_PIPE, the output stream uses a Java NIO Pipe, which allows writing to the output stream reliably from multiple threads without "Pipe broken" exceptions.

Note that when using the NIO_CHANNEL_PIPE option, maximum upload throughput can decrease by 10%.
Add a property to impersonate a service account:
```
fs.gs.auth.impersonation.service.account (not set by default)
```
If this property is set, an access token will be generated for this service account to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.
Throw ClosedChannelException in GoogleHadoopOutputStream.write methods if stream already closed. This fixes Spark Streaming jobs checkpointing to Cloud Storage.
Add properties to impersonate a service account through user or group name:
```
fs.gs.auth.impersonation.service.account.for.user.<USER_NAME> (not set by default)
fs.gs.auth.impersonation.service.account.for.group.<GROUP_NAME> (not set by default)
```
If any of these properties are set, an access token will be generated for the service account associated with the specified user name or group name in order to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.
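
To show how the impersonation property names are assembled, here is a small Python sketch that builds them from user and group mappings and renders them in the standard Hadoop `core-site.xml` layout. The helper names, the user `alice`, the group `analysts`, and the service account addresses are all hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

BASE = "fs.gs.auth.impersonation.service.account"


def impersonation_properties(default_account=None, by_user=None, by_group=None):
    """Map impersonation settings to their fs.gs.auth.impersonation.* names."""
    props = {}
    if default_account:
        props[BASE] = default_account
    for user, account in (by_user or {}).items():
        props[f"{BASE}.for.user.{user}"] = account
    for group, account in (by_group or {}).items():
        props[f"{BASE}.for.group.{group}"] = account
    return props


def to_core_site_xml(props):
    """Render properties as a Hadoop <configuration> XML document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # All names and accounts below are made-up examples.
    props = impersonation_properties(
        by_user={"alice": "etl@example-project.iam.gserviceaccount.com"},
        by_group={"analysts": "readonly@example-project.iam.gserviceaccount.com"},
    )
    print(to_core_site_xml(props))
```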
Fix complex patterns globbing.
Added support for an authorization handler for Cloud Storage requests. This feature is configurable through the properties:
```
fs.gs.authorization.handler.impl=<FULLY_QUALIFIED_AUTHORIZATION_HANDLER_CLASS>
fs.gs.authorization.handler.properties.<AUTHORIZATION_HANDLER_PROPERTY>=<VALUE>
```
If the fs.gs.authorization.handler.impl property is set, the specified authorization handler will be used to authorize Cloud Storage API requests before executing them. The handler will throw AccessDeniedException for rejected requests, that is, when the user is not authorized to execute them.

All properties with the fs.gs.authorization.handler.properties. prefix are passed to an instance of the configured authorization handler class after instantiation and before any of its Cloud Storage request handling methods are called.
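
The flow above (collect the prefixed properties, hand them to the handler, then authorize each request) can be modeled roughly as follows. This Python sketch is not the connector's Java interface: `AllowListHandler`, its methods, and the `allowed.buckets` setting are hypothetical, stripping the prefix from the keys is an assumption, and `PermissionError` stands in for AccessDeniedException.

```python
HANDLER_PROPS_PREFIX = "fs.gs.authorization.handler.properties."


def handler_properties(conf):
    """Collect handler settings, keyed by the name that follows the prefix."""
    return {
        key[len(HANDLER_PROPS_PREFIX):]: value
        for key, value in conf.items()
        if key.startswith(HANDLER_PROPS_PREFIX)
    }


class AllowListHandler:
    """Hypothetical handler that permits reads only from listed buckets."""

    def set_properties(self, props):
        # Called once after instantiation, before any request is handled.
        names = props.get("allowed.buckets", "").split(",")
        self._allowed = {name for name in names if name}

    def handle_read(self, bucket):
        if bucket not in self._allowed:
            # Stand-in for the AccessDeniedException mentioned above.
            raise PermissionError(f"read access to bucket {bucket!r} denied")


if __name__ == "__main__":
    conf = {
        "fs.gs.authorization.handler.properties.allowed.buckets": "logs,reports",
        "fs.gs.glob.algorithm": "CONCURRENT",  # unrelated, so it is ignored
    }
    handler = AllowListHandler()
    handler.set_properties(handler_properties(conf))
    handler.handle_read("logs")  # permitted, returns silently
```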
Set default value for fs.gs.status.parallel.enable property to true.
Tune exponential backoff configuration for Cloud Storage requests.
Increment Hadoop FileSystem.Statistics counters for read and write operations.
Always infer implicit directories and remove fs.gs.implicit.dir.infer.enable property.
Replace two glob-related properties (fs.gs.glob.flatlist.enable and fs.gs.glob.concurrent.enable) with a single property to configure the glob search algorithm:
```
fs.gs.glob.algorithm (default: CONCURRENT)
```
Do not create parent directory objects (this includes buckets) when creating a new file or directory; rely on implicit directory inference instead.
Use default logging backend for Google Flogger instead of Slf4j.
Add FsBenchmark tool for benchmarking HCFS.
Remove obsolete fs.gs.inputstream.buffer.size property and related functionality.
Fix unauthenticated access support (fs.gs.auth.null.enable=true).
Improve cache hit ratio when fs.gs.performance.cache.enable property is set to true.

Remove obsolete configuration properties and related functionality:

fs.gs.auth.client.id
fs.gs.auth.client.file
fs.gs.auth.client.secret

Add a property that allows disabling HCFS semantic enforcement. If set to false, the GCS connector will not check whether a directory with the same name already exists when creating a new file, and vice versa.
```
fs.gs.create.items.conflict.check.enable (default: true)
```
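
The check that this property toggles can be pictured with a small in-memory model. The Python sketch below is only an approximation of the semantics, under the assumptions that directories are represented by names ending in `/` and that a directory also exists implicitly whenever some object name starts with its prefix. The function names are hypothetical.

```python
def _directory_exists(objects, dir_name):
    """A directory exists explicitly or implicitly via any object below it."""
    return any(name.startswith(dir_name) for name in objects)


def create_file(objects, path, conflict_check=True):
    """Add a file object, optionally rejecting a same-named directory."""
    if conflict_check and _directory_exists(objects, path + "/"):
        raise FileExistsError(f"a directory named {path!r} already exists")
    objects.add(path)


def create_directory(objects, path, conflict_check=True):
    """Add a directory object, optionally rejecting a same-named file."""
    name = path.rstrip("/")
    if conflict_check and name in objects:
        raise FileExistsError(f"a file named {name!r} already exists")
    objects.add(name + "/")


if __name__ == "__main__":
    store = {"data/part-0000"}  # implies the directory "data/"
    try:
        create_file(store, "data")  # check enabled: conflict is detected
    except FileExistsError as err:
        print(err)
    create_file(store, "data", conflict_check=False)  # check disabled: allowed
    print(sorted(store))
```

Skipping the check avoids the extra existence lookup, at the cost of allowing a file and a directory with the same name to coexist.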

Remove redundant properties:

fs.gs.config.override.file
fs.gs.copy.batch.threads
fs.gs.copy.max.requests.per.batch

Change the default value of the fs.gs.inputstream.min.range.request.size property from 524288 (512 KiB) to 2097152 (2 MiB).

BigQuery connector:

Update all dependencies to latest versions.
Fix BigQuery job status retrieval in non-US locations.
Use default logging backend for Google Flogger instead of Slf4j.
Remove unused mapred.bq.output.buffer.size configuration property.
Fix unauthenticated access support (mapred.bq.auth.null.enable=true).

Remove obsolete configuration properties and related functionality:

mapred.bq.auth.client.id
mapred.bq.auth.client.file
mapred.bq.auth.client.secret
