2021-01-07 (GCS 2.2.0, BQ 1.2.0)
Changelog
Cloud Storage connector:
-
Delete deprecated methods.
-
Update all dependencies to latest versions.
-
Add support for Cloud Storage objects CSEK encryption:
fs.gs.encryption.algorithm (not set by default)
fs.gs.encryption.key (not set by default)
fs.gs.encryption.key.hash (not set by default)
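As a rough illustration of how values for the three encryption properties could be prepared, the sketch below derives them from a raw 32-byte key. It assumes the usual Cloud Storage CSEK conventions (algorithm AES256, Base64-encoded key, Base64-encoded SHA-256 hash of the raw key); the class and method names are hypothetical and are not part of the connector.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the connector.
final class CsekPropertiesSketch {

  /** Builds the three CSEK property values from a raw 32-byte AES-256 key. */
  static Map<String, String> csekProperties(byte[] rawKey) {
    if (rawKey.length != 32) {
      throw new IllegalArgumentException("An AES-256 key must be exactly 32 bytes");
    }
    byte[] keyHash;
    try {
      // Assumption: the hash is SHA-256 over the raw key bytes, not over the Base64 text.
      keyHash = MessageDigest.getInstance("SHA-256").digest(rawKey);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 is required on every Java platform", e);
    }
    Base64.Encoder base64 = Base64.getEncoder();

    Map<String, String> props = new LinkedHashMap<>();
    props.put("fs.gs.encryption.algorithm", "AES256");
    props.put("fs.gs.encryption.key", base64.encodeToString(rawKey));
    props.put("fs.gs.encryption.key.hash", base64.encodeToString(keyHash));
    return props;
  }

  public static void main(String[] args) {
    byte[] key = new byte[32];
    new SecureRandom().nextBytes(key);
    // Printing a key is for demonstration only; keep real keys out of logs.
    csekProperties(key).forEach((name, value) -> System.out.println(name + "=" + value));
  }
}
```

The resulting name/value pairs can then be placed in the Hadoop configuration in the usual way.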
-
Add a property to override storage service path:
fs.gs.storage.service.path (default: `storage/v1/`)
-
Added a new output stream type which can be used by setting:
fs.gs.outputstream.type=FLUSHABLE_COMPOSITE
The FLUSHABLE_COMPOSITE output stream type behaves similarly to the SYNCABLE_COMPOSITE type, except it also supports hflush(), which uses the same implementation as hsync() in the SYNCABLE_COMPOSITE output stream type.
-
Added a new output stream parameter
fs.gs.outputstream.sync.min.interval.ms (default: 0)
to configure the minimum time interval (milliseconds) between consecutive syncs. This is to avoid getting rate limited by GCS. The default is 0 (no wait between syncs). When rate limited, hsync() blocks until a permit is available, whereas hflush() does nothing and returns.
-
Added a new parameter to configure output stream pipe type:
fs.gs.outputstream.pipe.type (default: IO_STREAM_PIPE)
Valid values are NIO_CHANNEL_PIPE and IO_STREAM_PIPE.
When the property is set to NIO_CHANNEL_PIPE, the output stream uses a Java NIO Pipe, which allows writing to the output stream reliably from multiple threads without "Pipe broken" exceptions.
Note that with the NIO_CHANNEL_PIPE option maximum upload throughput can decrease by 10%.
-
Add a property to impersonate a service account:
fs.gs.auth.impersonation.service.account (not set by default)
If this property is set, an access token will be generated for this service account to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.
-
Throw ClosedChannelException in GoogleHadoopOutputStream.write methods if the stream is already closed. This fixes Spark Streaming jobs checkpointing to Cloud Storage.
-
Add properties to impersonate a service account through user or group name:
fs.gs.auth.impersonation.service.account.for.user.<USER_NAME> (not set by default)
fs.gs.auth.impersonation.service.account.for.group.<GROUP_NAME> (not set by default)
If any of these properties are set, an access token will be generated for the service account associated with the specified user name or group name in order to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.
-
Fix complex patterns globbing.
-
Added support for an authorization handler for Cloud Storage requests. This feature is configurable through the properties:
fs.gs.authorization.handler.impl=<FULLY_QUALIFIED_AUTHORIZATION_HANDLER_CLASS>
fs.gs.authorization.handler.properties.<AUTHORIZATION_HANDLER_PROPERTY>=<VALUE>
If the fs.gs.authorization.handler.impl property is set, the specified authorization handler will be used to authorize Cloud Storage API requests before executing them. The handler will throw AccessDeniedException for rejected requests if the user does not have enough permissions (is not authorized) to execute these requests.
All properties with the fs.gs.authorization.handler.properties. prefix are passed to an instance of the configured authorization handler class after instantiation and before any Cloud Storage request handling methods are called.
-
Set default value for fs.gs.status.parallel.enable property to true.
-
Tune exponential backoff configuration for Cloud Storage requests.
-
Increment Hadoop FileSystem.Statistics counters for read and write operations.
-
Always infer implicit directories and remove fs.gs.implicit.dir.infer.enable property.
-
Replace 2 glob-related properties (fs.gs.glob.flatlist.enable and fs.gs.glob.concurrent.enable) with a single property to configure glob search algorithm:
fs.gs.glob.algorithm (default: CONCURRENT)
-
Do not create the parent directory objects (this includes buckets) when creating a new file or a directory, instead rely on the implicit directory inference.
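To illustrate the idea behind implicit directory inference: a directory can be treated as existing when at least one object name starts with the directory's path prefix, even though no placeholder object was written for the directory itself. The sketch below shows this over an in-memory sorted set of object names. The class and method names are hypothetical, and this is only a conceptual model, not the connector's implementation.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical conceptual model, not the connector's code.
final class ImplicitDirectorySketch {

  /**
   * Returns true if dirPath has no placeholder object of its own ("dir/")
   * but at least one object name starts with "dir/".
   */
  static boolean isImplicitDirectory(NavigableSet<String> objectNames, String dirPath) {
    String prefix = dirPath.endsWith("/") ? dirPath : dirPath + "/";
    if (objectNames.contains(prefix)) {
      return false; // an explicit directory placeholder object exists
    }
    // Names sharing the prefix sort directly after the prefix itself,
    // so checking the first name at or after it is sufficient.
    String next = objectNames.ceiling(prefix);
    return next != null && next.startsWith(prefix);
  }

  public static void main(String[] args) {
    NavigableSet<String> objects =
        new TreeSet<>(Arrays.asList("data/2021/part-0000", "logs/"));
    System.out.println(isImplicitDirectory(objects, "data"));      // true
    System.out.println(isImplicitDirectory(objects, "data/2021")); // true
    System.out.println(isImplicitDirectory(objects, "logs"));      // false: explicit placeholder
    System.out.println(isImplicitDirectory(objects, "tmp"));       // false: nothing under tmp/
  }
}
```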
-
Use default logging backend for Google Flogger instead of Slf4j.
-
Add FsBenchmark tool for benchmarking HCFS.
-
Remove obsolete fs.gs.inputstream.buffer.size property and related functionality.
-
Fix unauthenticated access support (fs.gs.auth.null.enable=true).
-
Improve cache hit ratio when fs.gs.performance.cache.enable property is set to true.
-
Remove obsolete configuration properties and related functionality:
fs.gs.auth.client.id
fs.gs.auth.client.file
fs.gs.auth.client.secret
-
Add a property that allows disabling HCFS semantic enforcement. If set to false, the GCS connector will not check whether a directory with the same name already exists when creating a new file, and vice versa.
fs.gs.create.items.conflict.check.enable (default: true)
-
Remove redundant properties:
fs.gs.config.override.file
fs.gs.copy.batch.threads
fs.gs.copy.max.requests.per.batch
-
Change default value of fs.gs.inputstream.min.range.request.size property from 524288 (512 KiB) to 2097152 (2 MiB).
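The difference between hsync() and hflush() under fs.gs.outputstream.sync.min.interval.ms, described in the list above, can be modelled roughly as follows. This is a minimal sketch with hypothetical names and an explicit clock argument so that it is deterministic. It is not the connector's implementation, which may use a different rate-limiting mechanism.

```java
// Hypothetical model of a minimum-interval limiter, not the connector's code.
final class SyncIntervalSketch {
  private final long minIntervalMs;
  private boolean hasSynced = false;
  private long nextAllowedMs;

  SyncIntervalSketch(long minIntervalMs) {
    this.minIntervalMs = minIntervalMs;
  }

  /**
   * hflush()-like path: never waits. Returns true if a sync may run now,
   * false if the call falls inside the minimum interval and should be a no-op.
   */
  synchronized boolean tryAcquire(long nowMs) {
    if (hasSynced && nowMs < nextAllowedMs) {
      return false;
    }
    hasSynced = true;
    nextAllowedMs = nowMs + minIntervalMs;
    return true;
  }

  /**
   * hsync()-like path: always proceeds. Returns how many milliseconds the
   * caller has to block before syncing, and reserves that slot.
   */
  synchronized long reserve(long nowMs) {
    long waitMs = hasSynced ? Math.max(0L, nextAllowedMs - nowMs) : 0L;
    hasSynced = true;
    nextAllowedMs = nowMs + waitMs + minIntervalMs;
    return waitMs;
  }

  public static void main(String[] args) {
    SyncIntervalSketch limiter = new SyncIntervalSketch(100);
    System.out.println(limiter.tryAcquire(0));   // true: first sync runs
    System.out.println(limiter.tryAcquire(40));  // false: hflush-like call is skipped
    System.out.println(limiter.reserve(40));     // 60: hsync-like call would block 60 ms
    System.out.println(limiter.tryAcquire(200)); // true: interval has elapsed
  }
}
```

With an interval of 0, the default, both paths always proceed immediately.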
BigQuery connector:
-
Update all dependencies to latest versions.
-
Fix BigQuery job status retrieval in non-US locations.
-
Use default logging backend for Google Flogger instead of Slf4j.
-
Remove unused mapred.bq.output.buffer.size configuration property.
-
Fix unauthenticated access support (mapred.bq.auth.null.enable=true).
-
Remove obsolete configuration properties and related functionality:
mapred.bq.auth.client.id
mapred.bq.auth.client.file
mapred.bq.auth.client.secret