2019-08-23 (GCS 2.0.0, BQ 1.0.0)
Changelog
Cloud Storage connector:
- Remove Hadoop 1.x support.
- Do not convert path to directory path for inferred implicit directories.
- Do not parallelize GCS list requests, because it results in excessively high QPS.
- Fix a bug where the GCS connector listed all files in a directory instead of respecting the specified limit.
- Eagerly initialize GoogleCloudStorageReadChannel metadata if fs.gs.inputstream.fast.fail.on.not.found.enable is set to true.
- Add support for Hadoop Delegation Tokens (based on HADOOP-14556). Configurable via the fs.gs.delegation.token.binding property.
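  For illustration, a minimal sketch of selecting a binding programmatically; com.example.MyTokenBinding is a hypothetical placeholder, not a class shipped with the connector:

    import org.apache.hadoop.conf.Configuration;

    public class DelegationTokenBindingExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "com.example.MyTokenBinding" is a hypothetical placeholder for a
        // delegation token binding implementation class.
        conf.set("fs.gs.delegation.token.binding", "com.example.MyTokenBinding");
      }
    }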
- Remove obsolete fs.gs.file.size.limit.250gb property.
- Repair implicit directories during delete and rename operations instead of list and glob operations.
- Log HTTP 429 Too Many Requests responses from GCS at a rate of 1 per 10 seconds.
- Remove obsolete fs.gs.create.marker.files.enable property.
- Remove system bucket feature and related properties:
    fs.gs.system.bucket
    fs.gs.system.bucket.create
- Remove obsolete fs.gs.performance.cache.dir.metadata.prefetch.limit property.
- Add a property to parallelize GCS requests in getFileStatus and listStatus methods to reduce latency:
    fs.gs.status.parallel.enable (default: false)
  Setting this property to true causes the GCS connector to issue more GCS requests, which decreases latency but increases the cost of getFileStatus and listStatus calls.
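  For example, a minimal sketch that enables the behavior programmatically (the property can equally be set in core-site.xml):

    import org.apache.hadoop.conf.Configuration;

    public class ParallelStatusExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Trade additional GCS requests (higher cost) for lower
        // getFileStatus/listStatus latency.
        conf.setBoolean("fs.gs.status.parallel.enable", true);
      }
    }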
- Add a property to enable GCS direct upload:
    fs.gs.outputstream.direct.upload.enable (default: false)
- Update all dependencies to latest versions.
- Support Cooperative Locking for directory operations:
    fs.gs.cooperative.locking.enable (default: false)
    fs.gs.cooperative.locking.expiration.timeout.ms (default: 120,000)
    fs.gs.cooperative.locking.max.concurrent.operations (default: 20)
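  A minimal configuration sketch, assuming the properties are set programmatically; the values shown are the listed defaults except for the enable flag:

    import org.apache.hadoop.conf.Configuration;

    public class CoopLockExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setBoolean("fs.gs.cooperative.locking.enable", true);
        // Lock expiration timeout in milliseconds (default: 120,000).
        conf.setLong("fs.gs.cooperative.locking.expiration.timeout.ms", 120_000L);
        // Maximum number of concurrent directory operations (default: 20).
        conf.setInt("fs.gs.cooperative.locking.max.concurrent.operations", 20);
      }
    }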
- Add FSCK tool for recovery of directory operations that failed under Cooperative Locking:
    hadoop jar /usr/lib/hadoop/lib/gcs-connector.jar \
        com.google.cloud.hadoop.fs.gcs.CoopLockFsck \
        --{check,rollBack,rollForward} gs://<bucket_name> [all|<operation_id>]
- Implement the Hadoop File System append method using the GCS compose API.
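  A sketch of the standard Hadoop FileSystem append call that this change now backs with the compose API; the bucket and object names are hypothetical placeholders:

    import java.net.URI;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GcsAppendExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "my-bucket" and "logs/app.log" are placeholders.
        FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf);
        try (FSDataOutputStream out = fs.append(new Path("gs://my-bucket/logs/app.log"))) {
          out.write("appended line\n".getBytes(StandardCharsets.UTF_8));
        }
      }
    }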
- Disable support for reading GZIP-encoded files (HTTP header Content-Encoding: gzip), because processing of GZIP-encoded files is inefficient and error-prone in Hadoop and Spark. This feature is configurable with the property:
    fs.gs.inputstream.support.gzip.encoding.enable (default: false)
- Remove parent directory timestamp update feature and related properties:
    fs.gs.parent.timestamp.update.enable
    fs.gs.parent.timestamp.update.substrings.excludes
    fs.gs.parent.timestamp.update.substrings.includes
  This feature was enabled by default only for job history files, and after MAPREDUCE-7101 it is no longer necessary for the Job History Server to work properly.
BigQuery connector:
- Remove Hadoop 1.x support.
- Remove deprecated features and associated properties:
    mapred.bq.input.query
    mapred.bq.query.results.table.delete
    mapred.bq.input.sharded.export.enable
- Remove obsolete mapred.bq.output.async.write.enabled property.
- Support nested record types in field schemas.
- Remove dependency on GCS connector code.
- Add a property to specify the BigQuery table partitioning definition:
    mapred.bq.output.table.partitioning
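  A sketch under the assumption that the property accepts a JSON TimePartitioning definition, as the BigQuery API does; the exact value format is an assumption here:

    import org.apache.hadoop.conf.Configuration;

    public class PartitioningExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Assumes a JSON TimePartitioning definition; day partitioning shown.
        conf.set("mapred.bq.output.table.partitioning", "{\"type\": \"DAY\"}");
      }
    }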
- Add a new DirectBigQueryInputFormat for processing data through the BigQuery Storage API. This input format is configurable via the following properties:
    mapred.bq.input.sql.filter
    mapred.bq.input.selected.fields
    mapred.bq.input.skew.limit
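  A configuration sketch; the filter expression, field list, and skew value below are illustrative assumptions, not documented defaults:

    import org.apache.hadoop.conf.Configuration;

    public class DirectBigQueryExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Row filter pushed down to the BigQuery Storage API (illustrative).
        conf.set("mapred.bq.input.sql.filter", "state = \"CA\"");
        // Read only the listed columns (illustrative).
        conf.set("mapred.bq.input.selected.fields", "name,state,total");
        // Illustrative skew limit; the property's exact semantics are
        // defined by the connector.
        conf.setDouble("mapred.bq.input.skew.limit", 1.5);
      }
    }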
- Update all dependencies to latest versions.
- Add a property to control the maximum number of attempts when polling for the next file. By default the number of attempts is unlimited (the -1 value):
    mapred.bq.dynamic.file.list.record.reader.poll.max.attempts (default: -1)
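  For example, a sketch that caps polling at 10 attempts instead of retrying indefinitely:

    import org.apache.hadoop.conf.Configuration;

    public class PollAttemptsExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // -1 (the default) means unlimited attempts; 10 is an illustrative cap.
        conf.setInt("mapred.bq.dynamic.file.list.record.reader.poll.max.attempts", 10);
      }
    }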
- Add a property to specify the output table create disposition:
    mapred.bq.output.table.createdisposition (default: CREATE_IF_NEEDED)