Releases: GoogleCloudDataproc/hadoop-connectors
2020-04-02 (GCS 2.1.2, BQ 1.1.2)
Changelog
Cloud Storage connector:
- Update all dependencies to latest versions.
BigQuery connector:
- Update all dependencies to latest versions.
2020-03-11 (GCS 2.1.1, BQ 1.1.1)
Changelog
Cloud Storage connector:
- Add an upload cache to support high-level retries of failed uploads. The cache size is configured via a property and is disabled by default (zero or negative value):
  fs.gs.outputstream.upload.cache.size (default: 0)
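A minimal sketch of enabling the upload cache above through the standard Hadoop Configuration API; the bucket, object path, and 64 MiB cache size (assumed to be in bytes) are placeholders:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class UploadCacheExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Cache up to ~64 MiB of written data so a failed upload can be retried
      // at a higher level; 0 (the default) keeps the cache disabled.
      conf.setLong("fs.gs.outputstream.upload.cache.size", 64L * 1024 * 1024);

      FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf); // placeholder bucket
      try (FSDataOutputStream out = fs.create(new Path("gs://my-bucket/tmp/example.txt"))) {
        out.writeBytes("hello\n");
      }
    }
  }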
BigQuery connector:
- Fix shaded jar - add back missing relocated dependencies.
2020-03-09 (GCS 2.1.0, BQ 1.1.0)
Changelog
Cloud Storage connector:
- Update all dependencies to latest versions.
- Use storage.googleapis.com API endpoint.
- Fix proxy authentication when using JAVA_NET transport.
- Remove Log4j backend for Google Flogger.
- Add properties to override Google Cloud API endpoints:
  fs.gs.storage.root.url
  fs.gs.token.server.url
- Support adding custom HTTP headers to Cloud Storage API requests (see the sketch after this list):
  fs.gs.storage.http.headers.<HEADER>=<VALUE> (not set by default)
  Example:
  fs.gs.storage.http.headers.some-custom-header=custom_value
  fs.gs.storage.http.headers.another-custom-header=another_custom_value
- Always set the generation parameter for read requests and remove the fs.gs.generation.read.consistency property.
- Always use URI path encoding and remove the fs.gs.path.encoding property.
- Use Slf4j backend by default for Google Flogger.
- Remove list requests caching in PerformanceCachingGoogleCloudStorage and the fs.gs.performance.cache.list.caching.enable property.
- Stop caching non-existent (not found) items in the performance cache.
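A minimal sketch of attaching one of the custom HTTP headers above to every Cloud Storage request; the header name and value are the made-up example values from the list, and the bucket and path are placeholders:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CustomHeadersExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Each fs.gs.storage.http.headers.<HEADER> property adds one header to
      // every Cloud Storage API request made by the connector.
      conf.set("fs.gs.storage.http.headers.some-custom-header", "custom_value");

      FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf); // placeholder bucket
      for (FileStatus status : fs.listStatus(new Path("gs://my-bucket/data/"))) {
        System.out.println(status.getPath());
      }
    }
  }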
BigQuery connector:
- Update all dependencies to latest versions.
- Use bigquery.googleapis.com API endpoint.
- Fix proxy authentication when using JAVA_NET transport.
- Remove Log4j backend for Google Flogger.
- Add properties to override Google Cloud API endpoints (see the sketch after this list):
  mapred.bq.bigquery.root.url
  mapred.bq.token.server.url
- Use Slf4j backend by default for Google Flogger.
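A minimal sketch of overriding the BigQuery connector endpoints above in a job driver; the URLs are placeholders for whatever alternative endpoints are in use:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class BigQueryEndpointExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Point the connector at alternative API and token endpoints,
      // e.g. regional or private endpoints; these values are placeholders.
      conf.set("mapred.bq.bigquery.root.url", "https://bigquery.example-endpoint.internal/");
      conf.set("mapred.bq.token.server.url", "https://oauth2.example-endpoint.internal/token");

      Job job = Job.getInstance(conf, "bq-endpoint-example");
      // ... configure the BigQuery input/output formats and submit the job as usual.
    }
  }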
2020-02-13 (GCS 2.0.1, BQ 1.0.1)
Changelog
Cloud Storage connector:
- Cooperative Locking FSCK tool: fix recovery of operations that failed before creating an operation log file.
- Change Gson dependency scope from provided to compile in the gcsio library.
BigQuery connector:
- Fix shaded jar - add back missing relocated dependencies.
2019-08-23 (GCS 2.0.0, BQ 1.0.0)
Changelog
Cloud Storage connector:
- Remove Hadoop 1.x support.
- Do not convert path to directory path for inferred implicit directories.
- Do not parallelize GCS list requests, because it leads to too high QPS.
- Fix bug when GCS connector lists all files in a directory instead of the specified limit.
- Eagerly initialize GoogleCloudStorageReadChannel metadata if fs.gs.inputstream.fast.fail.on.not.found.enable is set to true.
- Add support for Hadoop Delegation Tokens (based on HADOOP-14556). Configurable via the fs.gs.delegation.token.binding property.
- Remove obsolete fs.gs.file.size.limit.250gb property.
- Repair implicit directories during delete and rename operations instead of list and glob operations.
- Log HTTP 429 Too Many Requests responses from GCS at a rate of 1 per 10 seconds.
- Remove obsolete fs.gs.create.marker.files.enable property.
- Remove system bucket feature and related properties:
  fs.gs.system.bucket
  fs.gs.system.bucket.create
- Remove obsolete fs.gs.performance.cache.dir.metadata.prefetch.limit property.
- Add a property to parallelize GCS requests in getFileStatus and listStatus methods to reduce latency:
  fs.gs.status.parallel.enable (default: false)
  Setting this property to true will cause the GCS connector to send more GCS requests, which will decrease latency but also increase the cost of getFileStatus and listStatus method calls.
- Add a property to enable GCS direct upload:
  fs.gs.outputstream.direct.upload.enable (default: false)
- Update all dependencies to latest versions.
- Support Cooperative Locking for directory operations:
  fs.gs.cooperative.locking.enable (default: false)
  fs.gs.cooperative.locking.expiration.timeout.ms (default: 120,000)
  fs.gs.cooperative.locking.max.concurrent.operations (default: 20)
- Add FSCK tool for recovery of failed Cooperative Locking for directory operations:
  hadoop jar /usr/lib/hadoop/lib/gcs-connector.jar \
    com.google.cloud.hadoop.fs.gcs.CoopLockFsck \
    --{check,rollBack,rollForward} gs://<bucket_name> [all|<operation_id>]
- Implement the Hadoop File System append method using the GCS compose API (see the sketch after this list).
- Disable support for reading GZIP encoded files (HTTP header Content-Encoding: gzip), because processing of GZIP encoded files is inefficient and error-prone in Hadoop and Spark. This feature is configurable with the property:
  fs.gs.inputstream.support.gzip.encoding.enable (default: false)
- Remove parent directory timestamp update feature and related properties:
  fs.gs.parent.timestamp.update.enable
  fs.gs.parent.timestamp.update.substrings.excludes
  fs.gs.parent.timestamp.update.substrings.includes
  This feature was enabled by default only for job history files, but it is no longer necessary for the Job History Server to work properly after MAPREDUCE-7101.
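A minimal sketch of the new append support from a client's point of view, using the standard Hadoop FileSystem API; the bucket and object path are placeholders, and enabling fs.gs.status.parallel.enable is shown only as an example of the optional properties above:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class GcsAppendExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Optional: trade extra GCS requests for lower getFileStatus/listStatus latency.
      conf.setBoolean("fs.gs.status.parallel.enable", true);

      FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf); // placeholder bucket
      Path log = new Path("gs://my-bucket/logs/app.log"); // placeholder object

      // Standard Hadoop append call; with connector 2.0.0 it is implemented
      // on top of the Cloud Storage compose API.
      try (FSDataOutputStream out = fs.append(log)) {
        out.writeBytes("one more line\n");
      }
    }
  }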
BigQuery connector:
- Remove Hadoop 1.x support.
- Remove deprecated features and associated properties:
  mapred.bq.input.query
  mapred.bq.query.results.table.delete
  mapred.bq.input.sharded.export.enable
- Remove obsolete mapred.bq.output.async.write.enabled property.
- Support nested record type in field schema in BigQuery connector.
- Remove dependency on GCS connector code.
- Add a property to specify BigQuery tables partitioning definition (see the sketch after this list):
  mapred.bq.output.table.partitioning
- Add a new DirectBigQueryInputFormat for processing data through the BigQuery Storage API. This input format is configurable via properties:
  mapred.bq.input.sql.filter
  mapred.bq.input.selected.fields
  mapred.bq.input.skew.limit
- Update all dependencies to latest versions.
- Add a property to control the max number of attempts when polling for the next file. By default the max number of attempts is unlimited (-1 value):
  mapred.bq.dynamic.file.list.record.reader.poll.max.attempts (default: -1)
- Add a property to specify output table create disposition:
  mapred.bq.output.table.createdisposition (default: CREATE_IF_NEEDED)
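A minimal sketch of setting several of the new BigQuery connector properties above in a job driver; the DAY partitioning JSON, the CREATE_NEVER disposition, and the 300-attempt limit are assumed example values, not defaults:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class BigQueryOutputOptionsExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Assumed example of a time-partitioning definition for the output table.
      conf.set("mapred.bq.output.table.partitioning", "{\"type\":\"DAY\"}");
      // Fail the job instead of creating the output table if it does not exist.
      conf.set("mapred.bq.output.table.createdisposition", "CREATE_NEVER");
      // Stop polling for the next exported file after 300 attempts instead of retrying forever.
      conf.setInt("mapred.bq.dynamic.file.list.record.reader.poll.max.attempts", 300);

      Job job = Job.getInstance(conf, "bq-output-options-example");
      // ... configure the BigQuery output format, mapper/reducer classes, and submit.
    }
  }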
2019-07-01 (GCS 2.0.0-RC2, BQ 1.0.0-RC2)
v2.0.0-RC2 Release GCS connector 2.0.0-RC2 and BQ connector 1.0.0-RC2.
2019-06-28 (GCS 2.0.0-RC1, BQ 1.0.0-RC1)
v2.0.0-RC1 Release GCS connector 2.0.0-RC1 and BQ connector 1.0.0-RC1.
2019-05-15 (GCS 1.9.17, BQ 0.13.17)
Changelog
Cloud Storage connector:
- Add a property to parallelize GCS requests in getFileStatus and listStatus methods to reduce latency:
  fs.gs.status.parallel.enable (default: false)
  Setting this property to true will cause the GCS connector to send more GCS requests, which will decrease latency but also increase the cost of getFileStatus and listStatus method calls.
BigQuery connector:
- POM updates for GCS connector 1.9.17.
- Support nested record type in field schema in BigQuery connector.
- Add a property to specify BigQuery tables partitioning definition:
  mapred.bq.output.table.partitioning
2019-02-25 (GCS 1.9.16, BQ 0.13.16)
Changelog
Cloud Storage connector:
- Fix bug when GCS connector lists all files in a directory instead of the specified limit.
- Eagerly initialize GoogleCloudStorageReadChannel metadata if fs.gs.inputstream.fast.fail.on.not.found.enable is set to true.
BigQuery connector:
- POM updates for GCS connector 1.9.16.
2019-02-21 (GCS 1.9.15, BQ 0.13.15)
Changelog
Cloud Storage connector:
- Do not convert path to directory path for inferred implicit directories.
- Do not parallelize GCS list requests, because it leads to too high QPS.
BigQuery connector:
- POM updates for GCS connector 1.9.15.