Releases: GoogleCloudDataproc/hadoop-connectors
2020-04-02 (GCS 2.1.2, BQ 1.1.2)
Changelog
Cloud Storage connector:
- Update all dependencies to latest versions.
BigQuery connector:
- Update all dependencies to latest versions.
2020-03-11 (GCS 2.1.1, BQ 1.1.1)
Changelog
Cloud Storage connector:
- Add an upload cache to support high-level retries of failed uploads. The cache size is configured via a property and is disabled by default (zero or negative value):
  fs.gs.outputstream.upload.cache.size (default: 0)
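A minimal sketch of enabling the upload cache above through the standard Hadoop Configuration API; the bucket, object path, and 64 MiB cache size (assumed to be in bytes) are placeholders:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class UploadCacheExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Cache up to ~64 MiB of written data so a failed upload can be retried
      // at a higher level; 0 (the default) keeps the cache disabled.
      conf.setLong("fs.gs.outputstream.upload.cache.size", 64L * 1024 * 1024);

      FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf); // placeholder bucket
      try (FSDataOutputStream out = fs.create(new Path("gs://my-bucket/tmp/example.txt"))) {
        out.writeBytes("hello\n");
      }
    }
  }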
BigQuery connector:
- Fix shaded jar - add back missing relocated dependencies.
2020-03-09 (GCS 2.1.0, BQ 1.1.0)
Changelog
Cloud Storage connector:
- Update all dependencies to latest versions.
- Use storage.googleapis.com API endpoint.
- Fix proxy authentication when using JAVA_NET transport.
- Remove Log4j backend for Google Flogger.
- Add properties to override Google Cloud API endpoints:
  fs.gs.storage.root.url
  fs.gs.token.server.url
- Support adding custom HTTP headers to Cloud Storage API requests (see the sketch after this list):
  fs.gs.storage.http.headers.<HEADER>=<VALUE> (not set by default)
  Example:
  fs.gs.storage.http.headers.some-custom-header=custom_value
  fs.gs.storage.http.headers.another-custom-header=another_custom_value
- Always set the generation parameter for read requests and remove the fs.gs.generation.read.consistency property.
- Always use URI path encoding and remove the fs.gs.path.encoding property.
- Use Slf4j backend by default for Google Flogger.
- Remove list requests caching in PerformanceCachingGoogleCloudStorage and the fs.gs.performance.cache.list.caching.enable property.
- Stop caching non-existent (not found) items in the performance cache.
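A minimal sketch of attaching one of the custom HTTP headers above to every Cloud Storage request; the header name and value are the made-up example values from the list, and the bucket and path are placeholders:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CustomHeadersExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Each fs.gs.storage.http.headers.<HEADER> property adds one header to
      // every Cloud Storage API request made by the connector.
      conf.set("fs.gs.storage.http.headers.some-custom-header", "custom_value");

      FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf); // placeholder bucket
      for (FileStatus status : fs.listStatus(new Path("gs://my-bucket/data/"))) {
        System.out.println(status.getPath());
      }
    }
  }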
BigQuery connector:
- Update all dependencies to latest versions.
- Use bigquery.googleapis.com API endpoint.
- Fix proxy authentication when using JAVA_NET transport.
- Remove Log4j backend for Google Flogger.
- Add properties to override Google Cloud API endpoints (see the sketch after this list):
  mapred.bq.bigquery.root.url
  mapred.bq.token.server.url
- Use Slf4j backend by default for Google Flogger.
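A minimal sketch of overriding the BigQuery connector endpoints above in a job driver; the URLs are placeholders for whatever alternative endpoints are in use:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class BigQueryEndpointExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Point the connector at alternative API and token endpoints,
      // e.g. regional or private endpoints; these values are placeholders.
      conf.set("mapred.bq.bigquery.root.url", "https://bigquery.example-endpoint.internal/");
      conf.set("mapred.bq.token.server.url", "https://oauth2.example-endpoint.internal/token");

      Job job = Job.getInstance(conf, "bq-endpoint-example");
      // ... configure the BigQuery input/output formats and submit the job as usual.
    }
  }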
2020-02-13 (GCS 2.0.1, BQ 1.0.1)
Changelog
Cloud Storage connector:
- Cooperative Locking FSCK tool: fix recovery of operations that failed before creating an operation log file.
- Change Gson dependency scope from provided to compile in the gcsio library.
BigQuery connector:
- Fix shaded jar - add back missing relocated dependencies.
2019-08-23 (GCS 2.0.0, BQ 1.0.0)
Changelog
Cloud Storage connector:
- Remove Hadoop 1.x support.
- Do not convert path to directory path for inferred implicit directories.
- Do not parallelize GCS list requests, because it leads to too high QPS.
- Fix bug when GCS connector lists all files in a directory instead of the specified limit.
- Eagerly initialize GoogleCloudStorageReadChannel metadata if fs.gs.inputstream.fast.fail.on.not.found.enable is set to true.
- Add support for Hadoop Delegation Tokens (based on HADOOP-14556). Configurable via the fs.gs.delegation.token.binding property.
- Remove obsolete fs.gs.file.size.limit.250gb property.
- Repair implicit directories during delete and rename operations instead of list and glob operations.
- Log HTTP 429 Too Many Requests responses from GCS at a rate of 1 per 10 seconds.
- Remove obsolete fs.gs.create.marker.files.enable property.
- Remove system bucket feature and related properties:
  fs.gs.system.bucket
  fs.gs.system.bucket.create
- Remove obsolete fs.gs.performance.cache.dir.metadata.prefetch.limit property.
- Add a property to parallelize GCS requests in getFileStatus and listStatus methods to reduce latency:
  fs.gs.status.parallel.enable (default: false)
  Setting this property to true will cause the GCS connector to send more GCS requests, which will decrease latency but also increase the cost of getFileStatus and listStatus method calls.
- Add a property to enable GCS direct upload:
  fs.gs.outputstream.direct.upload.enable (default: false)
- Update all dependencies to latest versions.
- Support Cooperative Locking for directory operations:
  fs.gs.cooperative.locking.enable (default: false)
  fs.gs.cooperative.locking.expiration.timeout.ms (default: 120,000)
  fs.gs.cooperative.locking.max.concurrent.operations (default: 20)
- Add FSCK tool for recovery of failed Cooperative Locking for directory operations:
  hadoop jar /usr/lib/hadoop/lib/gcs-connector.jar \
    com.google.cloud.hadoop.fs.gcs.CoopLockFsck \
    --{check,rollBack,rollForward} gs://<bucket_name> [all|<operation_id>]
- Implement the Hadoop File System append method using the GCS compose API (see the sketch after this list).
- Disable support for reading GZIP encoded files (HTTP header Content-Encoding: gzip), because processing of GZIP encoded files is inefficient and error-prone in Hadoop and Spark. This feature is configurable with the property:
  fs.gs.inputstream.support.gzip.encoding.enable (default: false)
- Remove parent directory timestamp update feature and related properties:
  fs.gs.parent.timestamp.update.enable
  fs.gs.parent.timestamp.update.substrings.excludes
  fs.gs.parent.timestamp.update.substrings.includes
  This feature was enabled by default only for job history files, but it is no longer necessary for the Job History Server to work properly after MAPREDUCE-7101.
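A minimal sketch of the new append support from a client's point of view, using the standard Hadoop FileSystem API; the bucket and object path are placeholders, and enabling fs.gs.status.parallel.enable is shown only as an example of the optional properties above:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class GcsAppendExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Optional: trade extra GCS requests for lower getFileStatus/listStatus latency.
      conf.setBoolean("fs.gs.status.parallel.enable", true);

      FileSystem fs = FileSystem.get(URI.create("gs://my-bucket/"), conf); // placeholder bucket
      Path log = new Path("gs://my-bucket/logs/app.log"); // placeholder object

      // Standard Hadoop append call; with connector 2.0.0 it is implemented
      // on top of the Cloud Storage compose API.
      try (FSDataOutputStream out = fs.append(log)) {
        out.writeBytes("one more line\n");
      }
    }
  }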
BigQuery connector:
- Remove Hadoop 1.x support.
- Remove deprecated features and associated properties:
  mapred.bq.input.query
  mapred.bq.query.results.table.delete
  mapred.bq.input.sharded.export.enable
- Remove obsolete mapred.bq.output.async.write.enabled property.
- Support nested record type in field schema in BigQuery connector.
- Remove dependency on GCS connector code.
- Add a property to specify BigQuery tables partitioning definition (see the sketch after this list):
  mapred.bq.output.table.partitioning
- Add a new DirectBigQueryInputFormat for processing data through the BigQuery Storage API. This input format is configurable via properties:
  mapred.bq.input.sql.filter
  mapred.bq.input.selected.fields
  mapred.bq.input.skew.limit
- Update all dependencies to latest versions.
- Add a property to control the max number of attempts when polling for the next file. By default the max number of attempts is unlimited (-1 value):
  mapred.bq.dynamic.file.list.record.reader.poll.max.attempts (default: -1)
- Add a property to specify output table create disposition:
  mapred.bq.output.table.createdisposition (default: CREATE_IF_NEEDED)
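A minimal sketch of setting several of the new BigQuery connector properties above in a job driver; the DAY partitioning JSON, the CREATE_NEVER disposition, and the 300-attempt limit are assumed example values, not defaults:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class BigQueryOutputOptionsExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Assumed example of a time-partitioning definition for the output table.
      conf.set("mapred.bq.output.table.partitioning", "{\"type\":\"DAY\"}");
      // Fail the job instead of creating the output table if it does not exist.
      conf.set("mapred.bq.output.table.createdisposition", "CREATE_NEVER");
      // Stop polling for the next exported file after 300 attempts instead of retrying forever.
      conf.setInt("mapred.bq.dynamic.file.list.record.reader.poll.max.attempts", 300);

      Job job = Job.getInstance(conf, "bq-output-options-example");
      // ... configure the BigQuery output format, mapper/reducer classes, and submit.
    }
  }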
2019-07-01 (GCS 2.0.0-RC2, BQ 1.0.0-RC2)
v2.0.0-RC2 Release GCS connector 2.0.0-RC2 and BQ connector 1.0.0-RC2.
2019-06-28 (GCS 2.0.0-RC1, BQ 1.0.0-RC1)
v2.0.0-RC1 Release GCS connector 2.0.0-RC1 and BQ connector 1.0.0-RC1.
2019-05-15 (GCS 1.9.17, BQ 0.13.17)
Changelog
Cloud Storage connector:
- Add a property to parallelize GCS requests in getFileStatus and listStatus methods to reduce latency:
  fs.gs.status.parallel.enable (default: false)
  Setting this property to true will cause the GCS connector to send more GCS requests, which will decrease latency but also increase the cost of getFileStatus and listStatus method calls.
BigQuery connector:
- POM updates for GCS connector 1.9.17.
- Support nested record type in field schema in BigQuery connector.
- Add a property to specify BigQuery tables partitioning definition:
  mapred.bq.output.table.partitioning
2019-02-25 (GCS 1.9.16, BQ 0.13.16)
Changelog
Cloud Storage connector:
- Fix bug when GCS connector lists all files in a directory instead of the specified limit.
- Eagerly initialize GoogleCloudStorageReadChannel metadata if fs.gs.inputstream.fast.fail.on.not.found.enable is set to true.
BigQuery connector:
- POM updates for GCS connector 1.9.16.
2019-02-21 (GCS 1.9.15, BQ 0.13.15)
Changelog
Cloud Storage connector:
- Do not convert path to directory path for inferred implicit directories.
- Do not parallelize GCS list requests, because it leads to too high QPS.
BigQuery connector:
- POM updates for GCS connector 1.9.15.