Releases: GoogleCloudDataproc/hadoop-connectors
2018-04-12 (GCS 1.6.5, BQ 0.10.6)
Changelog
Cloud Storage connector:
- Add support for using Cloud Storage Rewrite requests for copy operations:
  fs.gs.copy.with.rewrite.enable (default: false)
  This allows copying files between different locations and storage classes.
- Update all dependencies to latest versions.
- Decrease default value for max requests per batch from 1,000 to 30.
- Make max requests per batch value configurable with property:
  fs.gs.max.requests.per.batch (default: 30)
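The effect of a per-batch cap can be pictured as splitting a list of pending requests into chunks of at most that size. The following is a minimal Python sketch of that idea, not the connector's implementation; `split_into_batches` and the request strings are hypothetical names used only for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Mirrors the new default of fs.gs.max.requests.per.batch from the note above.
DEFAULT_MAX_REQUESTS_PER_BATCH = 30


def split_into_batches(
    requests: Sequence[T],
    max_per_batch: int = DEFAULT_MAX_REQUESTS_PER_BATCH,
) -> List[List[T]]:
    """Split requests into consecutive batches of at most max_per_batch items.

    Hypothetical helper: it only illustrates what a per-batch limit means.
    """
    if max_per_batch <= 0:
        raise ValueError("max_per_batch must be positive")
    return [
        list(requests[i:i + max_per_batch])
        for i in range(0, len(requests), max_per_batch)
    ]


# 100 pending requests are sent as batches of 30, 30, 30 and 10.
pending = [f"delete-object-{n}" for n in range(100)]
print([len(batch) for batch in split_into_batches(pending)])
```

With the previous default of 1,000, the same 100 requests would have gone out as a single batch.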
BigQuery connector:
- Wire location through load, extract, and query jobs.
- Always require at least 2 partitions for sharded exports.
- Update all dependencies to latest versions.
- POM updates for GCS connector 1.6.5.
2018-03-29 (GCS 1.8.1, BQ 0.12.1)
Changelog
Cloud Storage connector:
- Add AUTO mode support for Cloud Storage Requester Pays feature.
- Add support for using Cloud Storage Rewrite requests for copy operations:
  fs.gs.copy.with.rewrite.enable (default: false)
  This allows copying files between different locations and storage classes.
BigQuery connector:
- Wire location through load, extract, and query jobs.
- Always require at least 2 partitions for sharded exports.
- POM updates for GCS connector 1.8.1.
2018-03-19 (GCS 1.6.4, BQ 0.10.5)
Changelog
Cloud Storage connector:
- Fixed an issue where JSON auth files containing user auth (e.g. application_default_credentials.json) did not work with google.cloud.auth.service.account.json.keyfile.
- Honor GOOGLE_APPLICATION_DEFAULT_CREDENTIALS environment variable for Google Application Default Credentials (but not other defaults).
- Make fs.gs.project.id optional. It is still required for listing buckets, creating buckets, and the entire BigQuery connector.
- Disable GCS Metadata Cache by default (i.e. set default value of fs.gs.metadata.cache.enable property to false).
- Support GCS Requester Pays feature, which can be configured with new properties:
  fs.gs.requester.pays.mode (default=DISABLED)
  fs.gs.requester.pays.project.id (no default value)
  fs.gs.requester.pays.buckets (no default value)
- Add support for specifying a pattern for marker files that should be copied last during a folder rename operation. The pattern is configured with property:
  fs.gs.marker.file.pattern (no default value)
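The "copied last" ordering for marker files can be sketched as a stable sort that moves matching files to the end of the copy list. This is a minimal Python illustration under stated assumptions: the pattern is treated here as a regular expression matched against the file name, which the release note does not specify, and `order_for_rename` is a hypothetical helper rather than a connector API.

```python
import re
from typing import List


def order_for_rename(paths: List[str], marker_pattern: str) -> List[str]:
    """Return paths with files matching marker_pattern moved to the end.

    Assumption: the pattern is a regular expression applied to the last
    path component. The connector's actual matching rules may differ.
    """
    marker = re.compile(marker_pattern)

    def is_marker(path: str) -> bool:
        file_name = path.rsplit("/", 1)[-1]
        return marker.fullmatch(file_name) is not None

    # sorted() is stable and False sorts before True, so non-marker files
    # keep their relative order and marker files follow them.
    return sorted(paths, key=is_marker)


files = ["out/_SUCCESS", "out/part-00000", "out/part-00001"]
print(order_for_rename(files, r"_SUCCESS"))
# ['out/part-00000', 'out/part-00001', 'out/_SUCCESS']
```

Copying a completion marker such as `_SUCCESS` last means a reader that sees the marker at the destination can expect the data files to be there already.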
BigQuery connector:
- POM updates for GCS connector 1.6.4.
- Remove Avro and Gson classes from Hadoop 2 shaded jar because they are already included in the Hadoop 2 distribution.
2018-03-15 (GCS 1.8.0, BQ 0.12.0)
Changelog
Cloud Storage connector:
- Support GCS Requester Pays feature, which can be configured with new properties:
  fs.gs.requester.pays.mode (default=DISABLED)
  fs.gs.requester.pays.project.id (no default value)
  fs.gs.requester.pays.buckets (no default value)
- Change relocation package in shaded jar to be connector-specific.
- Add support for specifying a pattern for marker files that should be copied last during a folder rename operation. The pattern is configured with property:
  fs.gs.marker.file.pattern (no default value)
- Minimum required Java version is now Java 8.
BigQuery connector:
- POM updates for GCS connector 1.8.0.
- Change relocation package in shaded jar to be connector-specific.
- Minimum required Java version is now Java 8.
2018-02-22 (GCS 1.7.0, BQ 0.11.0)
Changelog
Cloud Storage connector:
- Fixed an issue where JSON auth files containing user auth (e.g. application_default_credentials.json) did not work with google.cloud.auth.service.account.json.keyfile.
- Honor GOOGLE_APPLICATION_DEFAULT_CREDENTIALS environment variable for Google Application Default Credentials (but not other defaults).
- Make fs.gs.project.id optional. It is still required for listing buckets, creating buckets, and the entire BigQuery connector.
- Relocate all dependencies in shaded jar.
- Update all dependencies to latest versions.
- Disable GCS Metadata Cache by default (i.e. set default value of fs.gs.metadata.cache.enable property to false).
BigQuery connector:
- Relocate all dependencies in shaded jar.
- Update all dependencies to latest versions.
- POM updates for GCS connector 1.7.0.
2018-01-25 (GCS 1.6.3, BQ 0.10.4)
Changelog
Cloud Storage connector:
- Use new GCS batch requests endpoint.
BigQuery connector:
- POM updates for GCS connector 1.6.3.
2017-11-21 (GCS 1.6.2, BQ 0.10.3)
Changelog
Cloud Storage connector:
- Wire HTTP transport settings into Credential logic.
BigQuery connector:
- POM updates for GCS connector 1.6.2.
2017-04-20 (GCS 1.6.1, BQ 0.10.2)
Changelog
Cloud Storage connector:
- Added a polling loop when determining if a createEmptyObjects error can safely be ignored, and expanded the cases in which we will attempt to determine if an empty object already exists.
  Previously, if a rate limiting exception was encountered while creating empty objects, the connector would issue a single get request for that object. If the object existed and was zero length, we would consider the createEmptyObjects call successful and suppress the rate limit exception.
  The new implementation polls for the existence of the object, up to a user-configurable maximum wait, when either a rate limiting error or a 500-level error occurs. The maximum can be configured with the following setting:
  fs.gs.max.wait.for.empty.object.creation.ms
  Any positive value for this setting is interpreted to mean "poll for up to this many milliseconds before making a final determination". The default value causes a maximum wait of 3 seconds. Polling can be disabled by setting this key to 0.
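The polling behavior described above can be sketched as a deadline-bounded loop. This is a minimal Python illustration, not the connector's code: `empty_object_exists_within`, the `get_object_size` callback, and the poll interval are hypothetical. With a maximum wait of 0 the sketch performs a single check, which mirrors the earlier behavior; whether the connector does exactly that when polling is disabled is an assumption.

```python
import time
from typing import Callable, Optional


def empty_object_exists_within(
    get_object_size: Callable[[], Optional[int]],
    max_wait_ms: int,
    poll_interval_ms: int = 100,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until a zero-length object is observed or max_wait_ms elapses.

    get_object_size is a hypothetical callback that returns the object's
    size in bytes, or None when the object is not (yet) visible.
    """
    deadline = clock() + max_wait_ms / 1000.0
    while True:
        if get_object_size() == 0:
            return True  # Empty object exists: the error can be suppressed.
        if clock() >= deadline:
            return False  # Final determination: the object was not found.
        sleep(poll_interval_ms / 1000.0)


# Simulated lookups: the object becomes visible on the third attempt.
responses = iter([None, None, 0])
found = empty_object_exists_within(
    lambda: next(responses), max_wait_ms=3000, poll_interval_ms=10
)
print(found)  # True
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.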
BigQuery connector:
- POM updates for GCS connector 1.6.1.
2016-12-16 (GCS 1.6.0, BigQuery 0.10.1)
Changelog
Cloud Storage connector:
- Added new PerformanceCachingGoogleCloudStorage; unlike the existing CacheSupplementedGoogleCloudStorage, which only serves as an advisory cache for enforcement of list consistency, the new optional caching layer is able to serve certain metadata and listing requests purely out of a short-lived in-memory cache to enhance performance of some workloads. By default this feature is disabled, and can be controlled with the config settings:
  fs.gs.performance.cache.enable=true (default=false)
  fs.gs.performance.cache.list.caching.enable=true (default=false)
  The first option enables the cache to serve getFileStatus requests, while the second option additionally enables serving listStatus. The duration of cache entries can be controlled with:
  fs.gs.performance.cache.max.entry.age.ms (default=3000)
  It is not recommended to always run with this feature enabled; it should be used specifically to address cases where frameworks perform redundant sequential list/stat operations in a non-distributed manner, and on datasets which are not frequently changing. It is additionally advised to validate data integrity separately whenever using this feature. There is no cooperative cache invalidation between different processes when using this feature, so concurrent mutations to a location from multiple clients will produce inconsistent/stale results if this feature is enabled.
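The trade-off described above (fewer metadata requests at the cost of possibly stale answers) can be seen in a small age-bounded cache. This is a minimal Python sketch of the general technique, not the connector's PerformanceCachingGoogleCloudStorage; `ShortLivedCache` and the `load` callback are hypothetical names, and the 3000 ms default simply echoes fs.gs.performance.cache.max.entry.age.ms.

```python
import time
from typing import Callable, Dict, Generic, Tuple, TypeVar

V = TypeVar("V")


class ShortLivedCache(Generic[V]):
    """In-memory cache whose entries expire after max_entry_age_ms."""

    def __init__(
        self,
        max_entry_age_ms: int = 3000,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_age_s = max_entry_age_ms / 1000.0
        self._clock = clock
        # Maps key -> (time the value was loaded, value).
        self._entries: Dict[str, Tuple[float, V]] = {}

    def get(self, key: str, load: Callable[[str], V]) -> V:
        """Return a fresh cached value, or call load(key) and cache it."""
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._max_age_s:
            return cached[1]  # Served from memory; may be stale.
        value = load(key)
        self._entries[key] = (now, value)
        return value


# A fake clock makes the expiry visible without real waiting.
fake_now = [0.0]
backend_calls = []


def stat_from_backend(path: str) -> str:
    backend_calls.append(path)
    return f"status of {path} (lookup #{len(backend_calls)})"


cache = ShortLivedCache(max_entry_age_ms=3000, clock=lambda: fake_now[0])
cache.get("gs://bucket/dir/file", stat_from_backend)  # Miss: calls backend.
fake_now[0] = 1.0
cache.get("gs://bucket/dir/file", stat_from_backend)  # Hit: no backend call.
fake_now[0] = 3.5
cache.get("gs://bucket/dir/file", stat_from_backend)  # Expired: calls backend.
print(len(backend_calls))  # 2
```

Nothing in this sketch invalidates an entry when another process changes the underlying object, which is the same limitation the note warns about for concurrent mutations.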
BigQuery connector:
- Added a configurable write disposition when using IndirectBigQueryOutputFormat, with WRITE_APPEND as the default.
- POM updates for GCS connector 1.6.0.
2016-11-07 (GCS 1.5.5, BigQuery 0.10.0)
Changelog
Cloud Storage connector:
- Minor refactoring of logic in CacheSupplementedGoogleCloudStorage to extract a reusable ForwardingGoogleCloudStorage that can be used for other GCS-delegating implementations.
BigQuery connector:
- Update output configuration keys to conform to the format in BigQueryConfiguration and have BigQueryOutputConfiguration handle the output path resolution and configuration.
- POM updates for GCS connector 1.5.5.