
Releases: GoogleCloudDataproc/hadoop-connectors

2020-04-02 (GCS 2.1.2, BQ 1.1.2)

03 Apr 01:19

Changelog

Cloud Storage connector:

  1. Update all dependencies to latest versions.

BigQuery connector:

  1. Update all dependencies to latest versions.

2020-03-11 (GCS 2.1.1, BQ 1.1.1)

11 Mar 22:31

Changelog

Cloud Storage connector:

  1. Add upload cache to support high-level retries of failed uploads. The cache size is configured via the property below; a zero or negative value disables the cache, which is the default:

    fs.gs.outputstream.upload.cache.size (default: 0)
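
As a rough sketch of how the upload cache might be enabled, the snippet below renders a core-site.xml style fragment that sets the property to a positive value. The helper name `core_site_xml` and the 64 MiB value are illustrative assumptions (including the assumption that the size is expressed in bytes); neither comes from the release notes.

```python
import xml.etree.ElementTree as ET


def core_site_xml(props):
    """Render a dict of Hadoop properties as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical size: 64 MiB, assuming the property is expressed in bytes.
# Any value <= 0 (the default is 0) leaves the cache disabled.
UPLOAD_CACHE_SIZE = 64 * 1024 * 1024

fragment = core_site_xml({"fs.gs.outputstream.upload.cache.size": UPLOAD_CACHE_SIZE})
print(fragment)
```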
    

BigQuery connector:

  1. Fix shaded jar - add back missing relocated dependencies.

2020-03-09 (GCS 2.1.0, BQ 1.1.0)

10 Mar 01:25

Changelog

Cloud Storage connector:

  1. Update all dependencies to latest versions.

  2. Use storage.googleapis.com API endpoint.

  3. Fix proxy authentication when using JAVA_NET transport.

  4. Remove Log4j backend for Google Flogger.

  5. Add properties to override Google Cloud API endpoints:

    fs.gs.storage.root.url
    fs.gs.token.server.url
    
  6. Support adding custom HTTP headers to Cloud Storage API requests:

    fs.gs.storage.http.headers.<HEADER>=<VALUE> (not set by default)
    

    Example:

    fs.gs.storage.http.headers.some-custom-header=custom_value
    fs.gs.storage.http.headers.another-custom-header=another_custom_value
    
  7. Always set generation parameter for read requests and remove
    fs.gs.generation.read.consistency property.

  8. Always use URI path encoding and remove fs.gs.path.encoding property.

  9. Use Slf4j backend by default for Google Flogger.

  10. Remove list requests caching in the PerformanceCachingGoogleCloudStorage
    and fs.gs.performance.cache.list.caching.enable property.

  11. Stop caching non-existent (not found) items in performance cache.
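
The custom header properties from item 6 follow a simple naming pattern: each header name is appended to the `fs.gs.storage.http.headers.` prefix. The sketch below illustrates that mapping; the helper name `header_properties` is hypothetical and the header values are the ones used in the example above.

```python
HEADER_PROPERTY_PREFIX = "fs.gs.storage.http.headers."


def header_properties(headers):
    """Map HTTP header names to the corresponding connector property names."""
    return {HEADER_PROPERTY_PREFIX + name: value for name, value in headers.items()}


props = header_properties(
    {
        "some-custom-header": "custom_value",
        "another-custom-header": "another_custom_value",
    }
)

# Print the properties in the same name=value form used in the changelog.
for name, value in props.items():
    print(f"{name}={value}")
```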

BigQuery connector:

  1. Update all dependencies to latest versions.

  2. Use bigquery.googleapis.com API endpoint.

  3. Fix proxy authentication when using JAVA_NET transport.

  4. Remove Log4j backend for Google Flogger.

  5. Add properties to override Google Cloud API endpoints:

    mapred.bq.bigquery.root.url
    mapred.bq.token.server.url
    
  6. Use Slf4j backend by default for Google Flogger.

2020-02-13 (GCS 2.0.1, BQ 1.0.1)

15 Feb 03:53

Changelog

Cloud Storage connector:

  1. Cooperative Locking FSCK tool: fix recovery of operations that failed before creating an operation log file.

  2. Change Gson dependency scope from provided to compile in gcsio library.

BigQuery connector:

  1. Fix shaded jar - add back missing relocated dependencies.

2019-08-23 (GCS 2.0.0, BQ 1.0.0)

24 Aug 00:37

Changelog

Cloud Storage connector:

  1. Remove Hadoop 1.x support.

  2. Do not convert path to directory path for inferred implicit directories.

  3. Do not parallelize GCS list requests, because doing so leads to excessively high QPS.

  4. Fix bug where GCS connector lists all files in a directory instead of honoring the specified limit.

  5. Eagerly initialize GoogleCloudStorageReadChannel metadata if fs.gs.inputstream.fast.fail.on.not.found.enable is set to true.

  6. Add support for Hadoop Delegation Tokens (based on HADOOP-14556). Configurable via fs.gs.delegation.token.binding property.

  7. Remove obsolete fs.gs.file.size.limit.250gb property.

  8. Repair implicit directories during delete and rename operations instead of list and glob operations.

  9. Log HTTP 429 Too Many Requests responses from GCS at a rate of 1 per 10 seconds.

  10. Remove obsolete fs.gs.create.marker.files.enable property.

  11. Remove system bucket feature and related properties:

    fs.gs.system.bucket
    fs.gs.system.bucket.create
    
  12. Remove obsolete fs.gs.performance.cache.dir.metadata.prefetch.limit property.

  13. Add a property to parallelize GCS requests in getFileStatus and listStatus methods to reduce latency:

    fs.gs.status.parallel.enable (default: false)
    

    Setting this property to true causes the GCS connector to send more GCS requests, which decreases latency but also increases the cost of getFileStatus and listStatus method calls.

  14. Add a property to enable GCS direct upload:

    fs.gs.outputstream.direct.upload.enable (default: false)
    
  15. Update all dependencies to latest versions.

  16. Support Cooperative Locking for directory operations:

    fs.gs.cooperative.locking.enable (default: false)
    fs.gs.cooperative.locking.expiration.timeout.ms (default: 120,000)
    fs.gs.cooperative.locking.max.concurrent.operations (default: 20)
    
  17. Add FSCK tool to recover failed directory operations that use Cooperative Locking:

    hadoop jar /usr/lib/hadoop/lib/gcs-connector.jar \
        com.google.cloud.hadoop.fs.gcs.CoopLockFsck \
        --{check,rollBack,rollForward} gs://<bucket_name> [all|<operation_id>]
    
  18. Implement Hadoop File System append method using GCS compose API.

  19. Disable support for reading GZIP encoded files (HTTP header Content-Encoding: gzip) because processing of GZIP encoded files is inefficient and error-prone in Hadoop and Spark.

    This feature is configurable with the property:

    fs.gs.inputstream.support.gzip.encoding.enable (default: false)
    
  20. Remove parent directory timestamp update feature and related properties:

    fs.gs.parent.timestamp.update.enable
    fs.gs.parent.timestamp.update.substrings.excludes
    fs.gs.parent.timestamp.update.substrings.includes
    

    This feature was enabled by default only for job history files, but it is no longer necessary for Job History Server to work properly after MAPREDUCE-7101.
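
The FSCK invocation from item 17 can be assembled programmatically, for example when wiring it into an operations script. The following is a minimal sketch that only builds the argument list shown in the changelog; the helper name `fsck_argv` and the bucket name `my-bucket` are hypothetical, and the jar path is the one from the command above.

```python
import shlex

FSCK_CLASS = "com.google.cloud.hadoop.fs.gcs.CoopLockFsck"
FSCK_COMMANDS = ("check", "rollBack", "rollForward")


def fsck_argv(command, bucket, operation=None,
              jar="/usr/lib/hadoop/lib/gcs-connector.jar"):
    """Build the argument list for the Cooperative Locking FSCK tool."""
    if command not in FSCK_COMMANDS:
        raise ValueError(f"unknown FSCK command: {command}")
    argv = ["hadoop", "jar", jar, FSCK_CLASS, f"--{command}", f"gs://{bucket}"]
    # The trailing argument is optional: "all" or a specific operation ID.
    if operation is not None:
        argv.append(operation)
    return argv


# Hypothetical bucket name; prints a shell-quoted command line.
print(shlex.join(fsck_argv("rollForward", "my-bucket", "all")))
```

The resulting list can be passed to `subprocess.run` on a machine where the `hadoop` command and the connector jar are available.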

BigQuery connector:

  1. Remove Hadoop 1.x support.

  2. Remove deprecated features and associated properties:

    mapred.bq.input.query
    mapred.bq.query.results.table.delete
    mapred.bq.input.sharded.export.enable
    
  3. Remove obsolete mapred.bq.output.async.write.enabled property.

  4. Support nested record type in field schema in BigQuery connector.

  5. Remove dependency on GCS connector code.

  6. Add a property to specify the BigQuery table partitioning definition:

    mapred.bq.output.table.partitioning
    
  7. Add a new DirectBigQueryInputFormat for processing data through BigQuery Storage API.

    This input format is configurable via properties:

    mapred.bq.input.sql.filter
    mapred.bq.input.selected.fields
    mapred.bq.input.skew.limit
    
  8. Update all dependencies to latest versions.

  9. Add a property to control the maximum number of attempts when polling for the next file. By default, the number of attempts is unlimited (a value of -1):

    mapred.bq.dynamic.file.list.record.reader.poll.max.attempts (default: -1)
    
  10. Add a property to specify output table create disposition:

    mapred.bq.output.table.createdisposition (default: CREATE_IF_NEEDED)
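
To illustrate the semantics of the polling limit in item 9, here is a minimal sketch of a loop in which a negative limit means "unlimited". It is not the connector's implementation: the function name `poll_for_next_file`, the callback, the file name, and the choice to return None once attempts run out are all assumptions made for illustration, and a real poller would normally wait between attempts.

```python
def poll_for_next_file(find_next_file, max_attempts=-1):
    """Call find_next_file() until it returns a file name or attempts run out.

    A negative max_attempts means "unlimited", mirroring the -1 default.
    Returning None on exhaustion is an assumption of this sketch.
    """
    attempts = 0
    while max_attempts < 0 or attempts < max_attempts:
        attempts += 1
        found = find_next_file()
        if found is not None:
            return found
    return None


# Hypothetical file source: nothing on the first two polls, then one file.
responses = iter([None, None, "shard-0.json"])
print(poll_for_next_file(lambda: next(responses), max_attempts=5))
```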
    

2019-07-01 (GCS 2.0.0-RC2, BQ 1.0.0-RC2)

01 Jul 20:27
v2.0.0-RC2

Release GCS connector 2.0.0-RC2 and BQ connector 1.0.0-RC2.

2019-06-28 (GCS 2.0.0-RC1, BQ 1.0.0-RC1)

28 Jun 20:28
v2.0.0-RC1

Release GCS connector 2.0.0-RC1 and BQ connector 1.0.0-RC1.

2019-05-15 (GCS 1.9.17, BQ 0.13.17)

16 May 00:49

Changelog

Cloud Storage connector:

  1. Add a property to parallelize GCS requests in getFileStatus and listStatus methods to reduce latency:

    fs.gs.status.parallel.enable (default: false)
    

    Setting this property to true causes the GCS connector to send more GCS requests, which decreases latency but also increases the cost of getFileStatus and listStatus method calls.
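
For Spark jobs, Hadoop properties such as this one are commonly passed with the `spark.hadoop.` prefix, which Spark copies into the Hadoop configuration. The sketch below builds the corresponding `--conf` arguments; the helper name `spark_hadoop_conf_args` is hypothetical, and whether enabling the property is worthwhile depends on the latency and cost trade-off described above.

```python
def spark_hadoop_conf_args(props):
    """Turn Hadoop properties into spark-submit --conf arguments."""
    args = []
    for name, value in props.items():
        args += ["--conf", f"spark.hadoop.{name}={value}"]
    return args


args = spark_hadoop_conf_args({"fs.gs.status.parallel.enable": "true"})
print(" ".join(args))
```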

BigQuery connector:

  1. POM updates for GCS connector 1.9.17.

  2. Support nested record type in field schema in BigQuery connector.

  3. Add a property to specify the BigQuery table partitioning definition:

    mapred.bq.output.table.partitioning
    

2019-02-25 (GCS 1.9.16, BQ 0.13.16)

25 Feb 23:24

Changelog

Cloud Storage connector:

  1. Fix bug where GCS connector lists all files in a directory instead of honoring the specified limit.
  2. Eagerly initialize GoogleCloudStorageReadChannel metadata if fs.gs.inputstream.fast.fail.on.not.found.enable is set to true.

BigQuery connector:

  1. POM updates for GCS connector 1.9.16.

2019-02-21 (GCS 1.9.15, BQ 0.13.15)

21 Feb 19:23

Changelog

Cloud Storage connector:

  1. Do not convert path to directory path for inferred implicit directories.
  2. Do not parallelize GCS list requests, because doing so leads to excessively high QPS.

BigQuery connector:

  1. POM updates for GCS connector 1.9.15.