07 Oct 08:56

vsethi09

2427a05

CDAP 6.7.2

Enhancements
CDAP-19601: For new Dataproc compute profiles, changed the default value of Master Machine Type and Worker Machine Type from n2 to e2.

Bug Fixes
CDAP-19532: Fixed an issue in the Database Batch Source plugin that caused pipelines to fail during runtime when there was a column with precision of 0 in the source returned by JDBC. Now, if a column has a precision of 0, the pipeline no longer fails. This affected CDAP 6.7.1 only. Note: In the Database Batch Source, if a column has precision 0, you must change the data type to Double in the Output Schema to ensure the pipeline runs successfully.

PLUGIN-1373: In the BigQuery Sink plugin (version 0.20.3), fixed an issue that sometimes caused a NullPointerException error when trying to update table metrics.

PLUGIN-1367: In the BigQuery Sink plugin (version 0.20.3), fixed an issue that caused a NullPointerException error when the output schema was not defined.

PLUGIN-1361: In the Send Email batch pipeline alert, fixed an issue where emails failed to send when the Protocol was set to TLS.


20 Aug 01:02

sechegaray

v6.7.1

0704655

CDAP 6.7.1

Enhancements
CDAP-19050: Enhanced the Dataproc provisioner to avoid making unneeded Compute Engine calls depending on the configuration settings.

CDAP-18336: For new Dataproc compute profiles, changed the default value of Master Machine Type from n1 to n2.

Bug Fixes
CDAP-19381: Fixed an issue in CDAP that created duplicate entries in the file cache map, which resulted in multiple attempts to delete the same cache file.

CDAP-19379:

Fixed an issue where the Log service left empty folders, which made the mounting of Persistent Disk slow. This caused the Log service to fail to start in a timely manner.

Fixed an issue that caused pipelines to take a long time to launch or get stuck. This was linked to I/O throttling that occurred on the underlying Persistent Disk.

CDAP-19366: Fixed an issue that caused pipelines to fail when two or more pipelines were scheduled to start simultaneously on a static Dataproc cluster. This was due to a file upload race condition.

CDAP-19353: Fixed an issue in flow control that, in rare scenarios, caused App Fabric to return a 5xx error code instead of 429 (Too Many Requests) when the number of concurrently launching or running pipelines was above certain thresholds.

CDAP-19276: Fixed an issue that resulted in an error when a compute profile was exported from the default namespace after switching from a custom namespace.

CDAP-19216: Fixed an issue that resulted in the following UI error when you started a pipeline multiple times and then stopped it before it completed: Program is not running.

CDAP-19211: Removed verbose logs from the BigQuery client libraries in pipeline logs.

PLUGIN-1256: Fixed an issue that caused the BigQuery Execute action plugin configured with an Encryption Key Name (CMEK) to fail when the SQL query contained DDL Statements.

PLUGIN-954: In the BigQuery Execute action plugin, added a property Store Results in a BigQuery Table in the UI, which hides the destination table related properties by default.


20 Aug 01:05

sechegaray

v6.7.0

b26eeb5

CDAP 6.7.0

New Features
General
Added support for mounting arbitrary volumes to CDAP system services in the CDAP operator.

Performance and Scalability
CDAP-19016: Increase pipeline run scalability.

CDAP-18837: Use system pods to enable horizontal scaling of pipeline launching. For more information, see System Workers.

Plugins
Google Dataplex Batch Source and Google Dataplex Sink system plugins are available in Preview.

Transformation Pushdown
Transformation Pushdown for joins is generally available (GA).

In Transformation Pushdown, Group By aggregation and Deduplicate aggregation are available in Preview.

CDAP-18437: Transformation Pushdown supports the BigQuery Storage Read API to improve performance when extracting data from BigQuery.

PLUGIN-1001: Added support for connections to Transformation Pushdown.

Wrangler
Added support to parse files before loading data into the Wrangler workspace. This means the recipe does not include parse directives. Now, when you create a pipeline from Wrangler, the source has the correct Format property.

Added support to allow users to import the schema for formats such as JSON and some AVRO files where schema inference is not possible before loading data into the Wrangler workspace.

Enhancements
PLUGIN-1245: In the Joiner transformation, renamed the Distribution Skewed Input Stage property to Skewed Input Stage. Changed UI label only.

PLUGIN-1118: In Google Cloud File Reader batch source and Amazon S3 batch source plugins, added the Enable Quoted Values property, which lets you treat content between quotes as a value.

PLUGIN-1107: In the Google Cloud Data Loss Prevention (DLP) Decrypt Transformation and Google Cloud Data Loss Prevention (DLP) Redact Transformation, added the Resource Location property, which lets you specify the resource location for the DLP Service. For more information, see Specifying processing locations in the Google Cloud Data Loss Prevention documentation.

PLUGIN-1004, CDAP-18386: Improved connection management to allow users to edit connections. Removed option to view connections.

PLUGIN-984: Added support for connections to the following plugins:

CloudSQL MySQL batch source

CloudSQL MySQL sink

CloudSQL PostgreSQL batch source

CloudSQL PostgreSQL sink

PLUGIN-968: Added support for connections in the following sinks:

PLUGIN-965: In the GCS Done File Marker post-action plugin, added the Location property, which lets you have buckets and customer-managed encryption keys in locations that are not US locations.

PLUGIN-926, PLUGIN-939: In the BigQuery Execution Action plugin and the BigQuery Argument Setter action plugin, added support for the Dataset Project ID property, which is the Project ID of the dataset that stores the query results. It's required if the dataset is in a different project than the BigQuery job.

PLUGIN-731: In BigQuery sinks, added support for BigNumeric data type.

PLUGIN-670: In the BigQuery Table Batch Source, added the ability to query any temporary table in any project when you set the Enable querying views property to Yes. Previously, you could only query views.

PLUGIN-650: In Google Data Loss Prevention plugins, added support for templates from other projects.

CDAP-18982: Added a new pipeline state for when you manually stop a pipeline run: Stopping.

CDAP-18778: In the BigQuery Execute action plugin, added the ability to look up the drive scope for the service account to read from external tables created from the drive.

CDAP-18713: Added support for setting up workload identity in separate k8s namespaces.

CDAP-18655: Improved generic Database source plugin to correctly read decimal data.

CDAP-18556: Improved Google Cloud Platform plugins to validate the Encryption Key Name property.

CDAP-18456: In the replication configurations, added the ability to enable soft deletes from a BigQuery target.

CDAP-18405: Improved connection management to allow users to browse partial hierarchies like BigQuery datasets and Dataplex zones.

CDAP-18318: Permission checks are now required for updating/viewing system service information.

CDAP-17955: Replication assessment warnings no longer block draft deployment.

CDAP-16035: In Wrangler, added support for nested arrays, such as the BigQuery STRUCT data type.

In the Amazon S3 connection and Amazon S3 batch source plugins, added Session Token property.

In the Google Cloud Storage File Reader batch source plugin, added the Allow Empty Input property.

In the Joiner transformation, added the Input with Larger Data Skew property.

In the Google Cloud Storage File Reader batch source plugin, Amazon S3 batch source plugin, and File batch source plugin, changed the Skip Header property name to Use First Row as Header.

Behavior Changes
CDAP-18990: In the Pipeline Studio, if you click Stop on a running pipeline and the pipeline does not stop within 6 hours, the pipeline is forcefully terminated.

CDAP-18918: In the Deduplicate Analytics plugin, limited the Filter Operation property to one record. If this property is not set, one random record is chosen from the group of ‘duplicate’ records.

PLUGIN-795: The BigQuery sink supports Nullable Arrays. A NULL array is converted to an empty array at insertion time.
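
As a rough illustration of the PLUGIN-795 behavior, the sketch below replaces NULL array fields with empty arrays before a record is written. The function name, the record shape (a plain dict), and the set of array field names are assumptions made for this example; this is not the BigQuery sink's actual code.

```python
def null_arrays_to_empty(record, array_fields):
    """Return a copy of record where NULL (None) array fields become [].

    record is a hypothetical dict-shaped row; array_fields names the
    fields whose schema type is a nullable array.
    """
    return {
        name: [] if name in array_fields and value is None else value
        for name, value in record.items()
    }


row = {"id": 1, "tags": None, "note": None}
print(null_arrays_to_empty(row, {"tags"}))
```

Only fields listed as arrays are touched, so a NULL in a non-array field such as `note` is left as NULL.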

Wrangler no longer infers all values in CSV files as Strings. Instead, it maps the columns to a corresponding data type.

Bug Fixes
[PLUGIN-1210](https://c...


24 Feb 02:01

seanzhougoogle

v6.6.0

03a197a

CDAP 6.6.0

New Features
CDAP-18653: Added one-click autoscaling for Dataproc compute profiles.

Enhancements
PLUGIN-994: Added support for Fetch Size to the following plugins:

CloudSQL MySQL batch source

CloudSQL PostgreSQL batch source

PostgreSQL batch source

SQL Server batch source

Teradata batch source

CDAP-18738: Dataproc Cluster Reuse. Runtime property system.profile.properties.clusterReuseEnabled is no longer required to enable cluster reuse. Default Max Idle Time is set to 30 minutes to prevent accidental cluster leak.

CDAP-18725: Added more details for pipeline success and failure metrics.

CDAP-18712: Added ability to limit published lineage messages to a configurable size to avoid out of memory errors due to large lineages.

CDAP-18651: Preview runners no longer perform any kind of access enforcement.

CDAP-18647: Added new limit of 5000 records for Previewing data in the Pipeline Studio.

CDAP-18621: Added new default value of 30 minutes for the Dataproc profile Max Idle Time property. Previously, Max Idle Time had no default value.

CDAP-18836: Added temporary namespace UPDATE enforcement for pipeline connections.

CDAP-18798: Added system.program.starting.delay.seconds metric to measure time taken by program to transition from provisioning to running state.

CDAP-18714: Added metrics for API call latency.

CDAP-18725: Added new tags (Provisioner, Cluster Status, Existing Status) to existing program failure/success metric.

CDAP-17772: Added authn/z between internal system services via token verification.

Instance Stability and Memory Usage
CDAP-18696: Added new Applications parameter (app.max.concurrent.launching) to cdap-default.xml to control back pressure on pipeline starting requests. Requests exceeding the limit will fail with 429 (Too Many Requests) status.
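
Because start requests above the limit now fail with 429, a client that launches many pipelines may want to retry with a delay. The sketch below is one possible client-side approach, not part of CDAP: `start_pipeline` is a hypothetical callable that issues the start request and returns the HTTP status code, and the attempt count and delays are arbitrary example values.

```python
import time


def launch_with_backoff(start_pipeline, max_attempts=5, base_delay=1.0,
                        sleep=time.sleep):
    """Call start_pipeline() until it returns a status other than 429.

    start_pipeline is a hypothetical callable returning an HTTP status code.
    The delay doubles after each 429 (1s, 2s, 4s, ... by default).
    Returns the last status seen, which is 429 if every attempt was throttled.
    """
    for attempt in range(max_attempts):
        status = start_pipeline()
        if status != 429:
            return status
        # Do not sleep after the final attempt; just report the 429.
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return 429
```

The `sleep` parameter exists only so the delay can be replaced in tests or scripts that should not block.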

CDAP-18712: Added new Metadata parameter (metadata.messaging.publish.size.limit) to cdap-default.xml to limit the size of published lineage messages to avoid out of memory errors due to large lineages.

CDAP-18672: Added new Dataset parameter (data.storage.sql.scan.size.rows) to cdap-default.xml to set the number of rows fetched for database reads from PostgreSQL.

CDAP-18559, CDAP-17986: Added retries to Dataproc API calls to ensure transient errors don’t affect cluster provisioning.

CDAP-18594, CDAP-18810: Fixed an issue where a pipeline could not be deleted because the program state was not updated after retries.

CDAP-18857: Added new Applications parameter (app.artifact.parallelism.max) to cdap-default.xml that limits artifact repository initialization parallelism to prevent Out of Memory errors on App Fabric startup.

CDAP-18848: Reduced the default of the Metrics parameter (metrics.processor.queue.size) from 20000 to 1000 to prevent Out of Memory errors during metric processing.

CDAP-18791, CDAP-18627, CDAP-18553: Improved LevelDB performance and memory usage.

CDAP-18748, CDAP-18737, CDAP-18685, CDAP-18680: Improved running pipelines handling during App Fabric restarts.

CDAP-18656: Prevented App Fabric Out Of Memory error when it’s asked to retrieve a long list of pipelines within a namespace.

CDAP-18603: Added pagination to application list API.

CDAP-18586: Prevented an App Fabric Out Of Memory error when the system argument list is too long.

Bug Fixes
PLUGIN-1035: Fixed an issue that caused pipelines to fail when a Database batch source included a decimal column with precision greater than 19.

PLUGIN-1022: Fixed an issue that caused pipelines with a Conditional plugin and running on MapReduce to fail.

PLUGIN-1015: Fixed an issue that caused pipelines with a Conditional plugin and running on Spark to fail.

PLUGIN-974: Fixed an issue that caused validation to fail for GCS Multi File sinks.

Behavior Changes
CDAP-18586: The getApplicationSpecification() method in the interface io.cdap.cdap.api.schedule.ProgramStatusTriggerInfo has been removed in CDAP 6.6.0, which can break your build if you use this method.


02 Nov 20:55

greeshmaswaminathan

v6.5.1

6bb9292

CDAP 6.5.1

Enhancements

PLUGIN-883, PLUGIN-897: Added Encryption Key Name property to the following plugins so users can encrypt any new resources created by these plugins with Customer Managed Encryption Keys (CMEK):

BigQuery Execute action
GCS Copy action
GCS Create action
GCS Move action
GCS Done File Marker Pipeline Alert
BigQuery Batch source
BigQuery Multi Table sink
BigQuery Table Sink
Google Cloud Storage sink
Google Cloud Storage Multi File sink
Google Cloud PubSub sink
Google Cloud Spanner sink
Transformation Pushdown to BigQuery

PLUGIN-898: Added Location property to GCS Copy and GCS Move action plugins to auto-create destination buckets if they do not exist before running the pipeline. Previously, the bucket had to exist before running the pipeline.

CDAP-18566: The File connection now browses the file system. For example, on a Hadoop cluster, the File connection now browses the HDFS file system. For CDAP Sandbox, the File connection still browses the local file system.

CDAP-18532: Added the following optional cdap-site.xml configs:

If router.block.request.enabled is set to true, the request router responds to every user request with a configured response, which blocks all user requests.

If a status code is provided in router.block.request.status.code, the server responds with that status code. The default value is 503.

If a response message is provided in router.block.request.message, the server responds with that response body; otherwise, the response body is empty.
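
The interaction of the three CDAP-18532 properties can be sketched as a small decision function. This is only an illustration of the rules as stated in these notes, with the configuration modeled as a plain dict; the function name is made up for the example and this is not the router's implementation.

```python
def blocked_response(conf):
    """Return (status_code, body) if user requests are blocked, else None.

    conf is a dict standing in for cdap-site.xml; values may be strings,
    as they would be when read from an XML file.
    """
    enabled = str(conf.get("router.block.request.enabled", "false")).lower()
    if enabled != "true":
        return None  # blocking is off, requests are routed normally
    # Default status code is 503 when none is configured.
    status = int(conf.get("router.block.request.status.code", 503))
    # Default body is empty when no message is configured.
    body = conf.get("router.block.request.message", "")
    return status, body


print(blocked_response({"router.block.request.enabled": "true"}))
```

With only the enabled flag set, the sketch falls back to a 503 status and an empty body, matching the defaults described above.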

CDAP-18384: Added metrics for authorization in CDAP.

Bug Fixes

CDAP-18571: Fixed an issue where messages couldn’t be retrieved for Kafka topics. This broke in 6.5.0 and is now fixed in 6.5.1.

CDAP-18538, CDAP-184254: Fixed an issue where you couldn’t create a profile for an existing Dataproc cluster.

CDAP-18529: Fixed an issue that caused pipelines to fail when Transformation Pushdown was enabled and used macros as properties.

CDAP-18446: Fixed an issue that caused long running programs, like Replication, to fail within the default Hadoop delegation token timeout. Now, these tokens get renewed so that the job keeps running.

CDAP-18439: Fixed an issue in Replication that caused the Configure button to result in an error when you clicked it.

CDAP-18428: Fixed an issue that caused pipelines to fail with an Access Denied error when the pipeline had BigQuery plugins or Transformation Pushdown configuration that included a Dataset Project ID that was in a different project than the specified Project ID:

BigQuery sources
BigQuery sinks
BigQuery Multi Table sinks
Transformation Pushdown

The Access Denied error was due to missing permissions on the service account.

To ensure pipelines with BigQuery or BigQuery Multi Table sinks and pipelines with Transformation Pushdown enabled run successfully, assign the following roles to the Project ID service account:

BigQuery Job User role to run jobs
GCE Storage Bucket Admin role to create a temporary bucket

If the dataset is not in the same project that the BigQuery job will run in, the Dataset Project ID service account must be granted the following role to write data to a BigQuery dataset or table:

BigQuery Data Editor role

CDAP-18423: Fixed an issue in the GCS connection that prevented browsing and parsing files stored in folders under buckets.

CDAP-18335: Fixed an issue where the UI was unusable until an error displayed in the UI was closed by clicking the x icon.

CDAP-18318: Fixed an issue where users did not need permission to restart system services, reset system service log levels, get system service statuses, etc. Now, if authorization is enabled on the cluster, users will need to have the corresponding permissions for these system services in order to access them.

CDAP-18249: Fixed an issue where the Upload window didn’t close after uploading a user-defined directive due to missing properties in the user-defined directive json.

PLUGIN-899: Fixed an issue that caused custom formats to be unusable in the GCS source and sink.


31 Aug 19:13

greeshmaswaminathan

v6.5.0

2d945b3

CDAP 6.5.0

New Features

Connections

CDAP-17870: Added global connections for sources in Wrangler and data pipelines. For more information, see Managing Connections. Also added new endpoints for connections to the Pipeline Microservices.

CDAP-17924: Redesigned the Namespace Admin page.

Dataproc

CDAP-17999: Added support for labels in the Dataproc provisioner.

CDAP-17862: Added Shielded VMs as configuration settings for the Dataproc provisioner. For more information, see Google Dataproc.

CDAP-18004: Added support for running worker pods using different Kubernetes service accounts.

Namespaces

CDAP-17731: Added support to show current namespace name in the footer.

CDAP-17877, CDAP-17876: Added Connections and Drivers to Namespace Admin page for centralized management of all connections and Drivers. For more information, see JDBC Drivers and Managing Connections.

Spark 3

CDAP-17693: Added Spark 3 support for Standalone CDAP, CDAP Sandbox, and Previewing data.

CDAP-17930: Set Dataproc image version 2.0 as the default for new and upgraded pipelines. For more information, see “Upgrade Notes for Spark 3” below.

Transformation Pushdown

CDAP-17863: Added support for Transformation pushdown into BigQuery for Joiner transformations. For more information, see Using Transformation pushdown.

Improvements

CDAP-17730: Added authorization checks for preferences, logging, compute profiles, and metadata endpoints.

CDAP-17915: Added support to search for tables based on schema name when you select tables for a Replication job.

CDAP-17946: Improved error messages on the Pipeline List page.

CDAP-17973: Improved Wrangler error messages.

CDAP-18024: Added support for running CDAP as a non-root user.

CDAP-18039: Added additional trace logging in the authorization flow for debugging.

CDAP-18146: Pods created by CDAP now inherit their ImagePullPolicy from the pod which created them.

CDAP-18194: Added support for BIGNUMERIC data type for BigQuery target in replication.

PLUGIN-764: Added support for Datetime data type for SQL Server batch source plugins.

PLUGIN-645: Added support for Datetime data type for Replication jobs.

Behavior Changes

CDAP-18114: MySQL, Oracle, PostgreSQL, and SQL Server batch sources, sinks, actions, and pipeline alerts are now installed by default as system plugins. Previously, these plugins were available in the Hub as user plugins.

CDAP-17898: When you use a connection in Wrangler and create a data pipeline, CDAP now creates a pipeline with the source plugin and then Wrangler transformation. In previous releases, CDAP created the pipeline with just the Wrangler transformation. You had to manually add the source plugin to the pipeline and configure it.

Bug Fixes

CDAP-17895: Fixed an issue in Replication that caused jobs to fail if more than 1000 tables are selected for replication.

CDAP-17919: Fixed an issue that caused replication jobs to hang when there were too many Delete or DDL events.

CDAP-17939: Improved the Messaging Service cleanup strategy so that it uses far fewer resources and cannot go out of memory.

CDAP-17942: Fixed an issue that caused plugin validation to fail when a macro is used within a macro function. For example: ${logicalStartTime(${date_format})}
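
To make the nested case concrete, the toy resolver below evaluates the innermost macro first, so `${date_format}` is replaced before `logicalStartTime(...)` is called. It is a simplified sketch, not CDAP's macro evaluator: the `resolve` function, the `properties` and `functions` dictionaries, and the stand-in for `logicalStartTime` are all assumptions for illustration.

```python
import re

# Matches a macro whose body contains no further "${", i.e. an innermost macro.
INNERMOST = re.compile(r"\$\{([^${}]*)\}")
# Matches a macro function call of the form name(argument).
CALL = re.compile(r"^(\w+)\((.*)\)$")


def resolve(text, properties, functions):
    """Resolve macros in text, innermost first, until none remain."""
    def substitute(match):
        body = match.group(1)
        call = CALL.match(body)
        if call:
            name, argument = call.groups()
            return functions[name](argument)
        return properties[body]

    while True:
        resolved, count = INNERMOST.subn(substitute, text)
        if count == 0:
            return resolved
        text = resolved


props = {"date_format": "yyyy-MM-dd"}
# Stand-in only: the real function formats the run's logical start time.
funcs = {"logicalStartTime": lambda fmt: "start[" + fmt + "]"}
print(resolve("${logicalStartTime(${date_format})}", props, funcs))
```

The first pass turns the input into `${logicalStartTime(yyyy-MM-dd)}`, and the second pass calls the function with the already resolved argument.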

CDAP-17943: Fixed an issue that caused pipelines with aggregations and Decimal fields to fail with an exception.

CDAP-17959: Fixed an issue that caused Wrangler to ignore all columns other than the given column when parsing Excel files.

CDAP-17965: For CDAP instances running on Kubernetes, fixed an issue that prevented new previews from being scheduled after the preview manager had been stopped 10 times.

CDAP-17995: Fixed Wrangler to fail pipelines upon error. In Wrangler 6.2 and above, there was a backwards incompatible change where pipelines did not fail if there was an error and instead were marked as completed.

CDAP-18002: Fixed an issue in Replication that caused jobs to fail when restarted during snapshotting.

CDAP-18012, CDAP-18003, CDAP-17853: Improved resilience of TMS.

CDAP-18060: Fixed an issue in CDAP Sandbox that caused Get Schema to fail when the source includes the Format field.

CDAP-18131: Fixed an issue where replication to BigQuery was failing because the source table had column names that are reserved keywords in BigQuery.

PLUGIN-178: Fixed an issue while writing non-null values to a nullable field in BigQuery.

PLUGIN-635: Fixed an issue in the BigQuery plugins to correctly delete temporary GCS storage buckets.

PLUGIN-654: Fixed an issue that caused pipelines to fail when Pub/Sub source Subscription field was a macro.

PLUGIN-655: Fixed an issue in the BigQuery sink that caused failures when the input schema was not provided.

PLUGIN-669: Fixed Join Condition Type to be displayed in the Joiner for pipelines upgraded from versions before 6.4.0

PLUGIN-678: Fixed an issue in the BigQuery sink that caused pipelines to fail or give incorrect results.

PLUGIN-697: Fixed an issue that caused File Source Plugin validation to fail when there was a macro in the Format field.

Upgrade Notes for Spark 3

In CDAP 6.5.0, Spark 3 is the new default engine for Preview and for running pipelines on Dataproc. Spark 1 support was also removed from CDAP.

After an instance is upgraded to version 6.5.0, any new or upgraded pipeline that uses a Dataproc profile without an explicit image version will use the latest Dataproc image 2.0 that has Spark 3.1 bundled.

Any pipeline that was not upgraded will still use the original 1.3 Dataproc image that has Spark 2.3 bundled.

What does it mean for pipeline developers / operations?

Spark 3.1 provides a lot of improvements in different areas. See the release notes for Spark 3.0 and Spark 3.1. The main changes that affect backwards compatibility are:

Python 2 support is removed. Any PySpark code must be Python 3 compatible.

Spark 3.1 uses Scala 2.12, which is binary incompatible with Scala 2.11. Most of the code is source compatible, so recompile Scala code with Scala 2.12 if you have any issues.

What does it mean for plugin developers?

If you use any Scala code, make sure it’s binary compatible with the corresponding Scala version: 2.12 for Spark 3 and 2.11 for Spark 2 execution environment.

You can do this by referencing the proper spark2_2.11 or spark3_2.12 version of the CDAP artifact. For an example, see pull request #1364 in cdapio/hydrator-plugins, “[CDAP-17693] Introduce spark 3 for tests, drop spark 1, create spark …”. Note that you must explicitly choose the version because the artifacts without a version, which previously used Spark 1, are no longer available.

If you have any dependencies on Scala-specific artifacts (e.g. Kafka), change those as well.

The new Hadoop version used in dependencies is 2.6.0 instead of 2.3.0.
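
The Spark and Scala pairing above can be written down as a small lookup, which is handy in build scripts that need to pick the right artifact variant. This is a sketch only: the function name is hypothetical, and only the two suffixes named in these notes (spark2_2.11 and spark3_2.12) are taken from the source.

```python
# Scala version that each supported Spark major version is built against.
SCALA_FOR_SPARK = {2: "2.11", 3: "2.12"}


def artifact_suffix(spark_major):
    """Return the artifact suffix for a Spark major version, e.g. spark3_2.12."""
    if spark_major not in SCALA_FOR_SPARK:
        # Spark 1 artifacts are no longer available as of CDAP 6.5.0.
        raise ValueError(f"unsupported Spark major version: {spark_major}")
    return f"spark{spark_major}_{SCALA_FOR_SPARK[spark_major]}"


print(artifact_suffix(3))
```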

What to do in case of any problems?

Spark 2 is still fully supported by CDAP. If you use Dataproc, enter image version “1.3” in your provisioning profile and it will use exactly the same image CDAP 6.4 uses.

It’s highly recommended that you solve any problems found and migrate to the Spark 3 execution environment as it brings a number of enhancements including huge performance improvements.

Known Issues

Database connections

Although you can create connections for Database, MySQL, Oracle, PostgreSQL, and SQL Server sources, the plugin properties do not include Use Connection. This means that you cannot reference a connection in a database source plugin. However, from the Properties page in a database source plugin, you can select a connection to have CDAP populate the plugin properties with the connection properties.

To use the properties set in these connections in the corresponding batch source plugin, follow these steps:

In Pipeline Studio, add the source plugin to the canvas.

Click Properties.

Click Browse Database.
The Browse Database page appears with the available connections listed in the left panel.

Click the connection you want to use.

Locate the table you want to add to the source plugin and click it.
The source properties now include all of the properties from the connection.


27 May 02:03

rmstar

v6.4.1

2123718

CDAP 6.4.1

New Features

Replication

PLUGIN-645: BigQuery targets now support the Datetime data type.

Bug Fixes

CDAP-17943: Fixed an issue that caused pipelines with aggregations and Decimal fields to fail with an exception.

CDAP-17939: Improved the Messaging Service cleanup strategy so that it uses far fewer resources and cannot go out of memory.

CDAP-18012, CDAP-18003, CDAP-17853: Improved resilience of TMS.

PLUGIN-669: Fixed Join Condition Type to be displayed in the Joiner for pipelines upgraded from versions prior to 6.4.0

CDAP-17965: For CDAP instances running on Kubernetes, fixed an issue that prevented Previews from starting once a user has stopped a Preview run 10 times.

PLUGIN-178: Fixed an issue while writing non-null values to a nullable field in BigQuery.

PLUGIN-635: Fixed an issue in the BigQuery plugins to correctly delete temporary GCS buckets.

PLUGIN-655: Fixed an issue in the BigQuery sink that caused failures when the input schema was not provided.

PLUGIN-678: Fixed an issue in the BigQuery sink that caused pipelines to fail or give incorrect results.

PLUGIN-654: Fixed an issue that caused pipelines to fail when Pub/Sub source Subscription field was a macro.


25 Mar 01:39

rmstar

v6.4.0

1da9373

CDAP 6.4.0

New Features

Datetime Data Type

PLUGIN-615, PLUGIN-614: Added Datetime data type support to the following plugins:

BigQuery batch source
BigQuery sink
BigQuery Multi Table sink
Bigtable batch source
Bigtable sink
Datastore batch source
Datastore sink
GCS File batch source
GCS File sink
GCS Multi File sink
Spanner batch source
Spanner sink
File source
File sink
Wrangler
Amazon S3 batch source
Amazon S3 sink
Database source

Also, BigQuery datetime type can be directly mapped to CDAP datetime data type.

CDAP-17684, CDAP-17636: Added support for DateTime data type in Wrangler. You can now select Parse > Datetime to transform columns of strings to datetime values and Format > Datetime to change the date and time pattern of a column of datetime values.

Added new Wrangler directives that you can use in Power Mode to transform columns of strings to datetime values: Parse as Datetime, Current Datetime, Datetime to Timestamp, Format Datetime, Timestamp to Datetime

CDAP-17620: Added support for the Datetime logical data type in CDAP schema

Dataproc

CDAP-17622: Added machine type, cluster properties, and idle TTL as configuration settings for the dataproc provisioner. For more information, see Google Dataproc.

Security

CDAP-17709: Added support for PROXY authentication mode to nodejs proxy. CDAP UI now supports both MANAGED and PROXY modes of authentication. For more information, see Configuring Proxy Authentication Mode.

Pipeline Studio

CDAP-17549: Added support for data pipeline comments. For more information, see Adding comments to a data pipeline.

Plugin OAUTH Support

CDAP-17611: Updated Salesforce plugins to use the new OAuth macro function

CDAP-17610: Implemented a new macro function for OAuth token exchange

CDAP-17609: Implemented new HTTP endpoints for OAuth management

Replication

CDAP-17674: Added support for a runtime argument, retain.staging.table, that retains the BigQuery staging table to help debug issues

CDAP-17595: Added upgrade support for replication jobs

CDAP-17471: Added the ability to duplicate, export, and import replication jobs

CDAP-17337: Added property to configure dataset name in the BigQuery replication target. By default, the dataset name is the same as the Replication source database name. For more information, see Google BigQuery Target.

CDAP-16755: Added ability to add the runtime argument "event.queue.capacity" to specify the capacity of the event queue in bytes for Replication jobs. If the target plugin consumes events more slowly than the source plugin emits them, events can accumulate in the queue and occupy memory. With this argument, you can control the maximum amount of memory that the event queue can use.
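
The idea behind a byte-based queue capacity can be sketched as follows. This is a conceptual illustration under stated assumptions, not the Replication implementation: the class name is made up, events are modeled as raw bytes, and a full queue simply rejects the event, whereas a real source would more likely wait until space is available.

```python
from collections import deque


class ByteBoundedQueue:
    """FIFO queue that caps the total size of queued events in bytes."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._events = deque()

    def offer(self, event: bytes) -> bool:
        """Enqueue event if it fits in the remaining capacity."""
        if self.used_bytes + len(event) > self.capacity_bytes:
            return False  # queue is full; the producer has to back off
        self._events.append(event)
        self.used_bytes += len(event)
        return True

    def poll(self):
        """Dequeue the oldest event, or return None if the queue is empty."""
        if not self._events:
            return None
        event = self._events.popleft()
        self.used_bytes -= len(event)
        return event
```

Capping by bytes rather than by event count keeps memory use predictable even when individual change events vary widely in size.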

Kubernetes

CDAP-17618: Replaced Zookeeper for K8S CDAP setup with K8S secrets. For more information, see Prepare the secret token for authentication service.

CDAP-17466: Added Authentication functionality for CDAP on Kubernetes setup. For more information, see Installation on Kubernetes.

Joiner Analytics Plugin

CDAP-17607: Added advanced join conditions to the joiner plugin. This allows users to specify an arbitrary SQL condition to join on. These types of joins are typically much more costly to perform than basic join on equality. For more information, see Join Condition Type.

New System Plugins for Data Pipelines

PLUGIN-558: Added new post-action plugin, GCS Done File Marker. This post-action plugin marks the end of a pipeline run by creating and storing an empty DONE (or SUCCESS) file in the given GCS bucket upon a pipeline completion, success, or failure so that you can use it to orchestrate downstream/dependent processes.

Improvements

PLUGIN-601: Added a metric for bytes read from database source

PLUGIN-571: Added support to filter tables in the Multiple Database Tables Batch Source

PLUGIN-570: Improved error handling for Multiple Database Batch Sources and BigQuery multi-table sink that enables the pipelines to continue if one or more tables fail

CDAP-17724: Renamed replication pipelines to jobs

CDAP-17721: Added support for Kerberos login in K8s environment

CDAP-17675: Renamed Delete button to Remove in Replication Assessment report

CDAP-17670: Improved plugin initialization performance

CDAP-17650: Added tag with parent artifact detail to Dataproc cluster created by CDAP

CDAP-17645: Set a timeout on the SSH connection so that the pipeline run fails when the cluster becomes unreachable

CDAP-17642: Added namespace count to Dataplane metrics

CDAP-17621: Added the Customer-Managed Encryption Key (CMEK) configuration property for replication BigQuery target. For more information, see Google BigQuery Replication Target.

CDAP-17613: Improved Replication Assessment page to highlight SQL Server tables with Schema issues in red

CDAP-17603: Added ability to jump to any step when modifying the Replication draft

CDAP-17601: Improved performance by loading data directly into the target table during replication snapshot process

CDAP-17597: Added poll metrics in Overview and Monitoring in Replication detail view

CDAP-17583: Improved Performance for Replication

CDAP-17582: Added ability to pass additional properties for Debezium and jdbc drivers for replication sources

CDAP-17482: Added ability to start Replication app from a last known checkpoint.

CDAP-17474: Added support for configuring elasticsearch TLS connection to trust all certs. For more information, see Elasticsearch.

CDAP-17414: Improved Replication Table selection user experience

CDAP-17289: Improved reliability of Pub/Sub Source plugin

CDAP-17248: Added File Encoding property to Amazon S3, File and GCS File Reader batch source plugins

CDAP-17114: Removed the record view in pipeline preview for the Joiner node because it was misleading

CDAP-16548: Renamed the Staging Bucket Location property to Location in the BigQuery Target properties page. For more information, see Google BigQuery Target.

CDAP-16623: Removed multiple ways to collapse/expand the Connection menu

CDAP-16008: Added support for Kerberos Hadoop cluster in the Remote Hadoop Provisioner

CDAP-15552: Fixed Wrangler to highlight new column generated by a directive

Behavior Changes

CDAP-16180: Resolved macro to preferences during pipeline validation

In previous releases, when you validated a plugin, macros were not being resolved with preferences.

In CDAP 6.4.0, when you validate a plugin, macros now get resolved with preferences.
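To illustrate the behavior change, the following Python sketch resolves `${name}` macros in plugin properties from a map of preferences. The function name `resolve_macros` and the sample property and preference keys are hypothetical, and the sketch deliberately ignores nested macros and macro functions. It is an illustration of the idea, not CDAP's implementation.

```python
import re

# Matches a simple ${name} macro; nested macros and macro functions are out of scope.
MACRO = re.compile(r"\$\{([^${}]+)\}")


def resolve_macros(properties, preferences):
    """Return a copy of properties with ${name} macros replaced from preferences.

    Macros that have no matching preference are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        return preferences.get(name, match.group(0))

    return {key: MACRO.sub(substitute, value) for key, value in properties.items()}


# Hypothetical plugin properties and preferences, for illustration only.
resolved = resolve_macros(
    {"path": "gs://${bucket}/input", "format": "${fmt}"},
    {"bucket": "example-bucket"},
)
print(resolved)
```

In this sketch the `path` property is resolved because a `bucket` preference exists, while `format` keeps its macro because no `fmt` preference is set.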

PLUGIN-470: Removed Multi sink runtime argument requirements, allowing users to add simple transformations in multi-source/multi-sink pipelines.

In previous releases, multi-sink plugins required the pipeline to set a runtime argument for each table, with the schema for each table.

In CDAP 6.4.0, CDAP determines the schema dynamically at runtime instead of requiring arguments to be set.

Bug Fixes

PLUGIN-610: Fixed Bigtable Batch Source plugin

PLUGIN-606: FTP batch source now works with empty File System Properties. See “Deprecations” below.

PLUGIN-545: Added support for strings in Min/Max aggregate functions (used in both Group By and Pivot plugins)

PLUGIN-539: Fixed Salesforce plugin to correctly parse the schema as Avro schema to make sure all the field names are accepted by Avro

PLUGIN-517: Fixed data pipeline with BigQuery sink that failed with INVALID_ARGUMENT exception if the range specified was a macro

PLUGIN-222: Fixed Kinesis Spark Streaming source

CDAP-17746: Fixed an issue in field validation logic in pipelines with BigQuery sink that caused a NullPointerException

CDAP-17744: Fixed Schema editor to show UI validations

CDAP-17737: Fixed Conditions plugins to work with Spark 3

CDAP-17732: Fixed the Wrangler Generate UUID directive to correctly generate a universally unique identifier (UUID) of the record

CDAP-17718: Fixed advanced joins to recognize auto broadcast setting

CDAP-17717: Fixed upgraded CDAP instances to include arrow to the Error Collector

CDAP-17713: Fixed Pipeline Studio UI to send null instead of string for blank plugin properties

CDAP-17703: Fixed Pipeline Studio to use current namespace when it fetches data pipeline drafts

CDAP-17691: Fixed SecureStore API to support SYSTEM namespace

CDAP-17683: Fixed million indicator on Replication Monitoring page

CDAP-17680: Fixed Replication statistics to display on the dashboard for SQL Server

CDAP-17678: Fixed an issue where clicking the Delete button on Replication Assessment page resulted in an error for the replication job

CDAP-17653: Removed the usage of authorization token while generating session token in nodejs proxy.

CDAP-17641: Schema name is now shown when selecting tables to replicate

CDAP-17635: Fixed Replication to correctly insert rows that were previously deleted by a replication job

CDAP-17630: Data pipelines running in Spark 3 enabled Dataproc cluster no longer fail with class not found exception

CDAP-17617: Fixed Replication Overview page to display the label of the table status when you hover over the table status

CDAP-17598: Added ability to hover over metrics in the Pipeline Summary page

CDAP-17591: Fixed Wrangler completion percentage

CDAP-17584: Fixed Replication with a SQL Server source to generate rows correctly in BigQuery target table if snapshot failed and restarted

CDAP-17570: Fixed an issue where SQL Server replication job stopped processing data when the connection was reset by the SQL Server

CDAP-17568: Fixed the Replication wizard to close without error when you click the X icon to exit

CDAP-17495: Fixed an error in Replication wizard Step 3 "Select tables, columns and events to replicate" where selecting no columns for a table caused the wizard to...

20 Jan 22:36

yaojiefeng

v6.3.0

0841442

CDAP 6.3.0

Summary

This release introduces a number of new features, improvements, and bug fixes to CDAP. The main highlight of the release is:

Replication
Added metrics for amount of data processed and error count from the replicator app.
Improved the replication UI page for better user experience.

New Features

CDAP-16835 - Added support for upgrading applications via REST API. Example usage is to upgrade all pipelines in a namespace to use the latest available artifacts.

CDAP-16836 - Added new options in CDAP CLI to take URI instead of host and port combination.

CDAP-16980 - Added a new Log Viewer feature that lets users see the most recent logs.

CDAP-17355 - Added Draft count metric and created Drafts API to manage drafts in the backend.

CDAP-17418 - Added support for replicating from databases that have a "schema" concept, where a schema is a collection of database objects.

CDAP-17460 - Redesigned the Replication Detail page.

CDAP-17461 - Redesigned the Dashboard page into the Operations page.

Improvements

CDAP-16812 - Updated labels and descriptions for Service Account properties in the Dataproc provisioner.

CDAP-16815 - Added a metric records.updated in BigQuery sink. This will give a total of all the inserts, updates, and upserts into the sink.

CDAP-16918 - Introduced a new REST API for getting all application details across all namespaces.

CDAP-16929 - Added the ability to select a Custom Dataproc Image. The complete URI for the custom image should be specified.

CDAP-17015 - Updated Preview to show the number of Preview runs pending before current run (if there are any runs pending). The number of pending runs is shown under the timer in the UI.

CDAP-17065 - Disabled Spark YARN app retries since Spark already performs retries at a task level.

CDAP-17077 - Changed the auto-caching strategy in Spark pipelines to default to using disk only caching instead of memory due to common out of memory failures. Also changed the caching strategy to only cache at places that would prevent sources from being recomputed instead of the more aggressive caching previously done.

CDAP-17078 - Added a setting to consolidate multiple pipeline branches into single operations in Spark pipelines. This can improve performance in pipelines by avoiding recomputation. This can be turned on by setting a preference or runtime argument for 'spark.cdap.pipeline.consolidate.stages' to 'true'.
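As a minimal sketch of the setting described above, the Python snippet below adds the `spark.cdap.pipeline.consolidate.stages` key to a set of runtime arguments and prints them as JSON. The key and value come from the release note; the helper name `with_stage_consolidation` and the `input.path` argument are hypothetical, and how the arguments are submitted (UI, CLI, or REST) is left out.

```python
import json

# Key and value are taken from the release note; the helper name is hypothetical.
CONSOLIDATE_KEY = "spark.cdap.pipeline.consolidate.stages"


def with_stage_consolidation(runtime_args):
    """Return a copy of runtime_args with stage consolidation turned on."""
    merged = dict(runtime_args)
    merged[CONSOLIDATE_KEY] = "true"
    return merged


# "input.path" is a made-up pipeline argument used only for illustration.
args = with_stage_consolidation({"input.path": "gs://example-bucket/input"})
print(json.dumps(args, indent=2))
```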

CDAP-17095 - Added Distribution to AutoJoiner API to increase performance for skewed joins.

CDAP-17123 - Made "records.updated" metric available for GCS Batch Sink plugin.

CDAP-17130 - Added Joiner Distribution support to MapReduce and streaming pipelines.

CDAP-17179 - Added new properties Filesystem properties and Output File Prefix for GCS Sink.

CDAP-17182 - Enabled traffic compression in Runtime service.

CDAP-17198 - Added Runtime service to the system service statuses.

CDAP-17202 - Improved commit performance for sinks.

CDAP-17249 - Added documentation about Regex Path Filter property to File and GCS sources.

CDAP-17389 - Added options for master and worker disk type and fixed the Dataproc provisioner to use the configured disk settings for secondary workers on autoscale clusters.

CDAP-17425 - Exposed the number of preview records requested to source plugins.

CDAP-17428 - Changed pipeline stage consolidation to be enabled by default. This improves the performance of certain types of pipelines.

CDAP-17439 - Added support for Hadoop 3 and Spark 3 for program execution.

CDAP-17462 - Delta source developers no longer need to populate the previous row in update events if the delta source supports row_id, a unique identifier for a row.

CDAP-17484 - Replication Assessment page now displays an error when a user selects two source tables with the same name to replicate, which is not supported
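The kind of check behind this error can be sketched in a few lines of Python. The sketch assumes that selected tables are represented as (schema, table) pairs; the function name `duplicate_table_names` and the sample data are hypothetical and do not reflect how the Replication Assessment page is implemented.

```python
from collections import Counter


def duplicate_table_names(tables):
    """Return the sorted table names that were selected more than once.

    tables is an iterable of (schema, table) pairs; only the table name is compared.
    """
    counts = Counter(name for _schema, name in tables)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical selection: "orders" is picked from two different schemas.
selected = [("sales", "orders"), ("archive", "orders"), ("sales", "customers")]
duplicates = duplicate_table_names(selected)
if duplicates:
    print("Unsupported selection, duplicate table names:", ", ".join(duplicates))
```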

PLUGIN-282 - Added new Data Cacher plugin to allow users to manually cache data at certain points in a pipeline.

PLUGIN-303 - Added Distribution settings to Joiner plugin for increased performance in skewed joins.

Bug Fixes

CDAP-16797 - CDAP UI now validates Pipeline Alerts before adding to the Pipeline Studio.

CDAP-16816 - Fixed schedule properties to overwrite preferences set on the application instead of the other way around. This most visibly fixed a bug where the compute profile set on a pipeline schedule or trigger would get overwritten by the profile for the pipeline.
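The corrected precedence can be pictured as a dictionary merge in which schedule properties are applied last. The Python sketch below is only an illustration: the function name `effective_arguments` and the `profile` and `retries` keys are hypothetical placeholders, not CDAP's actual property names.

```python
def effective_arguments(app_preferences, schedule_properties):
    """Merge the two maps so that schedule properties win on conflicting keys."""
    merged = dict(app_preferences)
    merged.update(schedule_properties)  # applied last, so these values take precedence
    return merged


# Hypothetical keys: the schedule selects a different compute profile than the application.
effective = effective_arguments(
    {"profile": "default-profile", "retries": "3"},
    {"profile": "nightly-profile"},
)
print(effective)
```

With this ordering, the profile set on the schedule is kept and the application-level value is used only for keys the schedule does not set.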

CDAP-16824 - Fixed UI to show plugin properties for plugins that don't have a plugin widget.

CDAP-16845 - Fixed a bug that started running Preview for pipelines with post-run actions even if the user chooses the option to not run Preview.

CDAP-16870 - Fixed PySpark support to work with Spark 2.1.3+.

CDAP-16879 - For BigQuery sinks, if both Truncate Table and Update Table Schema are set to True, when you run the pipeline, only Truncate Table will be applied. Update Table Schema will be ignored.

CDAP-16880 - Removed schema validation from BQ sink when 'Truncate Table' option is set to True.

CDAP-16891 - Unsupported pipelines in drafts would be upgraded when users open them.

CDAP-16898 - Fixed a bug that did not fetch Preview data when the plugin label had spaces in it.

CDAP-16950 - Includes all ERROR level logs logged under the application logging context.

CDAP-16959 - Fixed an issue in Preview with runtime arguments re-rendering and losing focus when containing macros.

CDAP-16972 - Fixed an issue where Preview config would open when trying to stop a Preview.

CDAP-16975 - If there are multiple versions of a plugin, the latest version is now the default and is the version that gets added to pipelines. If the user has already chosen a specific version (older version), it defaults to that instead of the latest.
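The selection rule described above can be sketched as follows. This Python example assumes plain dotted numeric version strings (no qualifiers such as -SNAPSHOT); the function names `version_key` and `default_plugin_version` are hypothetical, and the sketch is not the UI's actual code.

```python
def version_key(version):
    """Turn '0.20.3' into (0, 20, 3) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def default_plugin_version(available, user_choice=None):
    """Prefer the version the user already chose; otherwise pick the latest."""
    if user_choice is not None and user_choice in available:
        return user_choice
    return max(available, key=version_key)


versions = ["0.9.0", "0.10.0", "0.2.5"]
print(default_plugin_version(versions))
print(default_plugin_version(versions, user_choice="0.9.0"))
```

Comparing versions as integer tuples rather than strings matters here: a plain string comparison would rank "0.9.0" above "0.10.0".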

CDAP-16976 - UI resets the default version of plugins for specific users during upgrade. When users upgrade from 6.1.2 to 6.1.3 or later, UI will reset the default version of the plugin the user has already chosen. Post upgrade, if the user uses the same plugin, UI will choose the latest version of the same plugin.

CDAP-16993 - Fixed a bug in Preview for fields that have non-string types such as bytes.

CDAP-17000 - Changed default value of spark.network.timeout to 10 minutes to make pipeline execution more stable for shuffle heavy pipelines.

CDAP-17029 - Fixed an issue that caused an extra empty row to appear when sampling GCS text files in Wrangler.

CDAP-17043 - Fixed the dropdown menu for Wrangler tabs, which previously overlapped with other UI elements and hindered use of the UI.

CDAP-17044 - Columns names are validated for BigQuery sink.

CDAP-17045 - Fixed a bug so that large pipelines with a hyphen (-) in the name overflow properly in the UI.

CDAP-17057 - Fixed a bug that did not allow a user to make further changes to preferences when saving preferences returned an error.

CDAP-17059 - Added a check to fail pipeline deployment if there is an action in the middle of the pipeline.

CDAP-17074 - Improved state transitions for starting pipelines in app fabric to increase stability if app fabric unexpectedly restarts.

CDAP-17097 - Fixed a bug that caused Splitter transforms to be unable to fetch their output ports and schemas.

CDAP-17117 - Fixed styling bug so header of Preview tab does not scroll with table.

CDAP-17121 - Fixed a bug where Preview runs failed on null values due to a NullPointerException in the JSON encoder.

CDAP-17133 - Fixed tab styles for users on Mac with system preferences set to show scrollbars always in Chrome.

CDAP-17135 - Fixed a race condition in stopping Spark program in Standalone CDAP that can cause stop to hang.

CDAP-17137 - Fixed a bug that showed the preview pipeline as stopping in the UI even when the call to stop the pipeline returned an error.

CDAP-17138 - Fixed a bug that caused an empty error banner to appear when the user stops Preview.

CDAP-17139 - Fixed styling of Preview tab so that side by side tables and record tables are aligned.

CDAP-17140 - Fixed a bug so that the error banner for a deploy failure shows failure details from the backend status message, if they exist.

CDAP-17141 - Fixed a bug that allowed a user to make unsaved config changes. The Pipeline Config button is now disabled in Preview mode while a run is in progress.

CDAP-17145 - Modified preview timer logic to use submitTime instead of pipeline run startTime, to take into account time spent in INIT and WAITING states.
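A small sketch shows why the choice of reference time matters. The timestamps below are made-up epoch seconds and the function name `preview_elapsed_seconds` is hypothetical; only the idea of measuring from submitTime rather than startTime comes from the release note.

```python
def preview_elapsed_seconds(reference_time, now):
    """Elapsed seconds since the reference time, never negative."""
    return max(0, now - reference_time)


# Hypothetical run: submitted at t=100, actually started at t=130, observed at t=160.
submit_time, start_time, now = 100, 130, 160

from_submit = preview_elapsed_seconds(submit_time, now)  # includes INIT and WAITING time
from_start = preview_elapsed_seconds(start_time, now)    # misses time before the run started
print(from_submit, from_start)
```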

CDAP-17161 - Reduced memory footprint for program execution monitoring.

CDAP-17166 - Fixed a bug that caused the setting for the number of executors in streaming pipelines to be ignored.

CDAP-17171 - Fixed horizontal tab styling to handle the Mac system setting "scrolling always on" in Chrome.

CDAP-17172 - Fixed a bug that showed banner about stopping pipeline when a pipeline was deployed after running Preview.

CDAP-17174 - Fixed a bug that doesn't allow the user to stop Preview if pipeline run has already completed.

CDAP-17213 - Pick up Spark configuration correctly from the remote Hadoop cluster for program execution.

CDAP-17217 - Fixed overflow styling for long text in preview tables.

CDAP-17224 - Fixed an issue where the Dashboard page showed the graph as full when there was no run during the selected time period.

CDAP-17225 - Fixed a bug that caused pipeline deployment to fail if the pipeline contained comments.

CDAP-17233 - Improved Wrangler error messages for incorrect syntax and errors in Wrangler command line.

CDAP-17237 - Fixed a bug where the cluster's default Hadoop settings were not being used in pipelines.

CDAP-17239 - Fixed a bug in StandaloneMain which prematurely deletes the Authorizer classpath directories.

CDAP-17243 - Hide Analytics and Rules Engine by default from UI.

CDAP-17246 - Fixed pipeline exported in ...

23 Oct 23:16

CuriousVini

v6.2.3

8ed65f5

CDAP 6.2.3

Summary

This release contains critical bugfixes to the Dataproc provisioner in CDAP.

Bug Fixes

Fixed a bug in the Existing Dataproc provisioner that caused it to check for the network unnecessarily (CDAP-17323)
Allowed the Dataproc provisioner to accept the default value of the gcp-dataproc.serviceAccount property from cdap-site.xml. This property configures which service account a Dataproc cluster uses when running the pipeline. (CDAP-17326)

Releases: cdapio/cdap

CDAP 6.7.2

CDAP 6.7.1

CDAP 6.7.0

CDAP 6.6.0

CDAP 6.5.1

Enhancements

Bug Fixes

CDAP 6.5.0

New Features

Connections

Dataproc

Namespaces

Spark 3

Transformation Pushdown

Improvements

Behavior Changes

Bug Fixes

Upgrade Notes for Spark 3

What does it mean for pipeline developers / operations?

What does it mean for plugin developers?

What to do in case of any problems?

Known Issues

Database connections

CDAP 6.4.1

New Features

Replication

Bug Fixes

CDAP 6.4.0

New Features

Datetime Data Type

Dataproc

Security

Pipeline Studio

Plugin OAUTH Support

Replication

Kubernetes

Joiner Analytics Plugin

New System Plugins for Data Pipelines

Improvements

Behavior Changes

Bug Fixes

CDAP 6.3.0

Summary

New Features

Improvements

Bug Fixes

CDAP 6.2.3

Bug Fixes