Releases: cdapio/cdap
CDAP 6.7.2
Enhancements
CDAP-19601: For new Dataproc compute profiles, changed the default value of Master Machine Type and Worker Machine Type from n2 to e2.
Bug Fixes
CDAP-19532: Fixed an issue in the Database Batch Source plugin that caused pipelines to fail during runtime when there was a column with precision of 0 in the source returned by JDBC. Now, if a column has a precision of 0, the pipeline no longer fails. This affected CDAP 6.7.1 only. Note: In the Database Batch Source, if a column has precision 0, you must change the data type to Double in the Output Schema to ensure the pipeline runs successfully.
PLUGIN-1373: In the BigQuery Sink plugin (version 0.20.3), fixed an issue that sometimes caused a NullPointerException error when trying to update table metrics.
PLUGIN-1367: In the BigQuery Sink plugin (version 0.20.3), fixed an issue that caused a NullPointerException error when the output schema was not defined.
PLUGIN-1361: In the Send Email batch pipeline alert, fixed an issue where emails failed to send when the Protocol was set to TLS.
CDAP 6.7.1
Enhancements
CDAP-19050: Enhanced the Dataproc provisioner to avoid making unneeded Compute Engine calls depending on the configuration settings.
CDAP-18336: For new Dataproc compute profiles, changed the default value of Master Machine Type from n1 to n2.
Bug Fixes
CDAP-19381: Fixed an issue in CDAP that created duplicate entries in file cache map, which resulted in multiple attempts to delete the same cache file.
Fixed an issue where the Log service left empty folders, which made the mounting of Persistent Disk slow. This caused the Log service to fail to start in a timely manner.
Fixed an issue that caused pipelines to take a long time to launch or get stuck. This was linked to I/O throttling that occurred on the underlying Persistent Disk.
CDAP-19366: Fixed an issue that caused pipelines to fail when two or more pipelines were scheduled to start simultaneously on a static Dataproc cluster. This was due to a file upload race condition.
CDAP-19353: Fixed an issue in flow control that caused App Fabric to return a 5xx error code in rare scenarios instead of 429 (Too Many Requests) if the number of concurrently launching or running pipelines was above certain thresholds.
CDAP-19276: Fixed an issue that resulted in an error when a compute profile was exported from the default namespace after switching from a custom namespace.
CDAP-19216: Fixed an issue that resulted in the following UI error when you started a pipeline multiple times and then stopped it before it completed: Program is not running.
CDAP-19211: Removed verbose logs from the BigQuery client libraries in pipeline logs.
PLUGIN-1256: Fixed an issue that caused the BigQuery Execute action plugin configured with an Encryption Key Name (CMEK) to fail when the SQL query contained DDL Statements.
PLUGIN-954: In the BigQuery Execute action plugin, added a property Store Results in a BigQuery Table in the UI, which hides the destination table related properties by default.
CDAP 6.7.0
New Features
General
Added support for mounting arbitrary volumes to CDAP system services in the CDAP operator.
Performance and Scalability
CDAP-19016: Increase pipeline run scalability.
CDAP-18837: Use system pods to enable horizontal scaling of pipeline launching. For more information, see System Workers.
Plugins
Google Dataplex Batch Source and Google Dataplex Sink system plugins are available in Preview.
Transformation Pushdown
Transformation Pushdown for joins is generally available (GA).
In Transformation Pushdown, Group By aggregation and Deduplicate aggregation are available in Preview.
CDAP-18437: Transformation Pushdown supports the BigQuery Storage Read API to improve performance when extracting data from BigQuery.
PLUGIN-1001: Added support for connections to Transformation Pushdown.
Wrangler
Added support to parse files before loading data into the Wrangler workspace. This means the recipe does not include parse directives. Now, when you create a pipeline from Wrangler, the source has the correct Format property.
Added support to allow users to import the schema for formats such as JSON and some AVRO files where schema inference is not possible before loading data into the Wrangler workspace.
Enhancements
PLUGIN-1245: In the Joiner transformation, renamed the Distribution Skewed Input Stage property to Skewed Input Stage. Changed UI label only.
PLUGIN-1118: In Google Cloud File Reader batch source and Amazon S3 batch source plugins, added the Enable Quoted Values property, which lets you treat content between quotes as a value.
PLUGIN-1107: In the Google Cloud Data Loss Prevention (DLP) Decrypt Transformation and Google Cloud Data Loss Prevention (DLP) Redact Transformation, added the Resource Location property, which lets you specify the resource location for the DLP Service. For more information, see Specifying processing locations | Data Loss Prevention Documentation | Google Cloud.
PLUGIN-1004, CDAP-18386: Improved connection management to allow users to edit connections. Removed option to view connections.
PLUGIN-984: Added support for connections to the following plugins:
CloudSQL PostgreSQL batch source
PLUGIN-968: Added support for connections in the following sinks:
PLUGIN-965: In the GCS Done File Marker post-action plugin, added the Location property, which lets you have buckets and customer-managed encryption keys in locations that are not US locations.
PLUGIN-926, PLUGIN-939: In the BigQuery Execution Action plugin and the BigQuery Argument Setter action plugin, added support for the Dataset Project ID property, which is the Project ID of the dataset that stores the query results. It's required if the dataset is in a different project than the BigQuery job.
PLUGIN-731: In BigQuery sinks, added support for BigNumeric data type.
PLUGIN-670: In the BigQuery Table Batch Source, added the ability to query any temporary table in any project when you set the Enable querying views property to Yes. Previously, you could only query views.
PLUGIN-650: In Google Data Loss Prevention plugins, added support for templates from other projects.
CDAP-18982: Added a new pipeline state for when you manually stop a pipeline run: Stopping.
CDAP-18778: In the BigQuery Execute action plugin, added the ability to look up the drive scope for the service account to read from external tables created from the drive.
CDAP-18713: Added support for setting up workload identity in separate k8s namespaces.
CDAP-18655: Improved generic Database source plugin to correctly read decimal data.
CDAP-18556: Improved Google Cloud Platform plugins to validate the Encryption Key Name property.
CDAP-18456: In the replication configurations, added the ability to enable soft deletes from a BigQuery target.
CDAP-18405: Improved connection management to allow users to browse partial hierarchies like BigQuery datasets and Dataplex zones.
CDAP-18318: Permission checks are now required for updating/viewing system service information.
CDAP-17955: Replication assessment warnings no longer block draft deployment.
CDAP-16035: In Wrangler, added support for nested arrays, such as the BigQuery STRUCT data type.
In the Amazon S3 connection and Amazon S3 batch source plugins, added Session Token property.
In the Google Cloud Storage File Reader batch source plugin, added the Allow Empty Input property.
In the Joiner transformation, added the Input with Larger Data Skew property.
In the Google Cloud Storage File Reader batch source plugin, Amazon S3 batch source plugin, and File batch source plugin, changed the Skip Header property name to Use First Row as Header.
Behavior Changes
CDAP-18990: In the Pipeline Studio, if you click Stop on a running pipeline and the pipeline does not stop within 6 hours, the pipeline is forcefully terminated.
CDAP-18918: In the Deduplicate Analytics plugin, limited the Filter Operation property to one record. If this property is not set, one random record is chosen from the group of ‘duplicate’ records.
PLUGIN-795: The BigQuery sink supports Nullable Arrays. A NULL array is converted to an empty array at insertion time.
Wrangler no longer infers all values in CSV files as Strings. Instead, it maps the columns to a corresponding data type.
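The NULL-array conversion described in PLUGIN-795 can be pictured with a minimal Python sketch. The function name and record layout are hypothetical and only illustrate the described behavior. This is not the plugin's code.

```python
from typing import Any, Dict, Iterable


def null_arrays_to_empty(record: Dict[str, Any], array_fields: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of record in which NULL (None) array fields become empty lists.

    Assumption for this sketch: a field that is missing from the record
    is treated the same way as a NULL field.
    """
    converted = dict(record)
    for name in array_fields:
        if converted.get(name) is None:
            converted[name] = []
    return converted


row = {"id": 7, "tags": None, "scores": [1, 2]}
print(null_arrays_to_empty(row, ["tags", "scores"]))
```

Non-NULL arrays pass through unchanged, and the input record is not modified.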
Bug Fixes
[PLUGIN-1210](https://c...
CDAP 6.6.0
New Features
CDAP-18653: Added one-click autoscaling for Dataproc compute profiles.
Enhancements
PLUGIN-994: Added support for Fetch Size to the following plugins:
CloudSQL PostgreSQL batch source
CDAP-18738: Dataproc Cluster Reuse. Runtime property system.profile.properties.clusterReuseEnabled is no longer required to enable cluster reuse. Default Max Idle Time is set to 30 minutes to prevent accidental cluster leak.
CDAP-18725: Added more details for pipeline success and failure metrics.
CDAP-18712: Added ability to limit published lineage messages to a configurable size to avoid out of memory errors due to large lineages.
CDAP-18651: Preview runners no longer perform any kind of access enforcement.
CDAP-18647: Added new limit of 5000 records for Previewing data in the Pipeline Studio.
CDAP-18621: Added new default value of 30 minutes for the Dataproc profile Max Idle Time property. Previously, Max Idle Time had no default value.
CDAP-18836: Added temporary namespace UPDATE enforcement for pipeline connections.
CDAP-18798: Added system.program.starting.delay.seconds metric to measure time taken by program to transition from provisioning to running state.
CDAP-18714: Added metrics for API call latency.
CDAP-18725: Added new tags (Provisioner, Cluster Status, Existing Status) to existing program failure/success metric.
CDAP-17772: Added authn/z between internal system services via token verification.
Instance Stability and Memory Usage
CDAP-18696: Added new Applications parameter (app.max.concurrent.launching) to cdap-default.xml to control back pressure on pipeline starting requests. Requests exceeding the limit will fail with 429 (Too Many Requests) status.
CDAP-18712: Added new Metadata parameter (metadata.messaging.publish.size.limit) to cdap-default.xml to limit the size of published lineage messages to avoid out of memory errors due to large lineages.
CDAP-18672: Added new Dataset parameter (data.storage.sql.scan.size.rows) to cdap-default.xml to set the number of rows fetched for database reads from PostgreSQL.
CDAP-18559, CDAP-17986: Added retries to Dataproc API calls to ensure transient errors don’t affect cluster provisioning.
CDAP-18594, CDAP-18810: Fixed a problem where a pipeline could not be deleted because its program state was not updated after retries.
CDAP-18857: Added new Applications parameter (app.artifact.parallelism.max) to cdap-default.xml that limits artifact repository initialization parallelism to prevent Out of Memory errors on App Fabric startup.
CDAP-18848: Reduced the default of the Metrics parameter (metrics.processor.queue.size) from 20000 to 1000 to prevent Out of Memory errors during metric processing.
CDAP-18791, CDAP-18627, CDAP-18553: Improved LevelDB performance and memory usage.
CDAP-18748, CDAP-18737, CDAP-18685, CDAP-18680: Improved running pipelines handling during App Fabric restarts.
CDAP-18656: Prevented App Fabric Out Of Memory error when it’s asked to retrieve a long list of pipelines within a namespace.
CDAP-18603: Added pagination to application list API.
CDAP-18586: Prevented App Fabric Out Of Memory when system argument list is too long.
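Because start requests above the app.max.concurrent.launching limit (CDAP-18696) fail with 429, a client may want to retry them after a pause. The following is a minimal sketch of one possible client-side approach, not a documented CDAP client. The `send_start` callable stands in for whatever HTTP call starts the pipeline, and the attempt count and delays are arbitrary assumptions.

```python
import time
from typing import Callable, Tuple


def start_with_backoff(
    send_start: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send_start until it returns a status other than 429 or attempts run out."""
    status, body = send_start()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = send_start()
    return status, body


# Simulated server (hypothetical): rejects the first two requests, then accepts.
responses = iter([(429, "Too Many Requests"), (429, "Too Many Requests"), (200, "OK")])
delays = []  # records requested pauses instead of actually sleeping
result = start_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. In production code the default `time.sleep` would be used.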
Bug Fixes
PLUGIN-1035: Fixed an issue that caused pipelines to fail when a Database batch source included a decimal column with precision greater than 19.
PLUGIN-1022: Fixed an issue that caused pipelines with a Conditional plugin and running on MapReduce to fail.
PLUGIN-1015: Fixed an issue that caused pipelines with a Conditional plugin and running on Spark to fail.
PLUGIN-974: Fixed an issue that caused validation to fail for GCS Multi File sinks.
Behavior Changes
CDAP-18586: The getApplicationSpecification() method in the interface io.cdap.cdap.api.schedule.ProgramStatusTriggerInfo has been removed in CDAP 6.6.0, which can cause build failures if you use this method.
CDAP 6.5.1
Enhancements
PLUGIN-883, PLUGIN-897: Added Encryption Key Name property to the following plugins so users can encrypt any new resources created by these plugins with Customer Managed Encryption Keys (CMEK):
- BigQuery Execute action
- GCS Copy action
- GCS Create action
- GCS Move action
- GCS Done File Marker Pipeline Alert
- BigQuery Batch source
- BigQuery Multi Table sink
- BigQuery Table Sink
- Google Cloud Storage sink
- Google Cloud Storage Multi File sink
- Google Cloud PubSub sink
- Google Cloud Spanner sink
- Transformation Pushdown to BigQuery
PLUGIN-898: Added Location property to GCS Copy and GCS Move action plugins to auto-create destination buckets if they do not exist before running the pipeline. Previously, the bucket had to exist before running the pipeline.
CDAP-18566: The File connection now browses the file system. For example, on a Hadoop cluster, the File connection now browses the HDFS file system. For CDAP Sandbox, the File connection still browses the local file system.
CDAP-18532: Added the following optional cdap-site.xml configs:
If router.block.request.enabled is set to true, the request router responds to every user request with a fixed, configurable response, which blocks all user requests.
If a status code is provided in router.block.request.status.code, the router responds with that status code. The default value is 503.
If a response message is provided in router.block.request.message, the router responds with that message as the response body. Otherwise, the response body is empty.
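The interplay of the three router.block.request.* settings can be modeled in a few lines of Python. This is a sketch of the behavior described in this entry, not the router's implementation. The function name and the dictionary-based configuration are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def blocked_response(conf: Dict[str, str]) -> Optional[Tuple[int, str]]:
    """Return (status, body) if user requests are blocked, or None if they pass through."""
    if conf.get("router.block.request.enabled", "false").lower() != "true":
        return None
    # Defaults as described in the entry: status 503 and an empty body.
    status = int(conf.get("router.block.request.status.code", "503"))
    body = conf.get("router.block.request.message", "")
    return status, body


print(blocked_response({}))
print(blocked_response({"router.block.request.enabled": "true"}))
print(blocked_response({
    "router.block.request.enabled": "true",
    "router.block.request.status.code": "418",
    "router.block.request.message": "Maintenance in progress",
}))
```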
CDAP-18384: Added metrics for authorization in CDAP.
Bug Fixes
CDAP-18571: Fixed an issue where messages couldn’t be retrieved for Kafka topics. This broke in 6.5.0 and is now fixed in 6.5.1.
CDAP-18538, CDAP-184254: Fixed an issue where you couldn’t create a profile for an existing Dataproc cluster.
CDAP-18529: Fixed an issue that caused pipelines to fail when Transformation Pushdown was enabled and used macros as properties.
CDAP-18446: Fixed an issue that caused long running programs, like Replication, to fail within the default Hadoop delegation token timeout. Now, these tokens get renewed so that the job keeps running.
CDAP-18439: Fixed an issue in Replication that caused the Configure button to result in an error when you clicked it.
CDAP-18428: Fixed an issue that caused pipelines to fail with an Access Denied error when the pipeline had BigQuery plugins or Transformation Pushdown configuration that included a Dataset Project ID that was in a different project than the specified Project ID:
- BigQuery sources
- BigQuery sinks
- BigQuery Multi Table sinks
- Transformation Pushdown
The Access Denied error was due to missing permissions on the service account.
To ensure pipelines with BigQuery or BigQuery Multi Table sinks and pipelines with Transformation Pushdown enabled run successfully, assign the following roles to the Project ID service account:
- BigQuery Job User role to run jobs
- GCE Storage Bucket Admin role to create a temporary bucket
If the dataset is not in the same project that the BigQuery job will run in, the Dataset Project ID service account must be granted the following role to write data to a BigQuery dataset or table:
- BigQuery Data Editor role
CDAP-18423: Fixed an issue in the GCS connection that prevented browsing and parsing files stored in folders under buckets.
CDAP-18335: Fixed an issue where the UI was unusable until an error displayed in the UI was closed by clicking the x icon.
CDAP-18318: Fixed an issue where users did not need permission to restart system services, reset system service log levels, get system service statuses, etc. Now, if authorization is enabled on the cluster, users will need to have the corresponding permissions for these system services in order to access them.
CDAP-18249: Fixed an issue where the Upload window didn’t close after uploading a user-defined directive due to missing properties in the user-defined directive json.
PLUGIN-899: Fixed an issue that caused custom formats to be unusable in the GCS source and sink.
CDAP 6.5.0
New Features
Connections
CDAP-17870: Added global connections for sources in Wrangler and data pipelines. For more information, see Managing Connections. Also added new endpoints for connections to the Pipeline Microservices.
CDAP-17924: Redesigned the Namespace Admin page.
Dataproc
CDAP-17999: Added support for labels in the Dataproc provisioner.
CDAP-17862: Added Shielded VMs as configuration settings for the Dataproc provisioner. For more information, see Google Dataproc.
CDAP-18004: Added support for running worker pods using different Kubernetes service accounts.
Namespaces
CDAP-17731: Added support to show current namespace name in the footer.
CDAP-17877, CDAP-17876: Added Connections and Drivers to Namespace Admin page for centralized management of all connections and Drivers. For more information, see JDBC Drivers and Managing Connections.
Spark 3
CDAP-17693: Added Spark 3 support for Standalone CDAP, CDAP Sandbox, and Previewing data.
CDAP-17930: Set Dataproc image version 2.0 as the default for new and upgraded pipelines. For more information, see “Upgrade Notes for Spark 3” below.
Transformation Pushdown
CDAP-17863: Added support for Transformation pushdown into BigQuery for Joiner transformations. For more information, see Using Transformation pushdown.
Improvements
CDAP-17730: Added authorization checks for preferences, logging, compute profiles, and metadata endpoints.
CDAP-17915: Added support to search for tables based on schema name when you select tables for a Replication job.
CDAP-17946: Improved error messages on the Pipeline List page.
CDAP-17973: Improved Wrangler error messages
CDAP-18024: Added support for running CDAP as a non-root user.
CDAP-18039: Added additional trace logging in the authorization flow for debugging.
CDAP-18146: Pods created by CDAP now inherit their ImagePullPolicy from the pod which created them.
CDAP-18194: Added support for BIGNUMERIC data type for BigQuery target in replication.
PLUGIN-764: Added support for Datetime data type for SQL Server batch source plugins.
PLUGIN-645: Added support for Datetime data type for Replication jobs.
Behavior Changes
CDAP-18114: MySQL, Oracle, PostgreSQL, and SQL Server batch sources, sinks, actions, and pipeline alerts are now installed by default as system plugins. Previously, these plugins were available in the Hub as user plugins.
CDAP-17898: When you use a connection in Wrangler and create a data pipeline, CDAP now creates a pipeline with the source plugin and then Wrangler transformation. In previous releases, CDAP created the pipeline with just the Wrangler transformation. You had to manually add the source plugin to the pipeline and configure it.
Bug Fixes
CDAP-17895: Fixed an issue in Replication that caused jobs to fail if more than 1000 tables were selected for replication.
CDAP-17919: Fixed an issue that caused replication jobs to hang when there were too many Delete or DDL events.
CDAP-17939: Improved the Messaging Service cleanup strategy so that it uses far fewer resources and cannot go out of memory.
CDAP-17942: Fixed an issue that caused plugin validation to fail when a macro is used within a macro function. For example: ${logicalStartTime(${date_format})}
CDAP-17943: Fixed an issue that caused pipelines with aggregations and Decimal fields to fail with an exception.
CDAP-17959: Fixed an issue that caused Wrangler to ignore all the other columns other than the given column when parsing Excel files.
CDAP-17965: For CDAP instances running on Kubernetes, fixed an issue that prevented new previews from being scheduled after the preview manager had been stopped 10 times.
CDAP-17995: Fixed Wrangler to fail pipelines upon error. In Wrangler 6.2 and above, there was a backwards incompatible change where pipelines did not fail if there was an error and instead were marked as completed.
CDAP-18002: Fixed an issue in Replication that caused jobs to fail when restarted during snapshotting.
CDAP-18012, CDAP-18003, CDAP-17853: Improved resilience of TMS.
CDAP-18060: Fixed an issue in CDAP Sandbox that caused Get Schema to fail when the source includes the Format field.
CDAP-18131: Fixed an issue where replication to BigQuery failed because the source table had column names that are reserved keywords in BigQuery.
PLUGIN-178: Fixed an issue while writing non-null values to a nullable field in BigQuery.
PLUGIN-635: Fixed an issue in the BigQuery plugins to correctly delete temporary GCS storage buckets.
PLUGIN-654: Fixed an issue that caused pipelines to fail when Pub/Sub source Subscription field was a macro.
PLUGIN-655: Fixed an issue in the BigQuery sink that caused failures when the input schema was not provided.
PLUGIN-669: Fixed Join Condition Type to be displayed in the Joiner for pipelines upgraded from versions before 6.4.0
PLUGIN-678: Fixed an issue in the BigQuery sink that caused pipelines to fail or give incorrect results.
PLUGIN-697: Fixed an issue that caused File Source Plugin validation to fail when there was a macro in the Format field.
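The nested macro from CDAP-17942, ${logicalStartTime(${date_format})}, has to be evaluated from the inside out. The following is a minimal Python sketch of inside-out resolution under simplifying assumptions. It is not CDAP's macro evaluator, and the `logicalStartTime` stand-in below is hypothetical and only echoes its argument.

```python
import re
from typing import Callable, Dict

# Matches a ${...} expression that contains no further ${...} inside it.
INNERMOST = re.compile(r"\$\{([^${}]*)\}")
# Matches a macro function call such as name(argument).
CALL = re.compile(r"^(\w+)\((.*)\)$")


def resolve(text: str, args: Dict[str, str], functions: Dict[str, Callable[[str], str]]) -> str:
    """Expand innermost macros repeatedly until the text stops changing."""

    def expand(match: "re.Match[str]") -> str:
        expr = match.group(1)
        call = CALL.match(expr)
        if call:
            return functions[call.group(1)](call.group(2))
        return args[expr]

    previous = None
    while previous != text:
        previous = text
        text = INNERMOST.sub(expand, text)
    return text


# Hypothetical stand-in: the real function formats the run's logical start time.
functions = {"logicalStartTime": lambda fmt: "start-time-as-" + fmt}
args = {"date_format": "yyyy-MM-dd"}
print(resolve("${logicalStartTime(${date_format})}", args, functions))
```

The first pass replaces ${date_format} with its value, and the second pass then sees a plain function call with a literal argument.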
Upgrade Notes for Spark 3
In CDAP 6.5.0, Spark 3 is the new default engine for Preview and for running pipelines on Dataproc. Spark 1 support was also removed from CDAP.
After an instance is upgraded to version 6.5.0, any new or upgraded pipeline that uses a Dataproc profile without an explicit image version will use the latest Dataproc image 2.0 that has Spark 3.1 bundled.
Any pipeline that was not upgraded will still use the original 1.3 Dataproc image that has Spark 2.3 bundled.
What does it mean for pipeline developers / operations?
Spark 3.1 provides a lot of improvements in different areas. See the release notes for Spark 3.0 and Spark 3.1. The main changes that affect backwards compatibility are:
Python 2 support is removed. Any PySpark code must be Python 3 compatible.
Spark 3.1 uses Scala 2.12 that is binary incompatible with Scala 2.11. Most of the code is source compatible, so recompile Scala code with Scala 2.12 if you have any issues.
What does it mean for plugin developers?
If you use any Scala code, make sure it’s binary compatible with the corresponding Scala version: 2.12 for Spark 3 and 2.11 for Spark 2 execution environment.
This can be easily achieved by referencing the proper spark2_2.11 or spark3_2.12 version of the CDAP artifact. For an example, see [CDAP-17693] Introduce spark 3 for tests, drop spark 1, create spark … by tivv · Pull Request #1364 · cdapio/hydrator-plugins. Note that you must explicitly choose the version because artifacts without the version that were previously using Spark 1 are no longer available.
If you have any dependencies on Scala-specific artifacts (e.g. Kafka), change those as well.
The new Hadoop version used in dependencies is 2.6.0 instead of 2.3.0.
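The Spark-to-Scala pairing above can be captured in a small lookup. The Python helper below is hypothetical and only encodes the two suffixes named in these notes (spark2_2.11 and spark3_2.12). It does not produce full artifact coordinates.

```python
# Scala binary version for each supported Spark major version, per the notes above.
SCALA_FOR_SPARK = {2: "2.11", 3: "2.12"}


def artifact_suffix(spark_major: int) -> str:
    """Return the spark<major>_<scala> suffix for a supported Spark major version."""
    scala = SCALA_FOR_SPARK.get(spark_major)
    if scala is None:
        # Spark 1 artifacts are no longer available, so there is no fallback.
        raise ValueError(f"Unsupported Spark major version: {spark_major}")
    return f"spark{spark_major}_{scala}"


print(artifact_suffix(2))
print(artifact_suffix(3))
```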
What to do in case of any problems?
Spark 2 is still fully supported by CDAP. If you use Dataproc, enter image version “1.3” in your provisioning profile and it will use exactly the same image CDAP 6.4 uses.
It’s highly recommended that you solve any problems found and migrate to the Spark 3 execution environment as it brings a number of enhancements including huge performance improvements.
Known Issues
Database connections
Although you can create connections for Database, MySQL, Oracle, PostgreSQL, and SQL Server sources, the plugin properties do not include Use Connection. This means that you cannot reference a connection in a database source plugin. However, from the Properties page in a database source plugin, you can select a connection to have CDAP populate the plugin properties with the connection properties.
To use the properties set in these connections in the corresponding batch source plugin, follow these steps:
In Pipeline Studio, add the source plugin to the canvas.
Click Properties.
Click Browse Database.
The Browse Database page appears with the available connections listed in the left panel.
Click the connection you want to use.
Locate the table you want to add to the source plugin and click it.
The source properties now include all of the properties from the connection.
CDAP 6.4.1
New Features
Replication
PLUGIN-645: BigQuery targets now support the Datetime data type.
Bug Fixes
CDAP-17943: Fixed an issue that caused pipelines with aggregations and Decimal fields to fail with an exception.
CDAP-17939: Improved the Messaging Service cleanup strategy so that it uses far fewer resources and cannot go out of memory.
CDAP-18012, CDAP-18003, CDAP-17853: Improved resilience of TMS.
PLUGIN-669: Fixed Join Condition Type to be displayed in the Joiner for pipelines upgraded from versions prior to 6.4.0
CDAP-17995: Fixed Wrangler to fail pipelines upon error. In Wrangler 6.2 and above, there was a backwards incompatible change where pipelines did not fail if there was an error and instead were marked as completed.
CDAP-17965: For CDAP instances running on Kubernetes, fixed an issue that prevented Previews from starting once a user has stopped a Preview run 10 times.
PLUGIN-178: Fixed an issue while writing non-null values to a nullable field in BigQuery.
PLUGIN-635: Fixed an issue in the BigQuery plugins to correctly delete temporary GCS buckets.
PLUGIN-655: Fixed an issue in the BigQuery sink that caused failures when the input schema was not provided.
PLUGIN-678: Fixed an issue in the BigQuery sink that caused pipelines to fail or give incorrect results.
PLUGIN-654: Fixed an issue that caused pipelines to fail when Pub/Sub source Subscription field was a macro.
CDAP 6.4.0
New Features
Datetime Data Type
PLUGIN-615, PLUGIN-614: Added Datetime data type support to the following plugins:
- BigQuery batch source
- BigQuery sink
- BigQuery Multi Table sink
- Bigtable batch source
- Bigtable sink
- Datastore batch source
- Datastore sink
- GCS File batch source
- GCS File sink
- GCS Multi File sink
- Spanner batch source
- Spanner sink
- File source
- File sink
- Wrangler
- Amazon S3 batch source
- Amazon S3 sink
- Database source
Also, BigQuery datetime type can be directly mapped to CDAP datetime data type.
CDAP-17684, CDAP-17636: Added support for DateTime data type in Wrangler. You can now select Parse > Datetime to transform columns of strings to datetime values and Format > Datetime to change the date and time pattern of a column of datetime values.
Added new Wrangler directives that you can use in Power Mode to transform columns of strings to datetime values: Parse as Datetime, Current Datetime, Datetime to Timestamp, Format Datetime, Timestamp to Datetime
CDAP-17620: Added support for the Datetime logical data type in CDAP schema
Dataproc
CDAP-17622: Added machine type, cluster properties, and idle TTL as configuration settings for the Dataproc provisioner. For more information, see Google Dataproc.
Security
CDAP-17709: Added support for PROXY authentication mode to nodejs proxy. CDAP UI now supports both MANAGED and PROXY modes of authentication. For more information, see Configuring Proxy Authentication Mode.
Pipeline Studio
CDAP-17549: Added support for data pipeline comments. For more information, see Adding comments to a data pipeline.
Plugin OAUTH Support
CDAP-17611: Updated Salesforce plugins to work with the new OAuth macro function
CDAP-17610: Implemented a new macro function for OAuth token exchange
CDAP-17609: Implemented new HTTP endpoints for OAuth management
Replication
CDAP-17674: Added support to allow users to specify a runtime argument, retain.staging.table, to retain BigQuery staging table to help debug issues
CDAP-17595: Added upgrade support for replication jobs
CDAP-17471: Added the ability to duplicate, export, and import replication jobs
CDAP-17337: Added property to configure dataset name in the BigQuery replication target. By default, the dataset name is the same as the Replication source database name. For more information, see Google BigQuery Target.
CDAP-16755: Added ability to add the runtime argument "event.queue.capacity" to specify the capacity of the event queue in bytes for Replication jobs. If the target plugin consumes events more slowly than the source plugin emits them, events accumulate in the queue and occupy memory. With this setting, you can cap the amount of memory the event queue uses.
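The idea behind a byte-based queue capacity such as event.queue.capacity can be illustrated with a generic bounded queue. The Python class below is a sketch of the concept only and does not reflect how Replication implements its event queue. The class name, its methods, and the choice to admit a single oversized event into an empty queue are all assumptions.

```python
import threading
from collections import deque
from typing import Deque, Optional


class ByteBoundedQueue:
    """FIFO queue whose capacity is measured in bytes rather than in items."""

    def __init__(self, capacity_bytes: int) -> None:
        self._capacity = capacity_bytes
        self._used = 0
        self._events: Deque[bytes] = deque()
        self._cond = threading.Condition()

    def put(self, event: bytes, timeout: Optional[float] = None) -> bool:
        """Block until the event fits, then enqueue it. Return False on timeout."""
        with self._cond:
            # An empty queue always accepts one event so an oversized event cannot deadlock.
            fits = lambda: not self._events or self._used + len(event) <= self._capacity
            if not self._cond.wait_for(fits, timeout):
                return False
            self._events.append(event)
            self._used += len(event)
            self._cond.notify_all()
            return True

    def get(self, timeout: Optional[float] = None) -> Optional[bytes]:
        """Block until an event is available, then dequeue it. Return None on timeout."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._events) > 0, timeout):
                return None
            event = self._events.popleft()
            self._used -= len(event)
            self._cond.notify_all()
            return event


queue = ByteBoundedQueue(capacity_bytes=10)
first = queue.put(b"12345678")               # 8 of 10 bytes used
second = queue.put(b"abcd", timeout=0.01)    # would need 12 bytes, so it times out
taken = queue.get()                          # consumer frees 8 bytes
third = queue.put(b"abcd", timeout=0.01)     # now fits
print(first, second, taken, third)
```

A slow consumer makes the producer wait instead of letting queued events grow without bound, which is the memory cap the entry describes.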
Kubernetes
CDAP-17618: Replaced Zookeeper for K8S CDAP setup with K8S secrets. For more information, see Prepare the secret token for authentication service.
CDAP-17466: Added Authentication functionality for CDAP on Kubernetes setup. For more information, see Installation on Kubernetes.
Joiner Analytics Plugin
CDAP-17607: Added advanced join conditions to the joiner plugin. This allows users to specify an arbitrary SQL condition to join on. These types of joins are typically much more costly to perform than basic join on equality. For more information, see Join Condition Type.
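One way to see why arbitrary join conditions tend to cost more than equality joins: an equality join can index one side by key and probe it, while an arbitrary condition generally has to be tested against every pair of rows. The Python sketch below is a generic illustration of that difference. It is not how the Joiner plugin or its execution engine runs joins, and all names and sample data are made up.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def equality_join(left: List[Row], right: List[Row], key: str) -> List[Tuple[Row, Row]]:
    """Hash join: build an index on the right side once, then probe it per left row."""
    index: Dict[Any, List[Row]] = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [(l, r) for l in left for r in index.get(l[key], [])]


def condition_join(left: List[Row], right: List[Row],
                   predicate: Callable[[Row, Row], bool]) -> List[Tuple[Row, Row]]:
    """Nested-loop join: evaluate the condition for every (left, right) pair."""
    return [(l, r) for l in left for r in right if predicate(l, r)]


orders = [{"id": 1, "amount": 50}, {"id": 2, "amount": 500}]
tiers = [{"tier": "small", "lo": 0, "hi": 100}, {"tier": "large", "lo": 100, "hi": 1000}]

# A range condition cannot use the key index, so all 2 x 2 pairs are checked.
matched = condition_join(orders, tiers, lambda o, t: t["lo"] <= o["amount"] < t["hi"])
print([(o["id"], t["tier"]) for o, t in matched])
```

With n left rows and m right rows, the nested loop evaluates the condition n × m times, whereas the hash join does roughly n + m work plus the size of the output.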
New System Plugins for Data Pipelines
PLUGIN-558: Added new post-action plugin, GCS Done File Marker. This post-action plugin marks the end of a pipeline run by creating and storing an empty DONE (or SUCCESS) file in the given GCS bucket upon a pipeline completion, success, or failure so that you can use it to orchestrate downstream/dependent processes.
Improvements
PLUGIN-601: Added a metric for bytes read from database source
PLUGIN-571: Added support to filter tables in the Multiple Database Tables Batch Source
PLUGIN-570: Improved error handling for Multiple Database Batch Sources and BigQuery multi-table sink that enables the pipelines to continue if one or more tables fail
CDAP-17724: Renamed replication pipelines to jobs
CDAP-17721: Added support for Kerberos login in K8s environment
CDAP-17675: Renamed Delete button to Remove in Replication Assessment report
CDAP-17670: Improved plugin initialization performance
CDAP-17650: Added tag with parent artifact detail to Dataproc cluster created by CDAP
CDAP-17645: Set a timeout on the SSH connection so that the pipeline run fails when the cluster becomes unreachable
CDAP-17642: Added namespace count to Dataplane metrics
CDAP-17621: Added the Customer-Managed Encryption Key (CMEK) configuration property for replication BigQuery target. For more information, see Google BigQuery Replication Target.
CDAP-17613: Improved Replication Assessment page to highlight SQL Server tables with Schema issues in red
CDAP-17603: Added ability to jump to any step when modifying the Replication draft
CDAP-17601: Improved performance by loading data directly into the target table during replication snapshot process
CDAP-17597: Added poll metrics in Overview and Monitoring in Replication detail view
CDAP-17583: Improved Performance for Replication
CDAP-17582: Added ability to pass additional properties for Debezium and jdbc drivers for replication sources
CDAP-17482: Added ability to start Replication app from a last known checkpoint.
CDAP-17474: Added support for configuring elasticsearch TLS connection to trust all certs. For more information, see Elasticsearch.
CDAP-17414: Improved Replication Table selection user experience
CDAP-17289: Improved reliability of Pub/Sub Source plugin
CDAP-17248: Added File Encoding property to Amazon S3, File and GCS File Reader batch source plugins
CDAP-17114: Removed the record view in pipeline preview for the Joiner node because it was misleading
CDAP-16548: Renamed the Staging Bucket Location property to Location in the BigQuery Target properties page. For more information, see Google BigQuery Target.
CDAP-16623: Removed multiple ways to collapse/expand the Connection menu
CDAP-16008: Added support for Kerberos Hadoop cluster in the Remote Hadoop Provisioner
CDAP-15552: Fixed Wrangler to highlight new column generated by a directive
Behavior Changes
CDAP-16180: Resolved macro to preferences during pipeline validation
In previous releases, when you validated a plugin, macros were not being resolved with preferences.
In CDAP 6.4.0, when you validate a plugin, macros now get resolved with preferences.
PLUGIN-470: Removed Multi sink runtime argument requirements, allowing users to add simple transformations in multi-source/multi-sink pipelines.
In previous releases, multi-sink plugins required the pipeline to set a runtime argument for each table, with the schema for each table.
In CDAP 6.4.0, CDAP determines the schema dynamically at runtime instead of requiring arguments to be set.
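The CDAP-16180 change above can be pictured with a small sketch. This is not CDAP's validation code; it only illustrates the idea of substituting `${name}` macros in plugin properties with values taken from preferences. The function name, the property names, and the preference values are hypothetical, and only the simple `${name}` macro form is handled.

```python
import re

# Matches the simple ${name} macro form only (no nesting, no macro functions).
MACRO_PATTERN = re.compile(r"\$\{([^${}]+)\}")


def resolve_macros(properties, preferences):
    """Return a copy of properties with ${name} macros replaced from preferences.

    Macros with no matching preference are left untouched so that they can
    still be supplied later, for example as runtime arguments.
    """
    def substitute(match):
        name = match.group(1)
        return preferences.get(name, match.group(0))

    return {key: MACRO_PATTERN.sub(substitute, value)
            for key, value in properties.items()}


# Hypothetical plugin properties and preferences, for illustration only.
plugin_properties = {"path": "gs://${bucket}/input", "format": "${format}"}
namespace_preferences = {"bucket": "my-bucket"}
resolved = resolve_macros(plugin_properties, namespace_preferences)
print(resolved)
```

In this sketch the `bucket` macro is resolved from the preferences, while `format` stays as a macro because no preference defines it.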
Bug Fixes
PLUGIN-610: Fixed Bigtable Batch Source plugin
PLUGIN-606: FTP batch source now works with empty File System Properties. See “Deprecations” below.
PLUGIN-545: Added support for strings in Min/Max aggregate functions (used in both Group By and Pivot plugins)
PLUGIN-539: Fixed Salesforce plugin to correctly parse the schema as Avro schema to make sure all the field names are accepted by Avro
PLUGIN-517: Fixed data pipeline with BigQuery sink that failed with INVALID_ARGUMENT exception if the range specified was a macro
PLUGIN-222: Fixed Kinesis Spark Streaming source
CDAP-17746: Fixed an issue in field validation logic in pipelines with BigQuery sink that caused a NullPointerException
CDAP-17744: Fixed Schema editor to show UI validations
CDAP-17737: Fixed Conditions plugins to work with Spark 3
CDAP-17732: Fixed the Wrangler Generate UUID directive to correctly generate a universally unique identifier (UUID) of the record
CDAP-17718: Fixed advanced joins to recognize auto broadcast setting
CDAP-17717: Fixed upgraded CDAP instances to include an arrow to the Error Collector
CDAP-17713: Fixed Pipeline Studio UI to send null instead of string for blank plugin properties
CDAP-17703: Fixed Pipeline Studio to use current namespace when it fetches data pipeline drafts
CDAP-17691: Fixed SecureStore API to support SYSTEM namespace
CDAP-17683: Fixed million indicator on Replication Monitoring page
CDAP-17680: Fixed Replication statistics to display on the dashboard for SQL Server
CDAP-17678: Fixed an issue where clicking the Delete button on Replication Assessment page resulted in an error for the replication job
CDAP-17653: Removed the usage of the authorization token while generating the session token in the Node.js proxy
CDAP-17641: Schema name is now shown when selecting tables to replicate
CDAP-17635: Fixed Replication to correctly insert rows that were previously deleted by a replication job
CDAP-17630: Data pipelines running in a Spark 3-enabled Dataproc cluster no longer fail with a class not found exception
CDAP-17617: Fixed Replication Overview page to display the label of the table status when you hover over the table status
CDAP-17598: Added ability to hover over metrics in the Pipeline Summary page
CDAP-17591: Fixed Wrangler completion percentage
CDAP-17584: Fixed Replication with a SQL Server source to generate rows correctly in BigQuery target table if snapshot failed and restarted
CDAP-17570: Fixed an issue where SQL Server replication job stopped processing data when the connection was reset by the SQL Server
CDAP-17568: Fixed the Replication wizard to close without error when you click the X icon to exit
CDAP-17495: Fixed an error in Replication wizard Step 3 "Select tables, columns and events to replicate" where selecting no columns for a table caused the wizard to...
CDAP 6.3.0
Summary
This release introduces a number of new features, improvements, and bug fixes to CDAP. The main highlight of the release is:
Replication
Added metrics for amount of data processed and error count from the replicator app.
Improved the replication UI page for better user experience.
New Features
CDAP-16835 - Added support for upgrading applications via REST API. Example usage is to upgrade all pipelines in a namespace to use the latest available artifacts.
CDAP-16836 - Added new options in CDAP CLI to take URI instead of host and port combination.
CDAP-16980 - Added a new Log Viewer feature that enables users to see the most recent logs.
CDAP-17355 - Added Draft count metric and created Drafts API to manage drafts in the backend.
CDAP-17418 - Added support for replicating databases that have a "schema" concept, where a schema is a collection of database objects.
CDAP-17460 - Redesigned the Replication Detail page.
CDAP-17461 - Redesigned the Dashboard page into the Operations page.
Improvements
CDAP-16812 - Updated labels and descriptions for Service Account properties in the Dataproc provisioner.
CDAP-16815 - Added a metric records.updated in BigQuery sink. This will give a total of all the inserts, updates, and upserts into the sink.
CDAP-16918 - Introduced a new REST API for getting all application details across all namespaces.
CDAP-16929 - Added the ability to select a Custom Dataproc Image. The complete URI for the custom image should be specified.
CDAP-17015 - Updated Preview to show the number of Preview runs pending before current run (if there are any runs pending). The number of pending runs is shown under the timer in the UI.
CDAP-17065 - Disabled Spark YARN app retries since Spark already performs retries at a task level.
CDAP-17077 - Changed the auto-caching strategy in Spark pipelines to default to using disk only caching instead of memory due to common out of memory failures. Also changed the caching strategy to only cache at places that would prevent sources from being recomputed instead of the more aggressive caching previously done.
CDAP-17078 - Added a setting to consolidate multiple pipeline branches into single operations in Spark pipelines. This can improve performance in pipelines by avoiding recomputation. This can be turned on by setting a preference or runtime argument for 'spark.cdap.pipeline.consolidate.stages' to 'true'.
CDAP-17095 - Added Distribution to AutoJoiner API to increase performance for skewed joins.
CDAP-17123 - Made "records.updated" metric available for GCS Batch Sink plugin.
CDAP-17130 - Added Joiner Distribution support to MapReduce and streaming pipelines.
CDAP-17179 - Added new properties, Filesystem Properties and Output File Prefix, for GCS Sink.
CDAP-17182 - Enabled traffic compression in Runtime service.
CDAP-17198 - Added Runtime service to the system service statuses.
CDAP-17202 - Improved commit performance for sinks.
CDAP-17249 - Added documentation about Regex Path Filter property to File and GCS sources.
CDAP-17389 - Added options for master and worker disk type and fixed the Dataproc provisioner to use the configured disk settings for secondary workers on autoscale clusters.
CDAP-17425 - Exposed the number of preview records requested to source plugins.
CDAP-17428 - Changed pipeline stage consolidation to be enabled by default. This improves the performance of certain types of pipelines.
CDAP-17439 - Added support for Hadoop 3 and Spark 3 for program execution.
CDAP-17462 - Delta source developers no longer need to populate previous rows in the update event if the delta source supports row_id, a unique identifier for a row.
CDAP-17484 - Replication Assessment page now displays an error when a user selects two source tables with the same name to replicate, which is not supported
PLUGIN-282 - Added new Data Cacher plugin to allow users to manually cache data at certain points in a pipeline.
PLUGIN-303 - Added Distribution settings to Joiner plugin for increased performance in skewed joins.
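Several entries above (CDAP-17095, CDAP-17130, PLUGIN-303) mention a Distribution setting for skewed joins without describing how it works. One common way to spread a hot join key is key salting, sketched below in plain Python. The use of salting, the function name, and the `factor` and `seed` parameters are assumptions made for illustration; this is not the AutoJoiner API and may not match how CDAP implements Distribution.

```python
import random
from collections import defaultdict


def salted_join(skewed_rows, small_rows, factor, seed=0):
    """Inner-join two lists of (key, value) pairs using key salting.

    Each skewed row gets a random salt in [0, factor); each small-side row
    is replicated once per salt, so a hot key is spread over `factor` buckets.
    """
    rng = random.Random(seed)

    # Replicate the small side across every salt value.
    buckets = defaultdict(list)
    for key, value in small_rows:
        for salt in range(factor):
            buckets[(key, salt)].append(value)

    # Assign one random salt to each skewed row and look up its bucket.
    joined = []
    for key, value in skewed_rows:
        salt = rng.randrange(factor)
        for other in buckets.get((key, salt), []):
            joined.append((key, value, other))
    return joined


# Hypothetical data: "u1" is the hot key on the skewed side.
purchases = [("u1", 10), ("u1", 20), ("u1", 30), ("u2", 5)]
users = [("u1", "Ada"), ("u2", "Bob")]
result = salted_join(purchases, users, factor=4)
print(result)
```

The join result is the same as an unsalted inner join; the salt only changes how the work for a hot key would be partitioned. The cost is that the smaller side is replicated `factor` times.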
Bug Fixes
CDAP-16797 - CDAP UI now validates Pipeline Alerts before adding to the Pipeline Studio.
CDAP-16816 - Fixed schedule properties to overwrite preferences set on the application instead of the other way around. This most visibly fixed a bug where the compute profile set on a pipeline schedule or trigger would get overwritten by the profile for the pipeline.
CDAP-16824 - Fixed UI to show plugin properties for plugins that don't have a plugin widget.
CDAP-16845 - Fixed a bug that started running Preview for pipelines with post-run actions even if the user chooses the option to not run Preview.
CDAP-16870 - Fixed PySpark support to work with Spark 2.1.3+.
CDAP-16879 - For BigQuery sinks, if both Truncate Table and Update Table Schema are set to True, when you run the pipeline, only Truncate Table will be applied. Update Table Schema will be ignored.
CDAP-16880 - Removed schema validation from BQ sink when 'Truncate Table' option is set to True.
CDAP-16891 - Unsupported pipelines in drafts are now upgraded when users open them.
CDAP-16898 - Fixed a bug that did not fetch Preview data when the plugin label had spaces in it.
CDAP-16950 - All ERROR-level logs logged under the application logging context are now included.
CDAP-16959 - Fixed an issue in Preview with runtime arguments re-rendering and losing focus when containing macros.
CDAP-16972 - Fixed an issue where Preview config would open when trying to stop a Preview.
CDAP-16975 - If there are multiple versions of a plugin, the latest version is now the default and is the version that gets added to pipelines. If the user has already chosen a specific version (older version), it defaults to that instead of the latest.
CDAP-16976 - UI resets the default version of plugins for specific users during upgrade. When users upgrade from 6.1.2 to 6.1.3 or later, UI will reset the default version of the plugin the user has already chosen. Post upgrade, if the user uses the same plugin, UI will choose the latest version of the same plugin.
CDAP-16993 - Fixed a bug in Preview for fields that have non-string types such as bytes.
CDAP-17000 - Changed default value of spark.network.timeout to 10 minutes to make pipeline execution more stable for shuffle heavy pipelines.
CDAP-17029 - Fixed an issue that caused an extra empty row to appear when sampling GCS text files in Wrangler.
CDAP-17043 - Fixed the dropdown menu for Wrangler tabs, which overlapped other UI elements and hindered use of the UI.
CDAP-17044 - Column names are now validated for BigQuery sink.
CDAP-17045 - Fixed a bug so that large pipelines with - in the name overflow properly in the UI.
CDAP-17057 - Fixed a bug that did not allow a user to make further changes to preferences when saving preferences returned an error.
CDAP-17059 - Added a check to fail pipeline deployment if there is an action in the middle of the pipeline.
CDAP-17074 - Improved state transitions for starting pipelines in app fabric to increase stability if app fabric unexpectedly restarts.
CDAP-17097 - Fixed a bug that caused Splitter transforms to be unable to fetch their output ports and schemas.
CDAP-17117 - Fixed styling bug so header of Preview tab does not scroll with table.
CDAP-17121 - Fixed a bug where a Preview run failed on null values due to a NullPointerException in the JSON encoder.
CDAP-17133 - Fixed tab styles for users on Mac with system preferences set to show scrollbars always in Chrome.
CDAP-17135 - Fixed a race condition in stopping Spark program in Standalone CDAP that can cause stop to hang.
CDAP-17137 - Fixed a bug that showed preview pipeline stopping in UI even when call to stop pipeline returns error.
CDAP-17138 - Fixed a bug that caused an empty error banner to appear when the user stops Preview.
CDAP-17139 - Fixed styling of Preview tab so that side by side tables and record tables are aligned.
CDAP-17140 - Fixed a bug so error banner for deploy failure shows failure details from backend status message, if they exist.
CDAP-17141 - Fixed bug that allowed a user to make unsaved config changes by disabling Pipeline Config button in Preview mode when run is in progress.
CDAP-17145 - Modified preview timer logic to use submitTime instead of pipeline run startTime, to take into account time spent in INIT and WAITING states.
CDAP-17161 - Reduced memory footprint for program execution monitoring.
CDAP-17166 - Fixed a bug that caused the setting for the number of executors in streaming pipelines to be ignored.
CDAP-17171 - Fixed horizontal tab styling to handle the Mac system setting "scrolling always on" in Chrome.
CDAP-17172 - Fixed a bug that showed banner about stopping pipeline when a pipeline was deployed after running Preview.
CDAP-17174 - Fixed a bug that did not allow the user to stop Preview if the pipeline run had already completed.
CDAP-17213 - Spark configuration is now picked up correctly from the remote Hadoop cluster for program execution.
CDAP-17217 - Fixed overflow styling for long text in preview tables.
CDAP-17224 - Fixed an issue where the Dashboard page showed the graph as full when there was no run during the selected time period.
CDAP-17225 - Fixed a bug that caused pipeline deployment to fail if the pipeline contained comments.
CDAP-17233 - Improved Wrangler error messages for incorrect syntax and errors in Wrangler command line.
CDAP-17237 - Fixed a bug where the cluster's default Hadoop settings were not being used in pipelines.
CDAP-17239 - Fixed a bug in StandaloneMain that prematurely deleted the Authorizer classpath directories.
CDAP-17243 - Analytics and Rules Engine are now hidden by default in the UI.
CDAP-17246 - Fixed pipeline exported in ...
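The precedence fix in CDAP-16816 above (schedule properties overwrite application preferences, not the other way around) amounts to the order in which two maps are merged. The sketch below shows that order in plain Python. It is not CDAP code; the function name and all keys and values are hypothetical.

```python
def effective_arguments(preferences, schedule_properties):
    """Merge two argument maps; schedule properties win on conflicting keys."""
    merged = dict(preferences)          # start from application preferences
    merged.update(schedule_properties)  # schedule or trigger values override
    return merged


# Keys and values below are hypothetical, chosen only for illustration.
app_preferences = {
    "example.compute.profile": "default-profile",
    "example.batch.size": "100",
}
schedule_props = {"example.compute.profile": "nightly-profile"}
args = effective_arguments(app_preferences, schedule_props)
print(args)
```

Here the profile set on the schedule is the one that takes effect, while preferences the schedule does not mention are kept.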
CDAP 6.2.3
Summary
This release contains critical bugfixes to the Dataproc provisioner in CDAP.
Bug Fixes
- Fixed a bug in the Existing Dataproc provisioner that caused it to check for the network unnecessarily (CDAP-17323)
- Allowed the Dataproc provisioner to accept the default value of the gcp-dataproc.serviceAccount property from cdap-site.xml. This property configures which service account a Dataproc cluster uses when running the pipeline. (CDAP-17326)