Cask Data Application Platform - 4.3.0

@prinam prinam released this 29 Aug 19:44
· 560 commits to release/4.3 since this release

Summary

1. Data Pipelines:
- Support for conditional execution of parts of a pipeline
- Ability for pipelines to trigger other pipelines for cross-team, cross-pipeline inter-connectivity, and to build complex interconnected pipelines.
- Improved pipeline studio with redesigned nodes, undo/redo capability, metrics
- Automated upgrade of pipelines to newer CDAP versions
- Custom icons and labels for pipeline plugins
- Operational insights into pipelines

2. Data Preparation:
- Support for User Defined Directives (UDD), so users can write their own custom directives for cleansing/preparing data.
- Restricting Directive Usage and ability to alias Directives for your IT Administrators to control directive access

3. Governance & Security:
- Standardized authorization model
- Apache Ranger Integration for authorization of CDAP entities

4. Enhanced support for Apache Spark:
- PySpark Support so data scientists can develop their Spark logic in Python, while still taking advantage of enterprise integration capabilities of CDAP
- Spark Dataframe Support so Spark developers can access CDAP datasets as Spark DataFrames

5. New Frameworks and Tools:
- Microservices for real-time IoT use cases.
- Distributed Rules Engine - for Business Analysts to effectively manage rules for data transformation and data policy

New Features

Data Pipelines Enhancements

  • Added a new splitter transform plugin type that can send output to different ports. Also added a union splitter transform that will send records to different ports depending on which type in the union it is and a splitter transform that splits records based on whether the specified field is null. (CDAP-12033)

  • Added a way for pipeline plugins to emit alerts, and a new AlertPublisher plugin type that publishes those alerts. Added a plugin that publishes alerts to CDAP TMS and an Apache Kafka Alert Publisher plugin to publish alerts to a Kafka topic. (CDAP-12034)

  • Batch data pipelines now support condition plugin types, which can control the flow of execution of the pipeline. Condition plugins have access to stage statistics, such as the number of input, output, and error records, from the stages that executed before the condition node. Also added an Apache Commons JEXL-based condition plugin, which is available by default for batch data pipelines. (CDAP-12108)

  • Plugin prepareRun and onFinish methods now run in a separate transaction per plugin so that pipelines with many plugins will not time out. (CDAP-12167)

  • All pipeline plugins now have access to the pipeline namespace and name through their context object. (CDAP-12191)

  • Added a feature that allows undoing and redoing of actions in pipeline Studio. (CDAP-9107)

  • Made pipeline nodes bigger to show the version and metrics on the node. (CDAP-12057)

  • Revamped pipeline connections, to allow dropping a connection anywhere on the node, and allow selecting and deleting multiple connections using the Delete key. (CDAP-12077)

  • Added an automated UI flow for users to upgrade pipelines to newer CDAP versions. (CDAP-10619)

  • Added a visualization for pipelines in the UI. This helps visualize runs, logs/warnings, and data flowing through each node for each run of the pipeline. (CDAP-11889)

  • Added support for plugins of plugins. This allows the parent plugin to expose some APIs that its own plugins will implement and extend. (CDAP-12111)

  • Added support for custom labels and custom icons for pipeline plugins. (CDAP-12114)

  • BatchSource, BatchSink, BatchAggregator, BatchJoiner, and Transform plugins now have a way to get SettableArguments when preparing a run, which allows them to set arguments for the rest of the pipeline. (CDAP-10974)

  • Runtime arguments are now available to script plugins such as JavaScript and Python via the Context object. (CDAP-10653)

  • Added a method to PluginContext that will return macro evaluated plugin properties. (CDAP-12472)

  • Enhanced the add field transform plugin to add multiple fields. (CDAP-12094)
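
As a rough illustration of the splitter idea described in CDAP-12033, the sketch below routes records to one of two output ports depending on whether a given field is null. It is a plain-Java sketch, not the CDAP plugin API: the class name, the `split` method, the port names `null` and `non-null`, and the use of maps as records are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a null-field splitter; not the CDAP plugin API.
class NullSplitterSketch {
    // Port names are assumptions chosen for this sketch.
    static final String NULL_PORT = "null";
    static final String NON_NULL_PORT = "non-null";

    // Sends each record to the port that matches whether `field` is null or missing.
    static Map<String, List<Map<String, Object>>> split(
            List<Map<String, Object>> records, String field) {
        Map<String, List<Map<String, Object>>> ports = new HashMap<>();
        ports.put(NULL_PORT, new ArrayList<>());
        ports.put(NON_NULL_PORT, new ArrayList<>());
        for (Map<String, Object> record : records) {
            String port = record.get(field) == null ? NULL_PORT : NON_NULL_PORT;
            ports.get(port).add(record);
        }
        return ports;
    }

    public static void main(String[] args) {
        Map<String, Object> withEmail = new HashMap<>();
        withEmail.put("id", 1);
        withEmail.put("email", "a@example.com");

        Map<String, Object> withoutEmail = new HashMap<>();
        withoutEmail.put("id", 2);
        withoutEmail.put("email", null);

        List<Map<String, Object>> records = new ArrayList<>();
        records.add(withEmail);
        records.add(withoutEmail);

        Map<String, List<Map<String, Object>>> ports = split(records, "email");
        System.out.println("null port size: " + ports.get(NULL_PORT).size());
        System.out.println("non-null port size: " + ports.get(NON_NULL_PORT).size());
    }
}
```

Downstream stages would then be connected to one port or the other, so that records with a missing value can take a different path through the pipeline.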

Triggers

  • Added capabilities to trigger programs and data pipelines based on status of other programs and data pipelines. (CDAP-11912)

  • Added the capability to use plugin properties and runtime arguments from the triggering data pipeline as runtime arguments in the triggered data pipeline. (CDAP-12382)

  • Added composite AND and OR triggers. (CDAP-12232)
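
To make the composite AND/OR idea concrete, here is a minimal plain-Java sketch of how such triggers could be combined and evaluated. It does not use the CDAP scheduling API: the `Trigger` interface, the `onCompletion`, `and`, and `or` helpers, and the pipeline names are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of composite triggers; not the CDAP scheduling API.
class CompositeTriggerSketch {
    // A trigger is satisfied or not, given the set of programs that have completed.
    interface Trigger {
        boolean isSatisfied(Set<String> completedPrograms);
    }

    // Satisfied once the named program has completed.
    static Trigger onCompletion(String program) {
        return completed -> completed.contains(program);
    }

    // Satisfied only when every child trigger is satisfied.
    static Trigger and(Trigger... triggers) {
        return completed -> Arrays.stream(triggers).allMatch(t -> t.isSatisfied(completed));
    }

    // Satisfied when at least one child trigger is satisfied.
    static Trigger or(Trigger... triggers) {
        return completed -> Arrays.stream(triggers).anyMatch(t -> t.isSatisfied(completed));
    }

    public static void main(String[] args) {
        // Fire when "ingest" has completed AND either "cleanse" OR "enrich" has completed.
        Trigger trigger = and(onCompletion("ingest"),
                              or(onCompletion("cleanse"), onCompletion("enrich")));

        Set<String> completed = new HashSet<>(Arrays.asList("ingest", "enrich"));
        System.out.println("satisfied: " + trigger.isSatisfied(completed));
    }
}
```

Because `and` and `or` return the same `Trigger` type they accept, composites can be nested to any depth.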

Data Preparation Enhancements

  • Added the ability for users to connect Data Preparation to their existing data in Apache Kafka. (CDAP-11618)

  • Added point and click interaction for performing various calculations on data in Data Prep. (CDAP-12092)

  • Added point and click interaction for applying custom transformations in Data Prep. (CDAP-12118)

  • Added point and click interaction to mask column data. (CDAP-9530)

  • Added point and click interaction to encode/decode column data. (CDAP-9532)

  • Added point and click interaction to parse Avro and Excel files. (CDAP-11869)

  • Added point and click interaction for replacing column names in bulk. (CDAP-11977)

  • Added point and click interaction for defining and incrementing a variable. (CDAP-12091)

Spark Enhancements

  • Added capabilities to run PySpark programs in CDAP. (CDAP-4871)

Governance and Security Enhancements

  • Implemented the new authorization model for CDAP. The old authorization model is no longer supported. (CDAP-12134)

  • Added a new configuration, security.authorization.extension.jar.path, in cdap-site.xml, which can be used to add extra classpath entries that are available to CDAP security extensions. (CDAP-12317)

  • Removed automatic grant/revoke privileges on CDAP entity creation/deletion. (CDAP-12100)

  • Added support for authorization on Kerberos principal for impersonation. (CDAP-12367)

  • Modified the authorization model so that read/write on an entity will not depend on its parent. (CDAP-11839)

  • Deprecated createFilter() and added a new isVisible API in AuthorizationEnforcer. Deprecated the grant/revoke APIs for EntityId and added new ones for Authorizable, which support wildcard privileges. (CDAP-12135)

  • Removed version for artifacts for authorization policy to be consistent with applications. From 4.3 onwards CDAP does not support policies on artifact/application version. (CDAP-12283)

Other New Features

  • Added a wizard to allow configuring and deploying microservices in UI. (CDAP-11940)

  • Enabled GC logging for CDAP services. (CDAP-6329)

  • Added support for HDInsight 3.6. (CDAP-11448)

  • CSD now performs a version compatibility check with the active CDAP Parcel. (CDAP-4874)

  • Added live migration of metrics tables from pre-4.3 tables to the 4.3 salted metrics tables. (CDAP-12348)

  • Added capability to salt the row key of the metrics tables so that writes are evenly distributed and there is no region hot spotting. (CDAP-12017)

  • Added a REST API to check the status of metrics processor. We can view the topic level processing stats using this endpoint. (CDAP-12068)

  • Added option to disable/enable metrics for a program through runtime arguments or preferences. This feature can also be used system wide by enabling/disabling metrics in cdap-site.xml. (CDAP-12070)

  • Added a global "CDAP" config to enable/disable metrics emission from user programs. By default, metrics are enabled. (CDAP-12290)

  • DatasetOutputCommitter's methods are now executed in the MapReduce ApplicationMaster, within OutputCommitter's commitJob/abortJob methods. The MapReduceContext.addOutput(Output.of(String, OutputFormatProvider)) API can no longer be used to add OutputFormatProviders that also implement the DatasetOutputCommitter interface. (CDAP-1952)

  • Allow appending to (or overwriting) a PartitionedFileSet's partitions when using DynamicPartitioner APIs. Introduced a PartitionedFileSet.setMetadata API which now allows modifying partitions' metadata. (CDAP-12084)

  • Exposed a programmatic API to leverage Hive's functionality to concatenate a partition of a PartitionedFileSet. (CDAP-12085)

  • Workflow now allows adding configurable conditions with the lifecycle methods. (CDAP-12378)

  • Allow programs to have concurrent runs in integration test cases. (CDAP-8629)
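
The row-key salting mentioned for the metrics tables (CDAP-12017) is a general HBase technique: a short prefix derived from the key spreads otherwise sequential writes across regions. The sketch below shows the idea in plain Java under stated assumptions; it is not CDAP's implementation, and the class name, the one-byte prefix, the hash function, and the bucket count are all choices made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of row-key salting; not CDAP's actual implementation.
class SaltedRowKeySketch {
    // Number of salt buckets; an assumption for this sketch.
    static final int BUCKETS = 16;

    // Prefixes the row key with one byte derived from a hash of the key,
    // so that keys sharing a common prefix land in different key ranges.
    static byte[] salt(byte[] rowKey, int buckets) {
        if (buckets < 1 || buckets > 256) {
            throw new IllegalArgumentException("buckets must be between 1 and 256");
        }
        int bucket = Math.floorMod(Arrays.hashCode(rowKey), buckets);
        byte[] salted = new byte[rowKey.length + 1];
        salted[0] = (byte) bucket;
        System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
        return salted;
    }

    // Strips the one-byte salt prefix to recover the original row key.
    static byte[] unsalt(byte[] saltedKey) {
        return Arrays.copyOfRange(saltedKey, 1, saltedKey.length);
    }

    public static void main(String[] args) {
        byte[] key = "metric.system.cpu:1503000000".getBytes(StandardCharsets.UTF_8);
        byte[] salted = salt(key, BUCKETS);
        System.out.println("bucket: " + salted[0]);
        System.out.println("round trip ok: " + Arrays.equals(key, unsalt(salted)));
    }
}
```

The trade-off is on the read side: a scan over a logical key range has to be issued once per bucket and the results merged, since the same logical range is now spread across all salt prefixes.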

Bug Fixes

  • Removed deprecated cdap-etl-realtime artifact. (CDAP-12103)

  • Removed the deprecated cdap-etl-batch jar from packaging. (CDAP-12123)

  • Allowed user to override the InputFormat class and OutputFormat class of a FileSet at runtime. (CDAP-9150)

  • Fixed an issue with the order of HBase compatibility libraries in the class path. (CDAP-12285)

  • Fixed an issue where CDAP Sentry integration relied on every user having their own individual group. (CDAP-9125)

  • Added support for a description field in a pipeline config that will be used as the application's description if set. (CDAP-11095)

  • Reuse network connections for TMS client. (CDAP-12020)

  • Removed the existing hierarchical authorization model from CDAP. (CDAP-12143)

  • Added an optional delimiter property to the HDFS sink to allow users to configure the delimiter used to separate record fields. (CDAP-12226)

  • Individual system service status API no longer has to go through CDAP master. (CDAP-12298)

  • Removed dataset usage in the Hive source and sink, which allows it to work in Spark and fixes a race condition that could cause pipelines to fail with a transaction conflict exception. (CDAP-9953)

  • Sinks in streaming pipelines no longer have their prepareRun and onFinish methods called if the RDD for that batch is empty. (CDAP-10228)

  • Fixed CDAP to work with and publish to YARN Timeline Server in a secure environment. (CDAP-11704)

  • HBaseDDLExecutor implementation is now localized to the containers without adding it in the container classpath. (CDAP-11783)

  • Fixed a bug where the stream client gave the wrong error message when the authorization check failed for a stream read. (CDAP-11800)

  • Fixed a bug that caused pipelines and other programs to not create datasets at runtime with the correct impersonated user. (CDAP-11880)

  • Removed non-configurable properties from CSD/Ambari. (CDAP-11944)

  • Fixed a bug where committed data could be removed during HBase table flush or compaction. (CDAP-11948)

  • Fixed a bug where the wrong user was sometimes used in Explore, which caused namespace deletion to fail. (CDAP-11955)

  • Fixed PartitionedFileSet to work with CombineFileInputFormat, as input to a batch job. (CDAP-12054)

  • Fixed a bug in the pipeline planner that caused some pipelines to fail to deploy with a NoSuchElementException. (CDAP-12122)

  • Fixed a bug in MapReduce pipeline timing metrics, where time for a stage could include time spent in other stages. (CDAP-12125)

  • Fixed an issue that was causing send-to-directive to fail on derived columns in Data Prep. (CDAP-12130)

  • Fixed a bug in StructuredRecord where a union of null and at least two other types could not be set to a null value. (CDAP-12161)

  • Fixed a bug where committed files of a PartitionedFileSet could be removed during transaction rollback in the case PartitionOutput#addPartition was called for a partition that already existed. With this fix, PartitionedFileSet#getPartitionOutput should now only be called within a transaction. (CDAP-12170)

  • Fixed a bug in some MapReduce pipelines that could cause duplicate reads if sources are not properly merged into the same MapReduce. (CDAP-12193)

  • Fixed a bug that made local datasets inaccessible in a Workflow's initialize and destroy methods. (CDAP-12199)

  • Fixed a bug where the file batch source was always using a default schema instead of the actual output schema. (CDAP-12253)

  • Fixed a bug that prevented pipelines from being published when plugin artifact versions were not specified. (CDAP-12269)

  • Fixed a packaging bug that caused debian packages to include the wrong cdap-data-pipeline and cdap-data-streams artifacts for spark2. (CDAP-12284)

  • Fixed an issue where truncating a file set did not preserve its base directory's ownership and permissions. (CDAP-12351)

  • Fixed an issue where certain excessive logging could cause a deadlock in CDAP master. (CDAP-12360)

  • To execute Hive queries using the MR execution engine in a CM 5.12 cluster, the 'yarn.app.mapreduce.am.staging-dir' property needs to be set to '/user' in the YARN Configuration Safety Valve in Cloudera Manager. (CDAP-12371)