Releases: cdapio/cdap
CDAP 5.1.1
Improvements
- Google Cloud Spanner sink will create database and table if they do not exist. (CDAP-14490)
- Added a Dataset Project config property to the Google BigQuery source to allow reading from a dataset in another project. (CDAP-14542)
Bug Fixes
- Fixed an issue that caused avro, parquet, and orc classes across file, Google Cloud Storage, and S3 plugins to clash and cause pipeline failures. (CDAP-12229)
- Fixed a bug where plugins that register other plugins would not use the correct id when using the PluginSelector API. (CDAP-14511)
- Fixed a bug where upgraded CDAP instances were not able to load artifacts. (CDAP-14515)
- Fixed an issue where the configuration of a sink was overwritten by that of the source. (CDAP-14524)
- Fixed a packaging bug in kafka-plugins that prevented the plugins from being visible. (CDAP-14538)
- Fixed a bug where plugins created by other plugins would not have their macros evaluated. (CDAP-14549)
- Removed LZO as a compression option for snapshot and time partitioned fileset sinks since the codec cannot be packaged with the plugin. (CDAP-14560)
CDAP 5.1.0
Summary
This release introduces a number of new features, improvements, and bug fixes to CDAP. Some of the main highlights of the release are:
- Date and Time Support
  - Support for Date, Time, and Timestamp data types in the CDAP schema. In addition, this support is now also available in pipeline plugins and Data Preparation directives.
- Plugin Requirements
  - A way for plugins to specify certain runtime requirements, and the ability to filter available plugins based on those requirements.
- Bootstrapping
  - A method to automatically bootstrap CDAP with a given state, such as a set of deployed apps, artifacts, namespaces, and preferences.
- UI Customization
  - A way to customize the display of the CDAP UI by enabling or disabling certain features.
New Features
- Added support for Date/Time types in Preparation, along with a new directive, parse-timestamp, to convert unix timestamps in long or string form into Timestamp objects. (CDAP-14244)
- Added Date, Time, and Timestamp support in plugins (Wrangler, Google Cloud BigQuery, Google Cloud Spanner, Database). (CDAP-14245)
- Added Date, Time, and Timestamp support in the CDAP Schema. (CDAP-14021)
- Added Date, Time, and Timestamp support in the UI. (CDAP-14028)
- Added Google Cloud Spanner source and sink plugins in Pipeline and a Google Cloud Spanner connection in Preparation. (CDAP-14053)
- Added a Google Cloud PubSub realtime source. (CDAP-14185)
- Added a new user onboarding tour to CDAP. (CDAP-14088)
- Added the ability to customize the UI through themes. (CDAP-13990)
- Added a framework that can be used to bootstrap a CDAP instance. (CDAP-14022)
- Added the ability to configure system-wide provisioner properties that can be set by admins but not by users. (CDAP-13746)
- Added the capability for plugins to specify requirements, and to filter available plugins on the basis of those requirements. (CDAP-13924)
- Added REST endpoints to query the run counts of a program. (CDAP-13975)
- Added a REST endpoint to get the latest run record of multiple programs in a single call. (CDAP-14260)
- Added support for Apache Spark 2.3. (CDAP-13653)
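The run-count endpoints (CDAP-13975) can be called over plain HTTP. A minimal sketch follows; the path shape and the sandbox router port are assumptions based on CDAP's usual v3 REST layout, so check the CDAP reference manual for the authoritative routes.

```python
# Sketch of calling the run-count endpoint added in CDAP-13975.
# The URL shape and port below are assumptions, not confirmed by the notes.
import json
import urllib.request

BASE = "http://localhost:11015/v3"  # default sandbox router port (assumed)

def runcount_path(namespace, app, program_type, program):
    """Build the run-count URL for a single program (assumed shape)."""
    return f"{BASE}/namespaces/{namespace}/apps/{app}/{program_type}/{program}/runcount"

def fetch_runcount(namespace, app, program_type, program):
    """GET the run count and return the parsed JSON body."""
    with urllib.request.urlopen(runcount_path(namespace, app, program_type, program)) as resp:
        return json.load(resp)

print(runcount_path("default", "DataPipeline", "workflows", "DataPipelineWorkflow"))
```

The batched endpoint from CDAP-14260 would take a POST body listing several programs instead of one path per program.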
Improvements
- Improved runtime monitoring (which fetches program states, metadata, and logs) of remotely launched programs from the CDAP Master by using dynamic port forwarding instead of HTTPS for communication. (CDAP-13566)
- Removed duplicate classes to reduce the size of the sandbox by a couple hundred megabytes. (CDAP-13977)
- Added cdap-env.sh to allow configuring jvm options while launching the Sandbox. (CDAP-14461)
- Added support for bidirectional Field Level Lineage. (CDAP-14003)
- Added the capability for external datasets to record their schema. (CDAP-14013)
- The Dataproc provisioner will try to pick up the project id and credentials from the environment if they are not specified. (CDAP-14091)
- The Dataproc provisioner will use internal IP addresses when CDAP is in the same network as the Dataproc cluster. (CDAP-14104)
- Added the capability to always display the current dataset schema in Field Level Lineage. (CDAP-14168)
- Improved error handling in Preparation. (CDAP-13886)
- Added a FileSink batch sink, FileMove action, and FileDelete action to replace their HDFS counterparts. (CDAP-14023)
- Added a configurable jvm option to kill the CDAP process immediately when an OutOfMemory error occurs in the sandbox. (CDAP-14097)
- Added better trace logging for the dataset service. (CDAP-14135)
- Made the Google Cloud Storage, Google Cloud BigQuery, and Google Cloud Spanner connection properties (project id, service account keyfile path, temporary GCS bucket) optional. (CDAP-14386)
- The Google Cloud PubSub sink will try to create the topic if it does not exist while preparing for the run. (CDAP-14401)
- Added csv, tsv, delimited, json, and blob as formats to the S3 source and sink. (CDAP-14475)
- Added csv, tsv, delimited, json, and blob as formats to the File source. (CDAP-14321)
- Added a button on external sources and sinks to jump to the dataset detail page. (CDAP-9048)
- Added format and suppress query params to the program logs endpoint to match the program run logs endpoint. (CDAP-14040)
- Made all CDAP examples compatible with Spark 2. (CDAP-14132)
- Added worker and master disk size properties to the Dataproc provisioner. (CDAP-14220)
- Improved operational behavior of the dataset service. (CDAP-14298)
- Fixed the wrangler transform to make directives optional. If none are given, the transform is a no-op. (CDAP-14372)
- Fixed Preparation to treat files without extensions as text files. (CDAP-14397)
- Limited the number of files shown in the S3 and Google Cloud Storage browsers to 1000. (CDAP-14398)
- Enhanced the Google Cloud BigQuery sink to create the dataset if the specified dataset does not exist. (CDAP-14482)
- Increased log levels for the CDAP Sandbox so that only CDAP classes are at debug level. (CDAP-14489)
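The format and suppress query params mentioned above (CDAP-14040) can be attached to a logs request like this. Only the two parameter names come from the release note; the path shape, port, and accepted values are assumptions to verify against the CDAP logging REST documentation.

```python
# Sketch of the program logs endpoint with the new query params (CDAP-14040).
# 'format' and 'suppress' are named in the release note; everything else
# (path shape, port, example values) is an assumption.
from urllib.parse import urlencode

BASE = "http://localhost:11015/v3"  # default sandbox router port (assumed)

def program_logs_url(namespace, app, program_type, program,
                     fmt="json", suppress=None):
    path = f"{BASE}/namespaces/{namespace}/apps/{app}/{program_type}/{program}/logs"
    params = [("format", fmt)]
    for field in suppress or []:   # fields to omit from each log event (assumed)
        params.append(("suppress", field))
    return f"{path}?{urlencode(params)}"

print(program_logs_url("default", "MyApp", "services", "MyService",
                       suppress=["logLevel"]))
```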
Bug Fixes
- Fixed the 'distinct' plugin to use a drop down for the list of fields and to have a button to get the output schema. (CDAP-14468)
- Ensured that destroy() is always called for MapReduce, even if initialize() fails. (CDAP-7444)
- Fixed a bug where the Alert Publisher would not work if there was a space in the label. (CDAP-13008)
- Fixed a bug that caused Preparation to fail while parsing avro files. (CDAP-13230)
- Fixed a misleading error message about hbase classes in cloud runtimes. (CDAP-13878)
- Fixed a bug where the metric for failed profile program runs was not incremented when the run failed due to provisioning errors. (CDAP-13887)
- Fixed a bug where querying metrics by time series would be incorrect after a certain amount of time. (CDAP-13894)
- Fixed a bug where profile metrics were incorrect if an app was deleted. (CDAP-13959)
- Fixed a deprovisioning bug that occurred when cluster creation failed. (CDAP-13965)
- Fixed an error where TMS publishing was retried indefinitely if the first attempt failed. (CDAP-13988)
- Fixed a race condition in MapReduce that could cause a deadlock. (CDAP-14076)
- Fixed a resource leak in the preview feature. (CDAP-14098)
- Fixed a bug that would cause RDD versions of the dynamic scala spark plugins to fail. (CDAP-14107)
- Fixed a bug where profiles were applied to all program types instead of only workflows. (CDAP-14154)
- Fixed a race condition by ensuring that a program is started before starting runtime monitoring for it. (CDAP-14203)
- Fixed the runs count for pipelines in the UI to show the correct number instead of capping at 100. (CDAP-14211)
- Fixed an issue where the Dataproc client was not being closed, resulting in verbose error logs. (CDAP-14223)
- Fixed a bug that could cause the provisioning state of stopped program runs to be corrupted. (CDAP-14261)
- Fixed a bug that caused Preparation to be unable to list buckets in a Google Cloud Storage connection in certain environments. (CDAP-14271)
- Fixed a bug where the Dataproc provisioner was not able to provision a singlenode cluster. (CDAP-14303)
- Fixed a bug where Preparation could not read json or xml files on Google Cloud Storage. (CDAP-14390)
- Fixed dataproc provisioner ...
CDAP 5.0
Summary
- Cloud Runtime
  - Cloud Runtimes allow you to configure batch pipelines to run in a cloud environment.
  - Before the pipeline runs, a cluster is provisioned in the cloud. The pipeline is executed on that cluster, and the cluster is deleted after the run finishes.
  - Cloud Runtimes allow you to use compute resources only when you need them, enabling you to make better use of your resources.
- Metadata
  - Metadata Driven Processing: annotate metadata to custom entities such as fields in a dataset, partitions of a dataset, or files in a fileset, and access metadata from a program or plugin at runtime to facilitate metadata driven processing.
  - Field Level Lineage: APIs to register operations being performed on fields from a program or a pipeline plugin, and a platform feature to compute field level lineage based on those operations.
- Analytics
  - A simple, interactive, UI-driven approach to machine learning.
  - Lowers the bar for machine learning, allowing users of any level to understand their data and train models while preserving the switches and levers that advanced users might want to tweak.
- Operational Dashboard
  - A real-time interactive interface that visualizes program run statistics.
  - Reporting for comprehensive insights into program runs over large periods of time.
New Features
Cloud Runtime
........................
- Added Cloud Runtimes, which allow users to assign profiles to batch pipelines that control what environment the pipeline will run in. For each program run, a cluster in a cloud environment can be created for just that run, allowing efficient use of resources. (CDAP-13089)
- Added a way for users to create compute profiles from the UI to run programs in remote (cloud) environments using one of the available provisioners. (CDAP-13213)
- Allowed users to specify a compute profile in the UI to run pipelines in cloud environments. Compute profiles can be specified while running a pipeline manually, via a time schedule, or via a pipeline state based trigger. (CDAP-13206)
- Added a provisioner that allows users to run pipelines on Google Cloud Dataproc clusters. (CDAP-13094)
- Added a provisioner that can run pipelines on remote Apache Hadoop clusters. (CDAP-13774)
- Added an Amazon Elastic MapReduce provisioner that can run pipelines on AWS EMR. (CDAP-13709)
- Added support for viewing logs in CDAP for programs executing using a Cloud Runtime. (CDAP-13380)
- Added metadata, such as the pipelines, schedules, and triggers associated with profiles. Also added metrics, such as the total number of runs of a pipeline using a profile. (CDAP-13432)
- Added the ability to disable and enable a profile. (CDAP-13494)
- Added the capability to export or import compute profiles. (CDAP-13276)
- Added the ability to set the default profile at the namespace and instance levels. (CDAP-13359)
Metadata
................
- Added support for annotating metadata to custom entities. For example, a field in a dataset can now be annotated with metadata. (CDAP-13260)
- Added programmatic APIs for users to register field level operations from programs and plugins. (CDAP-13264)
- Added REST APIs to retrieve the fields that were updated for a given dataset in a given time range, a summary of how those fields were computed, and details about the operations responsible for updating those fields. (CDAP-13269)
- Added the ability to view Field Level Lineage for datasets. (CDAP-13511)
Analytics
...............
- Added CDAP Analytics, an interactive, UI-driven application that allows users to train machine learning models and use them in their pipelines to make predictions. (CDAP-13921)
Operational Dashboard
......................................
- Added a Dashboard for real-time monitoring of programs and pipelines. (CDAP-12865)
- Added a UI to generate reports on programs and pipelines that ran over a period of time. (CDAP-12901)
- Added a feature to support Reports and the Dashboard. The Dashboard provides the realtime status of program runs and future schedules. Reports is a tool for administrators to take a historical look at their applications' program runs, statistics, and performance. (CDAP-13147)
Other New Features
.................................
Data Pipelines
^^^^^^^^^^^^^^
- Added 'Error' and 'Alert' ports for plugins that support this functionality. To enable it in a plugin, in addition to emitting alerts and errors from the plugin code, users have to set "emit-errors: true" and "emit-alerts: true" in their plugin json. Users can then create connections from the 'Error' port to Error Handler plugins, and from the 'Alert' port to Alert plugins. (CDAP-12839)
- Added support for Apache Phoenix as a source in Data Pipelines. (CDAP-13045)
- Added support for Apache Phoenix as a sink in Data Pipelines. (CDAP-13499)
- Added the ability to support macro behavior for all widget types. (CDAP-12944)
- Added the ability to view all the concurrent runs of a pipeline. (CDAP-13057)
- Added the ability to view the runtime arguments, logs, and other details of a particular run of a pipeline. (CDAP-13006)
- Added UI support for Splitter plugins. (CDAP-13242)
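The Error/Alert port item above names two plugin-json flags. A minimal sketch of where they sit follows; only the emit-errors and emit-alerts keys come from the release note, while the surrounding skeleton (metadata, configuration-groups) is hypothetical and should be checked against the plugin widget-json documentation.

```python
import json

# Hypothetical plugin widget JSON. Only "emit-errors" and "emit-alerts" are
# taken from the release note (CDAP-12839); the rest is illustrative only.
plugin_json = {
    "metadata": {
        "spec-version": "1.5",    # assumed version string
    },
    "emit-errors": True,          # exposes the 'Error' port on the node
    "emit-alerts": True,          # exposes the 'Alert' port on the node
    "configuration-groups": [],   # normal widget groups would go here
}

print(json.dumps(plugin_json, indent=2))
```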
Data Preparation
^^^^^^^^^^^^^^^^
- Added a Google BigQuery connection for Data Preparation. (CDAP-13100)
- Added a point-and-click interaction to change the data type of a column in the Data Preparation UI. (CDAP-12880)
Miscellaneous
^^^^^^^^^^^^^
- Added a page to view and manage a namespace. Users can click the current namespace card in the namespace dropdown to go to the namespace's detail page. On this page, they can see the entities and profiles created in the namespace, as well as its preferences, mapping, and security configurations. (CDAP-13180)
- Added the ability to restart CDAP programs, making them resilient to YARN outages. (CDAP-12951)
- Implemented a new Administration page with two tabs, Configuration and Management. In the Configuration tab, users can view and manage all namespaces, system preferences, and system profiles. In the Management tab, users can get an overview of the system services in CDAP and scale them. (CDAP-13242)
Improvements
- Added Spark 2 support for the Kafka realtime source. (CDAP-13280)
- Added support for CDH 5.13 and 5.14. (CDAP-12727)
- Added support for EMR 5.4 through 5.7. (CDAP-11805)
- Upgraded CDAP Router to use Netty 4.1. (CDAP-6308)
- Added support for automatically restarting long running program types (Service and Flow) upon application master process failure in YARN. (CDAP-13179)
- Added support for specifying custom consumer configs in the Kafka source. (CDAP-12549)
- Added support for specifying recursive schemas. (CDAP-13143)
- Added support for passing the YARN application ID in the logging context. This helps correlate the ID of a program run in CDAP with the ID of the corresponding YARN application, thereby facilitating better debugging. (CDAP-12275)
- Added the ability to deploy plugin artifacts without requiring a parent artifact. Such plugins are available for use with any parent artifact. (CDAP-9080)
- Added the ability to import pipelines from the add entity modal (plus button). (CDAP-12274)
- Added the ability to save the runtime arguments of a pipeline as preferences, so that they do not have to be entered again. (CDAP-11844)
- Added the ability to specify dependencies to the ScalaSparkCompute Action. (CDAP-12724)
- Added the ability to update the keytab URI for a namespace's impersonation configuration. (CDAP-12426)
- Added the ability to upload a User Defined Directive (UDD) using the plus button. (CDAP-12279)
- Allowed CDAP user programs to talk to Kerberos enabled HiveServer2 in the cluster without using a keytab. (CDAP-12963)...
Cask Data Application Platform - 4.3.4
Improvements
- Macro enabled all fields in the HTTP Callback plugin. (CDAP-13116)
- Optimized the planner to reduce the amount of temporary data required in certain types of mapreduce pipelines. (CDAP-13119)
- Minor optimization to reduce the number of mappers used to read intermediate data in mapreduce pipelines. (CDAP-13122)
- Improved the schema generation for database sources. (CDAP-13139)
- Added automatic restart of long running program types (Service and Flow) upon application master process failure in YARN. (CDAP-13179)
Bug Fixes
- Fixed a bug that caused errors in the File source if it read parquet files that were not generated through Hadoop. (CDAP-12875)
- Fixed an issue where a dataset's class loader was closed before the dataset itself, preventing the dataset from closing properly. (CDAP-13110)
- Fixed a bug that caused directories to be left around if a workflow used a partitioned fileset as a local dataset. (CDAP-13120)
- Fixed a bug that caused hive Explore queries on Streams to not work. (CDAP-13123)
- Fixed a planner bug to ensure that sinks are never placed in two different mapreduce phases in the same pipeline. (CDAP-13129)
- Fixed a race condition when running multiple spark programs concurrently at a Workflow fork that could lead to workflow failure. (CDAP-13158)
- Fixed an issue with creating a namespace when the namespace principal is not a member of the namespace home's group. (CDAP-13171)
- Fixed a bug that caused completed run records to be missed when storing run state, resulting in misleading log messages about ignoring killed states. (CDAP-13191)
- Fixed a bug in FileBatchSource that prevented the ignoreFolders property from working with avro and parquet inputs. (CDAP-13192)
- Fixed an issue where inconsistencies in the schedulestore caused the scheduler service to keep exiting. (CDAP-13205)
- Fixed an issue that caused changes in program state to be ignored if the program no longer existed, resulting in the run record corrector repeatedly failing to correct run records. (CDAP-13217)
- Fixed the state of Workflow, MapReduce, and Spark programs to be correctly reflected as KILLED when the user explicitly terminates the running program. (CDAP-13218)
- Fixed directive syntaxes in point and click interactions for some date formats. (CDAP-13223)
Cask Data Application Platform - 4.3.3
Improvements
- GroupBy aggregator plugin fields are now macro enabled. (CDAP-12942)
- Allowed CDAP user programs to talk to Kerberos enabled HiveServer2 in the cluster without using a keytab. (CDAP-12963)
- Removed concurrent upgrades of HBase coprocessors since it could lead to regions getting stuck in transit. (CDAP-12974)
Bug Fixes
- Fixed a bug that prevented MapReduce AM logs from YARN from showing the right URI. (CDAP-7052)
- Added a CLI command to fetch service logs. (CDAP-7644)
- Increased the dataset changeset size and limit to integer max by default. (CDAP-12774)
- Fixed a bug where macro for output schema of a node was not saved when the user closed the node properties modal. (CDAP-12900)
- Fixed a bug where explore queries would fail against paths in HDFS encryption zones, for certain Hadoop distributions. (CDAP-12930)
- Fixed a bug where the old connection is not removed from the pipeline config when you move the connection's pointer to another node. (CDAP-12945)
- Fixed a bug in the pipeline planner where pipelines that used an action before multiple sources would either fail to deploy or deploy with an incorrect plan. (CDAP-12946)
- Fixed a dependency bug that could cause HBase region servers to deadlock during a cold start. (CDAP-12970)
- Fixed an issue with the retrieval of non-ASCII strings from Table datasets. (CDAP-13002)
- Messaging table coprocessor now gets upgraded when the underlying HBase version is changed without any change in the CDAP version. (CDAP-13021)
- Fixed a bug that prevented a parquet snapshot source and sink from being used in the same pipeline. (CDAP-13026)
- Fixed a bug in TMS that prevented correctly consuming multiple events emitted in the same transaction. (CDAP-13033)
- Made TransactionContext resilient against getTransactionAwareName() failures. (CDAP-13037)
- Fixed avro fileset plugins so that reserved hive keywords can be used as column names. (CDAP-13040)
Cask Data Application Platform - 4.3.2
New Features
- Added GCS connection to Data Prep. (CDAP-12771)
- Added S3 connection to Data Prep. (CDAP-12018)
Improvements
- Added support for EMR 5.4 through 5.7. (CDAP-11805)
- Added support for CDH 5.13.0. (CDAP-12727)
Bug Fixes
- Minimized the master's local storage usage by deleting the temporary directories created on the cdap-master for programs as soon as the programs are launched on the cluster. (CDAP-6032)
- Fixed an issue where the UI was looking for the wrong property for the SSL port. (CDAP-12682)
- Fixed a bug that caused PySpark to fail to run with Spark 2 in the local sandbox. (CDAP-12693)
- Packaged the SLF4J logging classes with the plugins rather than depending on the system settings. (CDAP-12701)
- Fixed an issue that caused the HBase Sink to fail when used alongside other sinks with the Spark execution engine. (CDAP-12731)
- Fixed a bug where the Scala Spark compiler was missing classes from the Classloader, causing compilation failures. (CDAP-12743)
- Fixed a bug that caused Spark programs to fail to run when Spark authentication is turned on. (CDAP-12752)
- Fixed an issue with running the dynamic Scala Spark plugin on Windows. The directory used to store the compiled scala classes now contains '.' as a separator instead of ':', which was causing failures on Windows machines. (CDAP-12769)
- Fixed an issue that prevented auto-fill of the schema for Datasets created by an ORC sink plugin. (CDAP-12843)
Cask Data Application Platform - 4.1.3
Improvements
- Improved memory usage of data pipeline with joiner in mapreduce execution engine. (CDAP-12541)
Bug Fixes
- Added support for CDH 5.12.0. (CDAP-12022)
- Fixed a thread leakage issue when using Spark Streaming in the SDK and unit tests. (CDAP-11939)
- Fixed a permgen memory leak when using Spark SQL in the SDK and unit tests. (CDAP-11874)
- Fixed a Spark HiveContext missing-configuration issue in CDH 5.10+. (CDAP-12557)
Cask Data Application Platform - 4.3.1
New Features
- Added a new visualization tool to give insights about data prepped in the data preparation tool. (CDAP-12592)
- Added a way to trigger invalid transaction pruning via a REST endpoint. (CDAP-12620)
- Added a UI to make HTTP requests in CDAP. (CDAP-12595)
Improvements
- Added a downgrade command to the pipeline upgrade tool, allowing users to downgrade pipelines to a previous version. (CDAP-12598)
- Improved memory usage of data pipelines with a joiner in the mapreduce execution engine. (CDAP-12541)
- Added the ability to select or clear all the checkboxes for Provided runtime arguments. (CDAP-12176)
- Fixed a performance issue with the run record corrector. (CDAP-12646)
- Added the capability to configure program container memory settings through runtime arguments and preferences. (CDAP-12380)
- Applied the extra jvm options configuration to all task containers in MapReduce. (CDAP-8499)
- Fixed a classloader leak when PySpark is used in the local sandbox. (CDAP-12546)
- Added the ability to list datasets based on a set of dataset properties. (CDAP-12593)
Bug Fixes
- MapReduce Task-related metrics are now emitted from individual tasks instead of the MapReduce driver. (CDAP-12645)
- Fixed the 'filter if missing' flow in the UI to also apply to null values in addition to empty values. (CDAP-12628)
- Fixed the fill-null-or-empty directive to allow spaces in the default value. (CDAP-12612)
- Fixed a bug where authorization could not be turned on if kerberos was disabled. (CDAP-12588)
- Fixed an issue that caused the pipeline upgrade tool to upgrade pipelines in a way that would cause UI failures when the upgraded pipeline was viewed. (CDAP-12578)
- Spark compat directories in the system artifact directory are now automatically checked, regardless of whether they are explicitly set in app.artifacts.dir. (CDAP-12577)
- Added options to enable or disable emitting program metrics and to include or skip task level information in the metrics context. These options can be used with scoping at the program and program-type level, similar to setting system resources with scoping. (CDAP-12570)
- Improved error messaging when there is an error publishing metrics in the MetricsCollection service. (CDAP-12569)
- Fixed a bug where CDAP was not able to clean up and aggregate streams in an authorization enabled environment. (CDAP-12567)
- Fixed the log message format to include the class name and line number when logged in the master log. (CDAP-12559)
- Added various improvements to the transaction system, including the ability to limit the size of a transaction's change set; better insights into the cause of transaction conflicts; improved concurrency when writing to the transaction log; better handling of border conditions during invalid transaction pruning; and ease of use for the transaction pruning diagnostic tool. (CDAP-12526)
- Fixed the units for YARN memory stats on the Administration UI page. (CDAP-12495)
- Fixed a bug where the app detail contained entity information that the user did not have any privileges on. (CDAP-12482)
- Fixed preview results for pipelines with condition stages. (CDAP-12476)
- Fixed a bug that caused failures for Hive queries using the MR execution engine on CM 5.12 clusters. (CDAP-12457)
- Fixed an issue where transaction coprocessors could sometimes not access their configuration. (CDAP-12454)
- UI: Added the ability to view the payload configuration of pipeline triggers. (CDAP-12451)
- Fixed a bug where the cache timeout did not change with the value of security.authorization.cache.ttl.secs. (CDAP-12441)
- Fixed an issue where HiveContext could not be used in Spark. (CDAP-12415)
- Added an authorization policy for adding and deleting schedules. (CDAP-12387)
- Fixed an issue where the transaction service could hang during shutdown. (CDAP-12377)
- Fixed an issue where loaded data was not consistently rendered when navigating to Data Preparation from other parts of CDAP. (CDAP-12333)
- Improved the performance of HBase operations when there are many invalid transactions. (CDAP-12314)
- Improved a previously misleading log message. (CDAP-12240)
- Fixed an issue where a hive query could fail if the configuration had too many variable substitutions. (CDAP-7651)
- Added a mechanism to clean up local datasets if the workflows creating them are killed. (CDAP-7243)
- Improved the error message in the case that a kerberos principal is deleted or a keytab is invalid during impersonation. (CDAP-7049)
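The cache TTL fix above (CDAP-12441) refers to a cdap-site.xml property. A sketch of the entry follows, assuming the standard Hadoop-style site configuration format; the property name comes from the note, but the value and description shown are illustrative, not the shipped defaults.

```xml
<!-- cdap-site.xml: illustrative entry; only the property name is taken
     from the release note, the value here is an example. -->
<property>
  <name>security.authorization.cache.ttl.secs</name>
  <value>300</value>
  <description>Seconds before cached authorization policies expire.</description>
</property>
```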
Cask Data Application Platform - 4.3.0
Summary
1. Data Pipelines:
- Support for conditional execution of parts of a pipeline
- Ability for pipelines to trigger other pipelines, enabling cross-team and cross-pipeline inter-connectivity and complex interconnected pipelines
- Improved pipeline studio with redesigned nodes, undo/redo capability, metrics
- Automated upgrade of pipelines to newer CDAP versions
- Custom icons and labels for pipeline plugins
- Operational insights into pipelines
2. Data Preparation:
- Support for User Defined Directives (UDD), so users can write their own custom directives for cleansing/preparing data.
- Ability to restrict directive usage and to alias directives, letting IT administrators control directive access
3. Governance & Security:
- Standardized authorization model
- Apache Ranger Integration for authorization of CDAP entities
4. Enhanced support for Apache Spark:
- PySpark Support so data scientists can develop their Spark logic in Python, while still taking advantage of enterprise integration capabilities of CDAP
- Spark Dataframe Support so Spark developers can access CDAP datasets as Spark DataFrames
5. New Frameworks and Tools:
- Microservices for real-time IoT use cases.
- Distributed Rules Engine, for Business Analysts to effectively manage rules for data transformation and data policy
New Features
Data Pipelines Enhancements
- Added a new splitter transform plugin type that can send output to different ports. Also added a union splitter transform that sends records to different ports depending on which type in the union each record is, and a null splitter transform that splits records based on whether the specified field is null. (CDAP-12033)
- Added a way for pipeline plugins to emit alerts, and a new AlertPublisher plugin type that publishes those alerts. Added a plugin that publishes alerts to CDAP TMS and an Apache Kafka Alert Publisher plugin to publish alerts to a Kafka topic. (CDAP-12034)
- Batch data pipelines now support condition plugin types, which can control the flow of execution of the pipeline. Condition plugins in the pipeline have access to stage statistics, such as the number of input records, output records, and error records generated by the stages that executed prior to the condition node. Also implemented an Apache Commons JEXL based condition plugin, which is available by default for batch data pipelines. (CDAP-12108)
- Plugin prepareRun and onFinish methods now run in a separate transaction per plugin, so that pipelines with many plugins will not time out. (CDAP-12167)
- All pipeline plugins now have access to the pipeline namespace and name through their context object. (CDAP-12191)
- Added a feature that allows undoing and redoing of actions in the pipeline Studio. (CDAP-9107)
- Made pipeline nodes bigger to show the version and metrics on the node. (CDAP-12057)
- Revamped pipeline connections to allow dropping a connection anywhere on a node, and to allow selecting and deleting multiple connections using the Delete key. (CDAP-12077)
- Added an automated UI flow for users to upgrade pipelines to newer CDAP versions. (CDAP-10619)
- Added visualization for pipelines in the UI. This helps visualize runs, logs/warnings, and the data flowing through each node for each run of the pipeline. (CDAP-11889)
- Added support for plugins of plugins. This allows a parent plugin to expose APIs that its own plugins can implement and extend. (CDAP-12111)
- Added the ability to support custom labels and custom icons for pipeline plugins. (CDAP-12114)
- BatchSource, BatchSink, BatchAggregator, BatchJoiner, and Transform plugins now have a way to get SettableArguments when preparing a run, which allows them to set arguments for the rest of the pipeline. (CDAP-10974)
- Runtime arguments are now available to script plugins, such as Javascript and Python, via the Context object. (CDAP-10653)
- Added a method to PluginContext that returns macro evaluated plugin properties. (CDAP-12472)
- Enhanced the add field transform plugin to add multiple fields. (CDAP-12094)
Triggers
- Added the capability to trigger programs and data pipelines based on the status of other programs and data pipelines. (CDAP-11912)
- Added the capability to use plugin properties and runtime arguments from the triggering data pipeline as runtime arguments in the triggered data pipeline. (CDAP-12382)
- Added composite AND and OR triggers. (CDAP-12232)
Data Preparation Enhancements
- Added the ability for users to connect Data Preparation to their existing data in Apache Kafka. (CDAP-11618)
- Added point and click interaction for performing various calculations on data in Data Prep. (CDAP-12092)
- Added point and click interaction for applying custom transformations in Data Prep. (CDAP-12118)
- Added point and click interaction to mask column data. (CDAP-9530)
- Added point and click interaction to encode/decode column data. (CDAP-9532)
- Added point and click interaction to parse Avro and Excel files. (CDAP-11869)
- Added point and click interaction for replacing column names in bulk. (CDAP-11977)
- Added point and click interaction for defining and incrementing a variable. (CDAP-12091)
Spark Enhancements
- Added capabilities to run PySpark programs in CDAP. (CDAP-4871)
Governance and Security Enhancements
-
Implemented the new authorization model for CDAP. The old authorization model is no longer supported. (CDAP-12134)
-
Added a new configuration
security.authorization.extension.jar.path
in cdap-site.xml which can be used to add extra classpath and is avalible to cdap security extensions. (CDAP-12317) -
Removed automatic grant/revoke privileges on CDAP entity creation/deletion. (CDAP-12100)
-
Added support for authorization on Kerberos principal for impersonation. (CDAP-12367)
-
Modified the authorization model so that read/write on an entity will not depend on its parent. (CDAP-11839)
-
Deprecated createFilter() and added a new isVisible API in AuthorizationEnforcer. Deprecated the grant/revoke APIs for EntityId and added new ones for Authorizable, which supports wildcard privileges. (CDAP-12135)
-
Removed the artifact version from authorization policies to be consistent with applications. From 4.3 onwards, CDAP does not support policies on artifact or application versions. (CDAP-12283)
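For reference, the security.authorization.extension.jar.path property mentioned above is set in cdap-site.xml, which uses the standard Hadoop-style configuration format. A sketch of such an entry (the jar path here is illustrative, not a default):

```xml
<!-- cdap-site.xml: illustrative entry; the jar path is hypothetical -->
<property>
  <name>security.authorization.extension.jar.path</name>
  <value>/opt/cdap/security/authorization-extension.jar</value>
  <description>Path to the authorization extension jar; its classpath
  is made available to CDAP security extensions.</description>
</property>
```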
Other New Features
-
Added a wizard to allow configuring and deploying microservices in the UI. (CDAP-11940)
-
Enabled GC logging for CDAP services. (CDAP-6329)
-
Added support for HDInsight 3.6. (CDAP-11448)
-
CSD now performs a version compatibility check with the active CDAP Parcel. (CDAP-4874)
-
Added live migration of metrics tables from pre-4.3 tables to the 4.3 salted metrics tables. (CDAP-12348)
-
Added the capability to salt the row keys of the metrics tables so that writes are evenly distributed and there is no region hotspotting. (CDAP-12017)
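Row-key salting, as used above, is a standard HBase technique: prefixing each key with a hash-derived bucket spreads otherwise-sequential keys (such as timestamps) across regions. A minimal sketch of the idea (illustrative only; not CDAP's actual metrics-table implementation, and the bucket count is assumed):

```python
# Sketch of row-key salting: prefix each key with a one-byte bucket
# derived from a hash of the key, so sequential keys spread across
# regions instead of hot-spotting one. Illustrative only.
import hashlib

NUM_SALT_BUCKETS = 16  # assumed bucket count

def salted_key(row_key: bytes) -> bytes:
    # Derive a stable bucket from the key itself, then prepend it.
    bucket = hashlib.md5(row_key).digest()[0] % NUM_SALT_BUCKETS
    return bytes([bucket]) + row_key

def unsalted_key(salted: bytes) -> bytes:
    # Reads strip the one-byte salt prefix to recover the original key.
    return salted[1:]

# Sequential keys land in different buckets rather than one region:
keys = [f"metric.{ts}".encode() for ts in range(1000, 1010)]
assert all(unsalted_key(salted_key(k)) == k for k in keys)
assert all(0 <= salted_key(k)[0] < NUM_SALT_BUCKETS for k in keys)
```

Scans over salted tables must fan out one scan per bucket, which is the trade-off for even write distribution.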
-
Added a REST API to check the status of the metrics processor. Topic-level processing stats can be viewed using this endpoint. (CDAP-12068)
-
Added an option to enable or disable metrics for a program through runtime arguments or preferences. This feature can also be applied system-wide by enabling or disabling metrics in cdap-site.xml. (CDAP-12070)
-
Added a global CDAP config to enable or disable metrics emission from user programs. By default, metrics are enabled. (CDAP-12290)
-
DatasetOutputCommitter's methods are now executed in the MapReduce ApplicationMaster, within OutputCommitter's commitJob/abortJob methods. The MapReduceContext.addOutput(Output.of(String, OutputFormatProvider)) API can no longer be used to add OutputFormatProviders that also implement the DatasetOutputCommi...
Cask Data Application Platform - 4.1.2
Improvements
-
Reuse network connections for TMS client. (CDAP-12020)
-
Added a way to limit the frequency of retrieving the MapReduce task report, which could cause network load for very large jobs. (CDAP-11959)
-
Added the ability to configure the HBase client scanner cache for a dataset. (CDAP-11949)
-
Added a startup check for CDAP master to error out if configurations for HBaseDDLExecutor extensions are provided but the extension jar cannot be loaded. (CDAP-11594)
-
Upgraded the IntelliJ IDEA IDE in the CDAP SDK VM to the 2017.1.3 release. (CDAP-11444)
-
Upgraded the Eclipse IDE in the CDAP SDK VM to the Neon 3 release. (CDAP-11398)
-
Added the ability to denormalize data into individual records, by splitting on delimiter text or flattening arrays, as a point-and-click directive in the Data Prep UI. (CDAP-9515)
-
Added the ability to apply some DataPrep directives on multiple columns, starting with Join columns and Swap columns. Multiple columns can be selected by checking the checkbox next to each column's name, then selecting a directive in the directive dropdown. (CDAP-9514)
-
Added the ability to format data (date/time, string formatting, etc.) as a point-and-click directive in the Data Prep UI. (CDAP-9507)
-
Added the ability to extract text using regex patterns as a point-and-click directive in the Data Prep UI. (CDAP-9523)
-
Macro arguments are now also listed in the runtime arguments of preview mode, just like when running a new pipeline. (CDAP-9096)
-
Values of macro arguments are now automatically populated and shown in the UI when running a pipeline, if those values exist as preferences. (CDAP-9094)
-
Enabled GC logging for CDAP services. (CDAP-6329)
Bug Fixes
-
Fixed a bug where the UGI provider returned old and incorrect UGI information. (CDAP-11985)
-
Fixed a bug where the wrong user was sometimes used in Explore, which resulted in failures when deleting a namespace. (CDAP-11955)
-
Fixed a bug where committed data could be removed during HBase table flush or compaction. (CDAP-11948)
-
Fixed an issue where a failed MapReduce run was marked as successful. (CDAP-11937)
-
Fixed a bug where Hydrator pipelines and other programs did not create datasets at runtime with the correct impersonated user. (CDAP-11880)
-
Fixed impersonation when upgrading datasets in the UpgradeTool. (CDAP-11815)
-
Fixed an issue with retrieving workflow state if it contains an exception without a message. (CDAP-11795)
-
The HBaseDDLExecutor implementation is now localized to the containers without adding it to the container classpath. (CDAP-11783)
-
Fixed the delete button on action plugins so that users can delete them easily. (CDAP-10488)
-
Fixed a bug where an impersonated workflow did not create local datasets with the correct impersonated user. (CDAP-9456)
-
Fixed an issue in Explore preview where the UI did not display boolean values correctly. (CDAP-8963)
-
Fixed an issue where the Workflow driver was restarted when it ran out of memory, causing the Workflow to be executed from the start node again. (CDAP-5067)