Releases: cdapio/cdap
CDAP 6.2.2
Summary
This release introduces a number of improvements and bug fixes to CDAP. Some of the main highlights of the release are:
-
Joiner plugin improvements
- Added distribution support in the Joiner plugin to improve performance for skewed joins.
-
Wrangler improvements
- Added support for BigQuery views and materialized views in Wrangler.
-
BigQuery Source plugin improvements
- Added views and materialized views support to BigQuery source.
-
Preview Improvements
- Improved the scale of the preview system when CDAP is run in a Kubernetes environment. The Preview UI tab is revamped with a new record view.
New Features
- Added revamped preview tab with new Record view for large schemas. (CDAP-16690)
Improvements
-
Added support for creating autoscaling Dataproc clusters. (CDAP-16668)
-
When the system is experiencing slowness, users now see a message saying there's a delay. (CDAP-16682)
-
Improved the scalability of the preview system when running in a Kubernetes environment by separating preview runs into their own individual pods. The preview manager pod is now only responsible for handling the preview REST API. (CDAP-16712)
-
Updated Preview to show number of preview runs pending before current run (if there are any runs pending). The number of pending runs is shown under the timer in the UI. (CDAP-17015)
-
Changed the auto-caching strategy in Spark pipelines to default to using disk only caching instead of memory due to common out of memory failures. Also changed the caching strategy to only cache at places that would prevent sources from being recomputed instead of the more aggressive caching previously done. (CDAP-17077)
-
Added an experimental setting to consolidate multiple pipeline branches into single operations in Spark pipelines. This can improve performance in pipelines by avoiding recomputation. This can be turned on by setting a preference or runtime argument for 'spark.cdap.pipeline.consolidate.stages' to 'true'. (CDAP-17078)
-
Added Distribution to AutoJoiner API to increase performance for skewed joins. (CDAP-17095)
-
Make "records.updated" metric available for GCS Batch Sink plugin. (CDAP-17123)
-
Added joiner distribution support to MapReduce and streaming pipelines. (CDAP-17130)
-
Add new properties "Filesystem properties" and "Output File Prefix" for GCS Sink. (CDAP-17179)
-
Enable traffic compression in runtime service. (CDAP-17182)
-
Added Runtime service to the system service statuses. (CDAP-17198)
-
Added distribution settings to Joiner plugin for increased performance in skewed joins. (PLUGIN-303)
-
Added support for BigQuery Views and Materialized Views to Wrangler. (PLUGIN-386)
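The distribution settings for skewed joins mentioned above are not described in detail in these notes. As a rough illustration of the general idea, the sketch below uses key salting, a common technique for skewed joins: records on the skewed side get a random salt, and the other side is replicated once per salt value. The function name `salted_join` and the `factor` parameter are hypothetical, and this is not claimed to be how the Joiner plugin implements distribution.

```python
import random
from collections import defaultdict


def salted_join(skewed, small, factor, seed=0):
    """Inner-join two lists of (key, value) pairs using key salting.

    Each skewed-side record gets a random salt in [0, factor), so one hot
    key is spread over up to `factor` buckets. The other side is replicated
    once per salt value so every salted key can still find its match.
    """
    rng = random.Random(seed)

    # Replicate the smaller side across all salt values.
    buckets = defaultdict(list)
    for key, value in small:
        for salt in range(factor):
            buckets[(key, salt)].append(value)

    # Salt each skewed-side record and look up its bucket.
    joined = []
    for key, value in skewed:
        salt = rng.randrange(factor)
        for other in buckets.get((key, salt), []):
            joined.append((key, value, other))
    return joined


# Hypothetical data: "hot" is the skewed key.
skewed = [("hot", i) for i in range(6)] + [("cold", 99)]
small = [("hot", "H"), ("cold", "C")]
result = salted_join(skewed, small, factor=3)
```

The join result is the same as an unsalted inner join. In a distributed engine the salted keys would be what gets shuffled, so a single hot key no longer lands on one worker.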
Bug Fixes
- Clarified error message for when branches of a conditional are used as inputs to the same node. (CDAP-12499)
- Fixed bug that reset date range when navigating from dataset lineage to field level lineage. (CDAP-15214)
- Fixed issue where Dashboard will show graphs when there is no run. (CDAP-16732)
- Fixed the UI to show plugin configuration for plugins that do not have widget JSON support from the plugin artifact. (CDAP-16824)
- Fixed bug that did not fetch Preview data when the plugin label had spaces in it. (CDAP-16898)
- Fixed the dropdown menu for Wrangler tabs. The existing dropdown overlapped with other UI elements, hindering use of the UI. (CDAP-17043)
- Fixes the bug to allow large pipelines with "-" in the name to properly overflow in UI. (CDAP-17045)
- Fixed bug that did not allow user to make further changes to preferences when saving preferences returned an error. (CDAP-17057)
- Fix styling bug so header of preview tab does not scroll with table. (CDAP-17117)
- Fixed tab styles for users on Mac with system preferences set to show scrollbars always in Chrome. (CDAP-17133)
- Fixed bug that showed preview pipeline stopping in UI even when call to stop pipeline returns error. (CDAP-17137)
- Fixed a bug that caused empty error banner to appear when user stops preview. (CDAP-17138)
- Fixed styling of preview tab so that side by side tables and record tables are aligned. (CDAP-17139)
- Fixed bug so error banner for deploy failure shows failure details from backend status message, if they exist. (CDAP-17140)
- Fix bug that allowed user to make unsaved config changes by disabling pipeline config button in Preview mode when run is in progress. (CDAP-17141)
- Modified preview timer logic to use submitTime instead of pipeline run startTime, to take into account time spent in INIT and WAITING states. (CDAP-17145)
- Reduce memory footprint for program execution monitoring. (CDAP-17161)
- Fixed a bug that caused the setting for the number of executors in streaming pipelines to be ignored. (CDAP-17166)
- Fixed horizontal tab styling to handle mac system setting "scrolling always on" in chrome. (CDAP-17171)
- Fixed bug that showed banner about stopping pipeline when a pipeline was deployed after running preview. (CDAP-17172)
- Fix bug that doesn't allow user to stop preview if pipeline run has already completed. (CDAP-17174)
- Pickup Spark configuration correctly from the remote Hadoop cluster for program execution. (CDAP-17213)
- Fix overflow styling for long text in preview tables. (CDAP-17217)
- Fix an issue where Dashboard page will show the graph being full when there is no run during the time period selected. (CDAP-17224)
- Fixed a bug that caused pipeline deployment to fail if the pipeline contained comments. (CDAP-17225)
- Improved Wrangler error messages for incorrect syntax and errors in Wrangler command line. (CDAP-17233)
- Fixed a bug where the cluster's default Hadoop settings were not being used in pipelines. (CDAP-17237)
- Fixed bug in StandaloneMain which prematurely deletes the Authorizer classpath directories. (CDAP-17239)
- Hide Analytics and Rules Engine by default from UI. (CDAP-17243)
- Fixes pipeline exported in 6.1.x CDAP to be imported without changing plugin names in the pipeline. This prevents pipelines failing during preview or deployment when imported from 6.1.x version of CDAP to 6.2.x+ version. (CDAP-17246)
- Improved validations on GCS plugins to check for permissions on buckets, and improved error messages for users unable to access a GCS bucket. (PLUGIN-202)
- Fixed bug where blob file input formats were being split up in Hadoop jobs. (PLUGIN-367)
- Fixed a bug where customer credential information showed up in the validation logs. (PLUGIN-369)
- Fixed user experience issue where Bigtable sink and source plugins may fail deployment if they are unable to connect to the Bigtable service. (PLUGIN-372)
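The preview timer change (CDAP-17145) can be illustrated with a small sketch. The field names `submitTime` and `startTime` come from the note above; the record shape (a dict) and the units (seconds) are assumptions made for this example only.

```python
def elapsed_since_submit(run, now):
    """Elapsed time measured from submission, so time spent in
    INIT and WAITING states is included."""
    return max(0, now - run["submitTime"])


def elapsed_since_start(run, now):
    """Elapsed time measured from run start, which ignores any
    time spent waiting before the run actually began."""
    start = run.get("startTime")
    return 0 if start is None else max(0, now - start)


# Hypothetical run: submitted at t=100, started at t=130, observed at t=160.
run = {"submitTime": 100, "startTime": 130}
now = 160
since_submit = elapsed_since_submit(run, now)
since_start = elapsed_since_start(run, now)
```

With these values the submit-based timer reports the 30 seconds of waiting that a start-based timer would hide.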
CDAP 6.1.4
Summary
This release provides performance and scalability improvements that increase developer productivity and optimize pipeline runtime performance. The release includes scaled-up previews that support up to 50 concurrent runs, capabilities to handle large and complex schemas in Pipeline Studio, an enhanced log viewer, and other critical improvements and fixes. Some of the highlights are:
-
Features
- Added support to create autoscaling Dataproc clusters.
- Added schema support feature in the UI to edit precision and scale.
- Improved memory performance in pipelines by utilizing disk only auto-caching strategy.
-
Performance and Scalability Improvements
- Supported 50 users running previews at the same time.
- Supported large and deeply nested schemas (>5K fields with 20+ levels of nesting).
- Added ability to optimize the performance of some pipelines with a new, experimental setting 'spark.cdap.pipeline.consolidate.stages'.
New Features
-
New Log Viewer feature which enables users to see the most recent logs. (CDAP-16980)
-
Added new options in CDAP CLI to take URI instead of host and port combination. (CDAP-16836)
-
Added revamped preview tab with new Record view for large schemas. (CDAP-16690)
Performance and Scalability Improvements
-
Added new Data Cacher plugin to allow users to manually cache data at certain points in a pipeline. (PLUGIN-282)
-
Enabled macro for Hostname, port and database name in database-specific plugins. (PLUGIN-174)
-
Added new properties "Filesystem properties" and "Output File Prefix" for Google Cloud Storage Sink. (CDAP-17179)
-
Added joiner distribution support to MapReduce and streaming pipelines. (CDAP-17130)
-
Make "records.updated" metric available for Google Cloud Storage Batch Sink plugin. (CDAP-17123)
-
Added Distribution to AutoJoiner API to increase performance for skewed joins. (CDAP-17095)
-
Added an experimental setting to consolidate multiple pipeline branches into single operations in Spark pipelines. This can improve performance in pipelines by avoiding recomputation. This can be turned on by setting a preference or runtime argument for 'spark.cdap.pipeline.consolidate.stages' to 'true'. (CDAP-17078)
-
Changed the auto-caching strategy in Spark pipelines to default to using disk only caching instead of memory due to common out of memory failures. Also changed the caching strategy to only cache at places that would prevent sources from being recomputed instead of the more aggressive caching previously done. (CDAP-17077)
-
Improved the scalability of the preview system when running in a Kubernetes environment by separating preview runs into their own individual pods. The preview manager pod is now only responsible for handling the preview REST API. (CDAP-16712)
-
Created Best Practices guide for Spark engine tuning. (CDAP-16697)
-
When the backend is slow to respond to requests from UI, we now show a snackbar saying there's a delay. (CDAP-16682)
-
Added support for creating autoscale dataproc cluster. (CDAP-16668)
-
Introduced new schema editor for plugins in pipelines. The schema editor in addition to supporting large schemas (>5k fields) supports the ability to edit attributes for decimal types (precision & scale). (CDAP-16850)
-
Updated Preview to show the number of preview runs pending before the current run (if there are any runs pending). The number of pending runs is shown under the timer in the UI. (CDAP-17015)
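As a small illustration of the experimental consolidation setting (CDAP-17078), the sketch below builds a runtime-argument map with 'spark.cdap.pipeline.consolidate.stages' set to 'true'. Only that key and value come from these notes. The helper name, the `input.path` argument, and the choice to print JSON are hypothetical, and how the map is submitted to CDAP is not shown.

```python
import json

# Key and value as stated in the release note.
CONSOLIDATE_KEY = "spark.cdap.pipeline.consolidate.stages"


def with_consolidation(runtime_args, enabled=True):
    """Return a copy of the runtime arguments with the experimental
    stage-consolidation setting added. Values are strings."""
    merged = dict(runtime_args)
    merged[CONSOLIDATE_KEY] = "true" if enabled else "false"
    return merged


# "input.path" is a made-up argument used only for illustration.
args = with_consolidation({"input.path": "/tmp/in"})
print(json.dumps(args, indent=2, sort_keys=True))
```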
Bug Fixes
-
Fixed user experience issue where Google Cloud Bigtable sink and source plugins may fail deployment if they are unable to connect to the Google Cloud Bigtable service. (PLUGIN-372)
-
Fixed a bug where customer credential information has shown up in the validation logs. (PLUGIN-369)
-
Fixed bug where blob file input formats were being split up in Hadoop jobs. (PLUGIN-367)
-
Fixed Google Cloud BigQuery sink with macro table key validation. (PLUGIN-245)
-
Fixed a region error message discrepancy of Google Cloud BigQuery service API on their end. (PLUGIN-206)
-
Improved validations on Google Cloud Storage plugins to check for permissions on buckets, and improved error messages for users unable to access a Google Cloud Storage bucket. (PLUGIN-202)
-
Fixed horizontal tab styling to handle mac system setting "scrolling always on" in chrome. (CDAP-17171)
-
Fixed a bug that caused the setting for the number of executors in streaming pipelines to be ignored. (CDAP-17166)
-
Reduced memory footprint for program execution monitoring. (CDAP-17161)
-
Fixed a race condition that prevented runtime monitoring from working properly when programs were launched concurrently, which left program state unable to transition and caused missing metadata. (CDAP-17154)
-
Modified preview tab so that multiple input or outputs are shown with tabs in table mode. (CDAP-17153)
-
Fixed bug that allowed user to make unsaved config changes by disabling pipeline config button in Preview mode when run is in progress. (CDAP-17141)
-
Fixed bug so error banner for deploy failure shows failure details from backend status message, if they exist. (CDAP-17140)
-
Fixed styling of preview tab so that side by side tables and record tables are aligned. (CDAP-17139)
-
Fixed a race condition in stopping Spark program in Standalone that can cause stop to hang. (CDAP-17135)
-
Fixed tab styles for users on Mac with system preferences set to show scrollbars always in Chrome. (CDAP-17133)
-
Fixed styling bug so header of preview tab does not scroll with table. (CDAP-17117)
-
Fixed a bug that caused splitter transforms to be unable to fetch their output ports and schemas. (CDAP-17097)
-
Improved state transitions for starting pipelines in app fabric to increase stability if app fabric unexpectedly restarts. (CDAP-17074)
-
Fixed bug that did not allow users to make further changes to preferences when saving preferences returned an error. (CDAP-17057)
-
Fixed the bug to allow large pipelines with "-" in the name to properly overflow in the UI. (CDAP-17045)
-
Validated column names for the BigQuery sink. (CDAP-17044)
-
Fixed the bug for showing dropdown menu for wrangler tabs to be correct. Existing dropdown overlapped with other UI elements hindering the usage of UI. (CDAP-17043)
-
Missing plugins in a pipeline would have properties button disabled with a tooltip. (CDAP-16930)
-
Preview shows logical types in iso format. (CDAP-16754)
-
Modified loading screen for preview tab. (CDAP-16747)
-
GraphQL errors now use standard page level error or error banner based on severity to display the errors. (CDAP-16414)
-
Preview displays logical types as strings. (CDAP-15869)
-
Clarified error message for when branches of a conditional are used as inputs to the same node. (CDAP-12499)
CDAP 6.2.1
Summary
This release introduces a number of new features, improvements, and bug fixes to CDAP. Some of the main highlights of the release are:
-
Joiner Performance Improvements
- Implemented performance improvements to joiner plugins. Joins can now also be performed in-memory if one side is small, and behavior on null keys can be chosen by the user.
-
Aggregator Plugin Performance Improvements
- Improved aggregator performance for Spark engine.
New Features
-
Added a new AutoJoiner API for plugins to implement. The new API leaves implementation details up to the application, which can perform join optimizations that were not possible with the older Joiner API (CDAP-16708).
-
Fixed joiner output schema generation to be deterministic, using the same ordering as they had in the input data (CDAP-16530).
-
Introduced a new aggregator API to achieve better performance when using Apache Spark engine (CDAP-16855).
-
Introduced a new REST API for getting all application details across all namespaces (CDAP-16918).
Improvements
-
Added the ability for Joiner plugins to specify whether null keys should match other null keys (CDAP-16711).
-
Added Spark parameter to limit Spark block size to prevent issues with joins (CDAP-16461).
-
Include logs emitted from the job main class as the Dataproc job logs (CDAP-16455).
-
When backend is slow to respond to requests from UI, the UI now shows a delay notification (CDAP-16682).
-
Revamped the preview tab with new Record view for large schemas (CDAP-16690).
-
Added support for upgrading applications via REST API. Example usage is to upgrade all pipelines in a namespace to use latest available artifacts (CDAP-16835).
-
Introduced a -l option in the CDAP CLI to take in URI in the new format http[s]://hostname:[port] (CDAP-16836).
-
Log viewer now allows users to see the most recent logs (CDAP-16980).
-
Limit to reading in 100 records across all input partitions in preview (CDAP-16606).
-
Removed modal showing pipeline JSON when users export pipelines. Instead, pipeline gets downloaded when users click "export pipeline" without the extra confirmation step (CDAP-16621).
-
Added payload compression support to messaging service (CDAP-16673).
-
Upgrade to use Dataproc API v1beta2 to allow endpoint config (CDAP-16676).
-
Implemented performance improvements to joiner plugins to cap the required memory to around 4gb per executor instead of scaling up as the skewness of the join goes up. Joins can now also be performed in-memory if one side is small, and behavior on null keys can be chosen by the user (CDAP-16709).
-
Added a metric records.updated in BigQuery sink. This counts the total of all the inserts, updates and upserts into the sink (CDAP-16815).
-
Added the ability to select a Custom Dataproc Image. The complete URI for the custom image should be specified (CDAP-16929).
-
UI now adds the latest version of plugin, among the list of different versions of the plugin, when added from the sidepanel in pipeline studio. If the user has already chosen a specific version (older version) it defaults to that instead of the latest (CDAP-16975).
-
UI resets the default version of plugins for specific user during upgrade. When users upgrade from 6.1.2 to 6.1.3 or later UI will reset the default version of plugin the user has already chosen. Post upgrade if the user uses the same plugin UI will choose the latest version of the same plugin (CDAP-16976).
-
Changed default value of spark.network.timeout to 10 minutes to make pipeline execution more stable for shuffle heavy pipelines (CDAP-17000).
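The URI format http[s]://hostname:[port] accepted by the new CLI option (CDAP-16836) splits into a scheme, a host, and an optional port. The sketch below shows that breakdown using Python's standard library. The function name, the example hostnames, and the port number are hypothetical, and this is not the CLI's own parsing code.

```python
from urllib.parse import urlsplit


def parse_instance_uri(uri):
    """Split an http[s]://hostname:[port] URI into its parts.

    Returns a dict with 'ssl' (bool), 'host' (str) and 'port'
    (int, or None when the URI does not specify one).
    """
    parts = urlsplit(uri)
    if parts.scheme not in ("http", "https"):
        raise ValueError("scheme must be http or https: %r" % uri)
    if not parts.hostname:
        raise ValueError("missing hostname: %r" % uri)
    return {
        "ssl": parts.scheme == "https",
        "host": parts.hostname,
        "port": parts.port,
    }


# Hypothetical host and port values, for illustration only.
secure = parse_instance_uri("https://cdap.example.com:8443")
plain = parse_instance_uri("http://localhost")
```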
Bug Fixes
-
Removed redundant validations from the BQ sink, which should reduce calls to BigQuery.getTable() (CDAP-17003).
-
Fixed joiner output schema generation to be deterministic, using the same ordering as they had in the input data (CDAP-16530).
-
Fixed preview to display logical types as strings (CDAP-15869).
-
Fixed the package references in the dynamic spark plugin to use io.cdap instead of co.cask (CDAP-16222).
-
Fixed the joiner plugin to allow a nullable key on one side and a non-nullable key on the other (CDAP-16340).
-
Fixed a bug where field lineage is incorrect when a source is directly connected to a sink (CDAP-16367).
-
Fixed regex for empty filter in Wrangler UI (CDAP-16487).
-
Fixed a bug where the GroupBy aggregator required an alias that differs from the field name (CDAP-16731).
-
Fixed a bug where memory, cpu, and engine config properties were not being set for sparkprogram plugins (CDAP-16760).
-
Fixes listing pipelines by tags in Pipelines list page (CDAP-16786).
-
CDAP UI now validates post run actions before adding to pipeline in studio (CDAP-16797).
-
Fixed a bug that started running preview for pipelines with post-run actions even if user chose option to not run preview (CDAP-16845).
-
Fixed PySpark support to work with Spark 2.1.3+ (CDAP-16870).
-
'Truncate table' and 'update schema' options if set together, will apply only WRITE_TRUNCATE to BQ job (CDAP-16879).
-
Removed schema validation from BQ sink when 'truncate table' option is set (CDAP-16880).
-
Unsupported pipelines in drafts would be upgraded when users open them (CDAP-16891).
-
Added validation to ensure that account name ends with ".blob.core.windows.net" in the Azure Blob Store plugin (CDAP-16927).
-
Includes all ERROR level logs logged under the application logging context (CDAP-16950).
-
Fixed an issue with runtime arguments re-rendering and losing focus when containing macros in preview (CDAP-16959).
-
Fixed an issue where preview config would open when trying to stop a preview (CDAP-16972).
-
Fixed a bug in preview for fields that have non-string types such as bytes (CDAP-16993).
-
Fixed an issue that caused an extra empty row to appear when sampling GCS text files in Wrangler (CDAP-17029).
-
Fixes the bug for showing dropdown menu for wrangler tabs to be correct. Existing dropdown overlapped with other UI elements hindering the usage of UI (CDAP-17043).
-
Fixes a bug to allow large pipelines with "-" in the name to properly render in UI (CDAP-17045).
-
Improved state transitions for starting pipelines in the App Fabric system service to increase stability if the service restarts unexpectedly (CDAP-17074).
-
Fixed a bug that caused splitter transforms to be unable to fetch their output ports and schemas (CDAP-17097).
-
Fixed a race condition in stopping Spark program in Standalone that can cause stop to hang (CDAP-17135).
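The Azure Blob Store account-name validation (CDAP-16927) amounts to a suffix check. A minimal sketch is below. The required suffix comes from the note above, while the function name and the extra check that something precedes the suffix are assumptions for this example, not the plugin's actual code.

```python
REQUIRED_SUFFIX = ".blob.core.windows.net"


def is_valid_account_name(account):
    """Return True when the account name ends with the required suffix
    and has a non-empty part in front of it."""
    return account.endswith(REQUIRED_SUFFIX) and len(account) > len(REQUIRED_SUFFIX)


# "mystore" is a made-up storage account used only for illustration.
ok = is_valid_account_name("mystore.blob.core.windows.net")
missing_suffix = is_valid_account_name("mystore")
```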
CDAP 6.1.3
Summary
This release introduces performance improvements as well as a few minor bug fixes. Some of the highlights are:
- Performance improvements
- Improve the performance of joiner plugins to better handle data skewness with capped memory.
- Improve the performance of aggregator by using new reducer APIs.
- Improve the program startup performance by using async operations.
- Improve the preview performance by limiting records in partitions.
- Pipeline Upgradability
- Support upgrading all pipelines in a namespace via REST API to use latest available artifacts.
- Upgrade to use Google Cloud Dataproc API v1beta2 to allow endpoint configuration.
- Improved Error Messages, Preview Enhancement and Custom Google Cloud Dataproc Image Support
- Improve error messages for program execution.
- Minor bug fixes and enhancements for preview and pipeline execution.
- Allow user to select a Custom Google Cloud Dataproc Image by specifying image URI.
Performance Improvements
- Implemented performance improvements to joiner and aggregator plugins to cap the required memory to around 4gb per executor instead of scaling up as the skewness of the join goes up. Joins can now also be performed in-memory if one side is small, and behavior on null keys can be chosen by the user. (CDAP-16709)
- Added support for rendering large schemas (>1000 fields) in pipelines UI. By default collapse complex schemas and lazy-load fields in record types. (CDAP-16656)
- Improved program startup performance by using a thread pool to start a program instead of starting from a single thread. (CDAP-16521)
- Added payload compression support to messaging service. (CDAP-16673)
- Fixed a bug in wrangler that would cause it to go out of memory when sampling a Google Cloud Storage (GCS) object that has a lot of rows. (CDAP-16724)
- Fixed a bug where concurrent preview runs were failing because SparkConf for the new preview runs was getting populated with the configurations from the previously started in-progress preview run. (CDAP-16725)
- Changed default value of spark.network.timeout to 10 minutes to make pipeline execution more stable for shuffle heavy pipelines. (CDAP-17000)
Pipeline Upgradability
- Added support for upgrading applications via REST API. Example usage is to upgrade all pipelines in a namespace to use the latest available artifacts. (CDAP-16835)
- Introduced a new REST API for getting all application details across all namespaces. (CDAP-16918)
- Labeling Google Cloud Dataproc clusters configured as Remote Hadoop Provisioners. (CDAP-16328)
- Limit to reading in 100 records across all input partitions in preview. (CDAP-16606)
- Removed modal showing pipeline JSON when users export pipelines. Instead, the pipeline gets downloaded when users click "export pipeline" without the extra confirmation step. (CDAP-16621)
- Added a metric records.updated in Google BigQuery sink. This gives a total of all the inserts, updates and upserts into the sink. (CDAP-16815)
- Added the ability to select a Custom Google Cloud Dataproc Image. The complete URI for the custom image should be specified. (CDAP-16929)
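The preview limit of 100 records across all input partitions (CDAP-16606) can be pictured as one shared cap rather than a cap per partition. The sketch below reads partitions in order until the cap is reached. The function name and the in-order reading are assumptions, since these notes do not say how the limit is distributed among partitions.

```python
from itertools import chain, islice


def read_preview_records(partitions, limit=100):
    """Return at most `limit` records in total, drawing from the
    partitions one after another instead of `limit` from each."""
    return list(islice(chain.from_iterable(partitions), limit))


# Three hypothetical partitions of 60 records each (180 in total).
partitions = [range(0, 60), range(60, 120), range(120, 180)]
sample = read_preview_records(partitions)
```

Because the iterators are lazy, the third partition is never touched once the cap is hit inside the second one.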
User Interface Fixes
- Fixed a bug that caused long field names to overflow in the Joiner plugin. (CDAP-16493)
- Fix bug that resets preferences of an app/pipeline every 10 seconds. (CDAP-16645)
- Fixed a bug where UI incorrectly showed "No schema available" when the previous stages' output schema is a macro. (CDAP-16663)
- Fixed a bug where UI overwrites scale property of a decimal schema field if 0. (CDAP-16751)
- Remove reference to detailed view of an application. UI now only shows overview of custom applications and pipeline detailed view for pipelines when navigating from control center. (CDAP-16788)
- Fixed an issue where runtime arguments would lose focus after typing certain properties. (CDAP-16801)
- Fixed a bug that started running preview for pipelines with post-run actions even if the user chose the option to not run preview. (CDAP-16845)
- Unsupported pipelines in drafts would be upgraded when users open them. (CDAP-16891)
- UI now waits for 5 minutes of inactivity in the browser before stopping all the polling logic. This prevents stopping polling for resources that might take more than 30 seconds to respond (current timeout is 30 seconds). (CDAP-16940)
- Fixed an issue with runtime arguments re-rendering and losing focus when containing macros in preview. (CDAP-16959)
- Fixed an issue where preview config would open when trying to stop a preview. (CDAP-16972)
- UI now adds the latest version of plugin, among the list of different versions of the plugin, when added from the side panel in pipeline studio. (CDAP-16975)
- UI resets the default version of plugins for specific users during upgrade to enable users to choose the latest version for pipeline studio. (CDAP-16976)
- Fixed a bug in preview for fields that have non-string types such as bytes. (CDAP-16993)
- Disable showing systems logs by default when viewing logs for a pipeline. (CDAP-16315)
- Fixed a bug to show master and worker memory in Google Cloud Dataproc compute profiles in GB. (CDAP-16240)
Plugin Fixes
- Fixed a bug where the failure message emitted by the Spark driver was not being collected. (CDAP-16055)
- Fixed the package references in the dynamic Spark plugin to use io.cdap instead of co.cask. (CDAP-16222)
- Fixed a bug with LimitingInputFormat that made DBSource plugin fail in preview mode. (CDAP-16453)
- Fixed a bug where Wrangler database connections could show more tables than those in the configured database. (CDAP-16465)
- Fixed PySpark support to work with Spark 2.1.3+. (CDAP-16870)
- Fixed a bug where memory, CPU, and engine config properties were not being set for sparkprogram plugins. (CDAP-16760)
- Fixed a bug that disallowed writing to an empty Google BigQuery table without any data or schema. (CDAP-15775)
- Added option to generate scoped GoogleCredentials with BQ and Drive scope for all BQ requests. (CDAP-16633)
- Fixed a bug in File Source to allow it to read files on Google Cloud Storage. (CDAP-16655)
- Fixed a bug that resulted in failure to update/upsert to BQ in a different project. (CDAP-16664)
- Added support for compressed file with header copying for text file based source. (CDAP-16809)
- 'Truncate table' and 'update schema' options if set together, will apply only WRITE_TRUNCATE to BQ job. (CDAP-16879)
- Removed schema validation from BQ sink when 'truncate table' option is set. (CDAP-16880)
Metadata Fixes
- Fixed a bug where field lineage is incorrect when a source is directly connected to a sink. (CDAP-16367)
API Fixes
- Unified JSON structure used by REST endpoints for getting pipeline configuration and deploying pipelines. (CDAP-16211)
- Fixed the fetch run records API to honor the limit query parameter correctly. (CDAP-16614)
Error Message Fixes
- Fixed the error message that the delimited format generates when the number of fields in the data does not match the number of fields in the schema. (CDAP-16507)
- Includes all ERROR level logs logged under the application logging context. (CDAP-16950)
Platform Fixes
- Added restrictions on the maximum number of network tags for Google Cloud Dataproc VM to 64. (CDAP-16593)
- Fixed record schema comparison to include record name. (CDAP-16736)
- Fixed schedule properties to overwrite preferences set on the application. (CDAP-16816)
CDAP 6.2.0
Summary
This release introduces a number of new features, improvements, and bug fixes to CDAP. Some of the main highlights of the release are:
- Replication
- A CDAP application that lets you easily replicate data at low latency and in real time from transactional and operational databases into analytical data warehouses.
- Google Cloud Dataproc Runtime Improvement
- The Google Cloud Dataproc runtime now uses native Dataproc APIs for job submission instead of SSH.
- Pipeline Studio Improvements
- Added the ability to perform bulk operations (copy, delete) in the pipeline Studio. Also added a right-click context menu for the Studio.
New Features
- Added JDBC plugin selector widget. (CDAP-16385)
- Introduced a new REST endpoint for fetching scheduled time for multiple programs. (CDAP-16339)
- Added new capability to start system applications using application specific config during startup. (CDAP-16243)
- Added Replication feature. (CDAP-16223)
- Added support for connecting to multiple hubs through market.base.urls property in cdap-site. (CDAP-16210)
- Added the ability to right-click on the Pipeline Studio canvas to add a Wrangler source. This allows you to add multiple Wrangler sources (source + Wrangler transform) in the same pipeline without losing context. (CDAP-16130)
- Added support for Spark 2.4. (CDAP-16107)
- Added date picker widget to allow users to specify a single date or date range in a plugin. (CDAP-15941)
- Added support to launch a job using Google Cloud Dataproc APIs. (CDAP-15633)
- Added the ability to select multiple plugins and connections in Pipeline Studio and copy or delete them in bulk. (CDAP-9014)
Improvements
- Added option to generate scoped GoogleCredentials with Google BigQuery and Google Drive scope for all Google BigQuery requests. (CDAP-16633)
- Added macro support for Format field in Google Cloud Storage plugin. (CDAP-16572)
- Added an option for Database source to replace characters in the field names. (CDAP-16525)
- Added support for copying header on compressed file. (CDAP-16809)
- Added support for rendering large schemas (>1000 fields) in Pipeline UI by collapsing complex schemas and lazy-load fields in record types. (CDAP-16656)
- Made the View Raw Logs and Download Logs buttons always enabled in the log viewer page. (CDAP-16616)
- Added restrictions on the maximum number of network tags for Dataproc VM to be 64. (CDAP-16593)
- Changed behavior for selecting multiple nodes in Studio to require the user to hold the key [shift] and click on the plugins (instead of holding [ctrl] and then click). (CDAP-16586)
- Improved program startup performance by using a thread pool to start a program instead of starting from a single thread. (CDAP-16521)
- Added an option to skip header in the files in delimited, csv, tsv, and text formats. (CDAP-16517)
- Reduced memory footprint for StructuredRecord, which improves overall memory consumption for pipeline execution. (CDAP-16509)
- Added an API that returns the names of input stages. (CDAP-16351)
- Replaced config.getProperties with config.getRawProperties to make sure validation happens on raw value before macros are evaluated. (CDAP-16330)
- Added macro support for Analytics plugins. (CDAP-16324)
- Reduced preview startup by 60%. Also added limit to maximum concurrent preview runs (10 by default). (CDAP-16308)
- Added ability to show dropped field operation from field level lineage page. (CDAP-16249)
- For field level lineage, added ability for user to view all fields in a cause or impact dataset (not just the related fields). (CDAP-16248)
- Unified JSON structure used by REST endpoints for fetching pipeline configuration and deploying pipelines. (CDAP-16211)
- Added ability for user to navigate to non-target dataset by selecting the header of the dataset in field level lineage. (CDAP-15894)
- Added the ability for SparkCompute and SparkSink to record field level lineage. (CDAP-15579)
- Added a page level error when the user navigates to an invalid pipeline via the URL. (CDAP-15061)
- Added support for recording field level lineage in streaming pipelines. (CDAP-13643)
Bug Fixes
- Fixed schedule properties to overwrite preferences set on the application instead of the other way around. This most visibly fixed a bug where the compute profile set on a pipeline schedule or trigger would get overwritten by the profile for the pipeline. (CDAP-16816)
- Fixed a bug where UI overwrites scale and precision properties in a schema with decimal logical type if the value is 0. (CDAP-16751)
- Fixed record schema comparison to include record name. (CDAP-16736)
- Fixed a bug where concurrent preview runs were failing because SparkConf for the new preview runs was getting populated with the configurations from the previously started in-progress preview run. (CDAP-16725)
- Fixed a bug in Wrangler that would cause it to go out of memory when sampling a Google Cloud Storage object that has a lot of rows. (CDAP-16724)
- Fixed a bug that resulted in failure to update/upsert to Google BigQuery in a different project. (CDAP-16664)
- Fixed a bug where UI incorrectly showed "No schema available" when the output of the previous stage is a macro. (CDAP-16663)
- Fixed a bug in File source that prevented reading files from Google Cloud Storage. (CDAP-16655)
- Fixed the fetch run records API to honor the limit query parameter correctly. (CDAP-16614)
- Fixed a bug that prevented a user from using parse-as-json directive in Wrangler. (CDAP-16581)
- Fixed a bug in the PluginProperties class where the internal map was modifiable. (CDAP-16538)
- Fixed Google BigQuery sink to properly allow certain types as clustering fields. (CDAP-16526)
- Fixed a bug so that pipeline stage metrics are updated correctly in the UI. (CDAP-16501)
- Fixed a bug that would leave zombie processes when using the Remote Hadoop Provisioner. (CDAP-16471)
- Fixed a bug where Wrangler database connections could show more tables than those in the configured database. (CDAP-16465)
- Fixed a bug with LimitingInputFormat that made Database source plugin fail in preview mode. (CDAP-16453)
- Fixed macro support for output schema in Google BigQuery source plugin. (CDAP-16425)
- Fixed a race condition that could cause failures when running a Spark program. (CDAP-16309)
- Fixed a bug to show master and worker memory in Google Cloud Dataproc compute profiles in GB. (CDAP-16240)
- Fixed a bug where the failure message emitted by Spark driver was not being collected. (CDAP-16055)
- Fixed a bug that caused errors when Wrangler's parse-as-csv with header was used when reading multiple small files. (CDAP-16002)
- Fixed a bug that disallowed writing to an empty Google BigQuery table without any data or schema. (CDAP-15775)
- Fixed a bug that would cause the Google BigQuery sink to fail the pipeline run if there was no data to write. (CDAP-15649)
- Fixed a bug in the custom date range picker that prevented users from setting a custom date range that is not in the current year. (CDAP-14850)
- Fixed a bug where users could not delete the entire column name in Wrangler. (CDAP-14190)
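The precedence change described in CDAP-16816 above can be pictured as a map merge in which the later source wins. The following is a minimal Python sketch of that idea, not CDAP's implementation; the helper name `resolve_properties` and the `compute.profile` key are hypothetical and chosen only for illustration.

```python
def resolve_properties(app_preferences, schedule_properties):
    """Merge two property maps so that schedule properties take precedence.

    Hypothetical helper: it illustrates the precedence order only.
    """
    merged = dict(app_preferences)      # start from application-level preferences
    merged.update(schedule_properties)  # schedule values overwrite on key clashes
    return merged


# Illustrative values; "compute.profile" is a made-up key.
app_preferences = {"compute.profile": "default", "log.level": "INFO"}
schedule_properties = {"compute.profile": "large-cluster"}

resolved = resolve_properties(app_preferences, schedule_properties)
print(resolved["compute.profile"])  # large-cluster
```

Keys that appear only in the application preferences (here `log.level`) pass through unchanged, and the input maps are not modified.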
CDAP 6.1.2
Summary
This release primarily focuses on bug fixes and performance improvements. Some of the highlights include:
-
Performance improvements
- Improved preview performance and limited concurrent preview runs to 10 by default
- Moved polling logic to the UI to avoid polling leaks in the Node.js server
- Used batch APIs in the UI to reduce the load on backend services
-
Pipeline and Plugin fixes
- Added Field Level Lineage support for streaming pipelines
- Improved the Field Level Lineage computation algorithm
- Added support for Spark 2.4
- Reduced memory consumption during pipeline execution
New Features
- Added the ability for SparkCompute and SparkSink to record field lineage. (CDAP-15579)
- Added support for Spark 2.4. (CDAP-16107)
- Added the ability to record field lineage for streaming pipelines. (CDAP-13643)
Bug Fixes
- Fixed a bug that caused errors when Wrangler's parse-as-csv with header was used when reading multiple small files. (CDAP-16002)
- Fixed the BigQuery sink to properly allow certain types as clustering fields. (CDAP-16526)
- Fixed a bug that would cause zombie processes when using the Remote Hadoop Provisioner. (CDAP-16471)
- Fixed a bug where getSchema did not work for database plugins. (CDAP-16472)
- Fixed a bug that made the DBSource plugin fail in preview mode. (CDAP-16453)
- Fixed a race condition that could cause failures when running a Spark program. (CDAP-16309)
Improvements
- Added an option to skip the header in files in delimited, csv, tsv, and text formats. (CDAP-16517)
- Added an option for the database source to replace characters in field names. (CDAP-16525)
- Reduced preview startup time by 60%. Also added a limit on the maximum number of concurrent preview runs (10 by default). (CDAP-16308)
- Reduced the memory footprint of StructuredRecord, which improves overall memory consumption during pipeline execution. (CDAP-16509)
- Introduced a new REST endpoint for fetching the scheduled time for multiple programs. (CDAP-16339)
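The cap on concurrent preview runs (CDAP-16308) can be illustrated with a bounded semaphore. This is a rough Python sketch under stated assumptions, not CDAP's implementation: the class `PreviewRunLimiter` and its methods are hypothetical, and the notes do not say whether runs beyond the limit are rejected or queued, so this sketch simply refuses them.

```python
import threading


class PreviewRunLimiter:
    """Caps the number of preview runs that may be active at once (hypothetical)."""

    def __init__(self, max_concurrent=10):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_start(self):
        """Claim a slot without blocking; returns False when the cap is reached."""
        return self._slots.acquire(blocking=False)

    def finish(self):
        """Release a slot when a preview run completes."""
        self._slots.release()


limiter = PreviewRunLimiter(max_concurrent=10)
started = [limiter.try_start() for _ in range(11)]
print(started.count(True))  # 10
print(started[-1])          # False
limiter.finish()
print(limiter.try_start())  # True
```

A queueing variant would call `acquire()` with blocking enabled instead of returning `False`.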
CDAP 6.1.1
Summary
This release introduces a number of new features, improvements, and bug fixes to CDAP. Some of the main highlights of the release are:
-
Pipeline improvements
- Validation checks for plugins for early error detection and prevention
- New widgets for better pipeline configurability
- Wrangler ADLS connection
-
Field Level Lineage
- New, intuitive UI for field level lineage
- Field level lineage support for more plugins
-
Platform enhancements
- Performance improvements across the platform
- Migration of more UI components from Angular to React
New Features
- Added field level lineage support for Error Transform. (CDAP-16102)
- Added region support for Google Cloud plugins. (CDAP-16037)
- New UI landing page. (CDAP-15795)
- Allowed plugin developers to define filters to show or hide properties based on custom plugin configuration logic. (CDAP-15789)
- Introduced new FailureCollector APIs for a better user experience via contextual error messages. (CDAP-15787)
- Added support for reading INT96 types in Parquet file sources. (CDAP-15767)
- New ConfigurationGroup component in the UI. (CDAP-15728)
- Added support for pipelines to run in a shared VPC network. (CDAP-15723)
- Added stage level validation for plugin properties. (CDAP-15619)
- Added a new REST endpoint that retrieves all field lineage information about a dataset. (CDAP-15482)
- Added support for bytes types in the BigQuery sink. (CDAP-15342)
Deprecation
- Removed the outdated Validator plugin. (CDAP-15917)
Bug Fixes
- Fixed the preview run state after a JVM restart. (CDAP-16193)
- Content type detection now uses case-insensitive file extensions. (CDAP-16146)
- Fixed a bug that prevented users from navigating to the pipeline studio, with the UI indicating that system artifacts were being loaded for a long time. (CDAP-16137)
- Fixed the Dataproc provisioner to log the error message if the Dataproc creation operation fails. (CDAP-15973)
- Fixed a bug that caused pipeline startup to take longer than needed for cloud runs. (CDAP-15899)
- Fixed regex usage in GCS and S3 source plugins. (CDAP-15879)
- Fixed a bug where the Datastore source was overly restrictive when validating the user-provided schema. (CDAP-15878)
- Fixed a bug that could cause a thread to spin in an infinite while loop due to multiple consumer threads on a queue that allows only a single consumer. (CDAP-15809)
- Fixed a bug that caused pipeline failures when writing nullable byte fields as JSON. (CDAP-15770)
- Fixed a bug that caused MapReduce and Spark logs to be missing for remote pipeline runs. (CDAP-15757)
- Fixed a race condition that could cause a program to get stuck in the pending state when stopped in the pending state. (CDAP-15747)
- Added some safeguards to prevent cloud pipeline runs from getting stuck in certain edge cases. (CDAP-15742)
- Fixed a bug where secure macros were not evaluated in preview mode. (CDAP-15726)
- Fixed a bug in the BigQuery source that caused automatic bucket creation to fail if the dataset is in a different project. (CDAP-15617)
- Fixed a bug in the new user tour on lower resolution screens. (CDAP-15583)
- Fixed a bug where the wrong resolution was used if a time range was specified for a metrics query. (CDAP-15554)
- Fixed an issue where BigQuery multi sink did not work when using an Oracle database as a source. (CDAP-15535)
- Fixed the Dataproc provisioner to disable YARN pre-emptive container killing and to disable conscrypt. (CDAP-15498)
- Fixed a bug in the MLPredictor plugin that caused an error when using a classification model. (CDAP-15445)
- Fixed a bug that didn't allow users to paste a schema as a runtime argument. (CDAP-15423)
- Spark pipelines no longer try to run sinks in parallel unless the runtime argument 'pipeline.spark.parallel.sinks.enabled' is set to 'true'. This prevents pipeline sections from being re-processed in the majority of situations. (CDAP-15388)
- Fixed the Dataproc provisioner to handle networks that do not use automatic subnet creation. (CDAP-15373)
- Fixed a Wrangler bug where the wrong JDBC driver would be used in some situations and where required classes could be unavailable. (CDAP-15353)
- Fixed a bug in artifact version comparison. (CDAP-15221)
- Fixed a bug where the rollup of the workflow lineage did not remove the local datasets. (CDAP-15206)
- Expanded the filename formats that the UI accepts when uploading artifacts. (CDAP-15097)
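The case-insensitive extension handling mentioned in CDAP-16146 above comes down to normalizing the extension before the lookup. The sketch below shows the general technique in Python; the `CONTENT_TYPES` table and the `detect_content_type` function are hypothetical and do not reflect the actual CDAP code.

```python
import os

# Hypothetical extension-to-content-type table, for illustration only.
CONTENT_TYPES = {
    ".csv": "text/csv",
    ".json": "application/json",
    ".txt": "text/plain",
}


def detect_content_type(filename, default="application/octet-stream"):
    """Look up a content type using the lowercased file extension."""
    _, extension = os.path.splitext(filename)
    return CONTENT_TYPES.get(extension.lower(), default)


print(detect_content_type("report.CSV"))   # text/csv
print(detect_content_type("data.Json"))    # application/json
print(detect_content_type("archive.bin"))  # application/octet-stream
```

Without the `lower()` call, `report.CSV` would miss the `.csv` entry and fall through to the default type.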
Improvements
- Fixed batch pipeline preview to read only the preview records instead of the full input. (CDAP-16110)
- Greatly reduced the time it takes to calculate field level lineage. (CDAP-16069)
- Set Spark as the default execution engine for batch pipelines. (CDAP-15983)
- Improved the error message for csv, tsv, and delimited formats when the schema has fewer fields than the data. (CDAP-15794)
- Added support to automatically fill field level lineage for plugins that do not emit any. (CDAP-15782)
- Upgraded the Node.js version from 8.x to 10.16.2. (CDAP-15738)
- Added support to restore preview status after restart. (CDAP-15677)
- Routed users directly to the pipeline's detail page from the pipeline card in Control Center. (CDAP-15659)
- New user experience for log level selection. (CDAP-15489)
- Added image version as a configuration setting to the Dataproc provisioner. (CDAP-15265)
- Improved the way pipelines with macros that are provided by intermediate stages run. (CDAP-16076)
CDAP 6.0.0
Summary
This release introduces a number of new features, improvements, bug fixes, and feature removals to CDAP. Some of the main highlights of the release are:
-
Portable CDAP Runtime
- Provide a runtime architecture for CDAP to support both Hadoop and Hadoopless environments, such as Kubernetes, in a distributed and secure fashion.
-
Storage SPIs
- Provide an abstraction for all CDAP system storage so that CDAP is more portable across runtime environments - Hadoop or Hadoop-free environments.
-
Pipeline Enhancements
- Improve the experience of building pipelines with the help of features such as copy & paste and a minimap of the pipeline.
Please note that the upgrade capability of CDAP is not supported in this release. Please refer to the list of incompatible changes.
New Features
- Added Google Cloud Storage copy and move action plugins. (CDAP-14330)
- New pipeline list user interface. (CDAP-14533)
- Added a minimap to the pipeline canvas. (CDAP-14613)
- Added support for running CDAP system services in a Kubernetes environment. (CDAP-14645)
- Added the ability to copy and paste a node in pipeline studio. (CDAP-14657)
- Added the ability to limit the number of concurrent pipeline runs. (CDAP-15058)
- Added support for toggling Stackdriver integration in a Google Cloud Dataproc cluster. (CDAP-15095)
- Added support for Numeric and Array types in Google BigQuery plugins. (CDAP-15256)
- Added support for showing decimal field types in plugin schemas in pipeline view. (CDAP-15339)
Improvements
- Added support for CDH 5.15. (CDAP-13632)
- Revamped the top navbar for the CDAP UI based on material design. (CDAP-14653)
- Secure store supports integration with other KMS systems such as Google Cloud KMS using new Secure Store SPIs. (CDAP-14667)
- Improved CDAP Master logging of events related to programs that it launches. (CDAP-7208)
- Used a shared thread pool for provisioning tasks to increase thread utilization. (CDAP-14343)
- Improved performance of the LevelDB backed Table implementation. (CDAP-14569)
- Wrangler supports secure macros in connections. (CDAP-14571)
- Significantly improved performance of the Transactional Messaging System. (CDAP-14617)
- Added early validation for the properties of the Google BigQuery sink to fail during pipeline deployment instead of at runtime. (CDAP-14821)
- Improved the error message when a null value is read for a non-nullable field in avro file sources. (CDAP-14823)
- Improved loading of system artifacts to load in parallel instead of sequentially. (CDAP-15047)
- Improved the Google Cloud Dataproc provisioner to allow configuring the default projectID from CDAP configuration. (CDAP-15059)
- Added support for using runtime arguments to pass in extra configurations for the Google Cloud Dataproc provisioner. (CDAP-15318)
- Added support for spaces in file paths for the Google Cloud Storage plugin. (CDAP-14579)
- Google BigQuery source now validates the schema when the pipeline is deployed. (CDAP-14897)
Bug Fixes
- Fixed a casting bug for the DB source where unsigned integer columns were incorrectly being treated as integers instead of longs. (CDAP-12211)
- Removed the need for ZooKeeper for service discovery in the remote runtime environment. (CDAP-13410)
- Fixed an issue with recording lineage for realtime sources. (CDAP-7230)
- Fixed the dynamic Spark plugin to use the appropriate context classloader for loading dynamic Spark code. (CDAP-12941)
- Fixed a bug that caused MapReduce pipelines to fail when using too many macros. (CDAP-13554)
- Fixed an issue that caused pipelines with too many macros to fail when running in MapReduce. (CDAP-13982)
- Fixed an issue with publishing metadata changes for profile assignments. (CDAP-14666)
- Fixed a bug that would cause workspace ids to clash when wrangling items of the same name. (CDAP-14691)
- Fixed a bug in secure store caused by breaking changes in Java update 171. Users should be able to get secure keys on Java 8u171. (CDAP-14702)
- Fixed a bug that caused Google Cloud Dataproc clusters to fail provisioning if a firewall rule that denies ingress traffic existed in the project. (CDAP-14708)
- Fixed a bug that would cause data preparation to fail when preparing a large file in Google Cloud Storage. (CDAP-14709)
- Fixed a bug that caused action-only pipelines to fail when running using a cloud profile. (CDAP-14724)
- Fixed an issue with adding business tags to an entity. (CDAP-14744)
- Fixed an issue in handling metadata search parameters. (CDAP-14778)
- Fixed a bug that would cause pipelines to fail on remote clusters if the very first pipeline run was an action-only pipeline. (CDAP-14779)
- Fixed the standard deviation aggregate functions to work, even if there is only one element in a group. (CDAP-14857)
- Fixed a bug in the Google BigQuery sink that would cause pipelines to fail when writing to a dataset in a different region. (CDAP-14951)
- Fixed a race condition in processing profile assignments. (CDAP-15001)
- Fixed an issue that could cause inconsistencies in metadata. (CDAP-15013)
- Fixed an issue with displaying workspace metadata in the UI. (CDAP-15069)
- Fixed a race condition in the remote runtime scp implementation that could cause the process to hang. (CDAP-15127)
- Fixed an issue with metadata search result pagination. (CDAP-15196)
- Fixed a bug in Wrangler DB connections where a bad JDBC driver could stay in the cache for 60 minutes, making the DB connection unusable. (CDAP-15223)
- Fixed a NullPointerException in the Google Cloud Dataproc provisioner when no network was configured. (CDAP-15249)
- Fixed a bug that caused some aggregator and joiner keys to be dropped if they hashed to the same value as another key. (CDAP-15299)
- Fixed a bug where the RuntimeMonitor did not reconnect through SSH correctly, causing failures in monitoring the correct program state. (CDAP-15332)
- Fixed Google Cloud Dataproc runtime for Google Cloud Platform projects where OS Login is enabled. (CDAP-15369)
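The unsigned-integer casting fix (CDAP-12211) above reflects a general property of fixed-width integers: an unsigned 32-bit column can hold values that do not fit in a signed 32-bit integer, so a 64-bit long is needed to represent them. The Python sketch below demonstrates the wraparound; it is only an illustration of the arithmetic and is not taken from the DB source plugin, and the helper `as_signed_32` is hypothetical.

```python
import struct

INT32_MAX = 2**31 - 1   # largest value of a signed 32-bit integer
UINT32_MAX = 2**32 - 1  # largest value of an unsigned 32-bit integer


def as_signed_32(value):
    """Reinterpret the low 32 bits of `value` as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


unsigned_value = 3_000_000_000  # valid for an unsigned 32-bit column
print(unsigned_value > INT32_MAX)    # True
print(as_signed_32(unsigned_value))  # -1294967296
print(as_signed_32(UINT32_MAX))      # -1
```

Values up to `INT32_MAX` survive the reinterpretation unchanged, which is why this kind of bug only shows up for large column values.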
Deprecated and Removed Features
- Deprecated HDFSMove and HDFSDelete plugins from core plugins. (CDAP-15241)
- Removed Streams and Stream Views, which were deprecated in CDAP 5.0. (CDAP-14591)
- Removed Flow, which was deprecated in CDAP 5.0. (CDAP-14592)
- Removed deprecated HDFSSink Plugin. (CDAP-14529)
- Removed the plugin endpoints feature to prevent execution of plugin code in the CDAP master. Endpoints were only used for schema propagation, which has moved to the pipeline system service. (CDAP-14772)
- Removed the support for custom routing for user services. (CDAP-14886)
CDAP 5.1.2
Improvements
- Improved performance of Apache Spark pipelines that write to multiple sinks. (CDAP-13430)
Bug Fixes
- Fixed a bug where pipeline checkpointing is always on regardless of the value set by the user in realtime pipeline. (CDAP-14558)
- Fixed a bug where artifacts could not be uploaded through UI. (CDAP-14578)
CDAP 4.3.5
New Features
- Added support for Apache Spark 2.3. (CDAP-13653)
Improvements
- Improved performance of Apache Spark pipelines that write to multiple sinks. (CDAP-13430)
Bug Fixes
- Fixed macro enabled properties in plugin configuration to only have macro behavior if the entire value is a macro. (CDAP-13331)
- Fixed a bug where the upgrade tool did not upgrade the owner meta table. (CDAP-13372)
- Fixed a bug where pipelines with conditions on different branches could not be deployed. (CDAP-13463)
- Fixed an issue that prevented user runtime arguments from being used in CDAP programs. (CDAP-13532)
- Fixed a bug where, under some race conditions, running a pipeline preview could cause the CDAP process to shut down. (CDAP-13593)
- Fixed a bug that could prevent CDAP startup in case the metadata tables were disabled. (CDAP-14019)
- Fixed a bug to turn off pipeline checkpointing based on the config for a realtime pipeline. (CDAP-14558)