Releases: cdapio/cdap
Cask Data Application Platform v3.0.0
New Features
- Support for Application Templates has been added (CDAP-1753).
- Built-in ETL Application Templates and Plugins have been added (CDAP-1767).
- New CDAP UI, supports creating ETL applications directly in the web UI.
- Workflow logs can now be retrieved using the CDAP HTTP Logging RESTful API (CDAP-1089).
- Support has been added for suspending and resuming of a Workflow (CDAP-1610).
- Condition nodes in a Workflow now allow branching based on a boolean predicate (CDAP-1928).
- Condition nodes in a Workflow now allow passing the Hadoop counters from a MapReduce program to following Condition nodes in the Workflow (CDAP-1611).
- Logs can now be fetched based on the run-id (CDAP-1582).
- CDAP Tables are now explorable (CDAP-946).
- The CDAP CLI supports the new Application Template and Adapters APIs (CDAP-1773).
- The CDAP CLI startup options have been changed to accommodate a new option of executing a file containing a series of CLI commands, line-by-line.
- Both grok and syslog record formats can now be used when setting the format of a Stream (CDAP-1949).
- Added HTTP RESTful endpoints for listing Datasets and Streams as used by Adapters, Programs, and Applications, and vice-versa (CDAP-2214).
- Created a queue introspection tool, for counting processed and unprocessed entries in a Flowlet queue (CDAP-2105).
- Support for CDAP SDK VM build automation has been added (CDAP-2030).
- A Cube Dataset has been added (CDAP-1520).
- A Batch and realtime Cube dataset sink has been added (CDAP-1520).
- Metrics and status information for MapReduce on a task level is now exposed (CDAP-1520).
- The Metrics system APIs have been revised and improved (CDAP-1596).
- The Metrics system performance has been improved (CDAP-2124), (CDAP-2125).
Bug Fixes
- The CDAP Authentication server now reports the port correctly when the port is set to 0 (CDAP-614).
- History of the programs running under a Workflow (Spark and MapReduce) is now updated correctly (CDAP-1293).
- Programs running under a Workflow now receive a unique run-id (CDAP-2025).
- RunRecords are now updated with the RuntimeService to account for node failures (CDAP-2202).
- MapReduce metrics are now available on a secure cluster (CDAP-64).
Deprecated and Removed Features
- The File DropZone and File Tailer are no longer supported as of Release 3.0.
- Support for Procedures has been removed. After upgrading, an Application that contained a Procedure must be redeployed.
- Support for Service Workers has been removed. After upgrading, an Application that contained a Service Worker must be redeployed.
- The Old CDAP Console has been deprecated.
- Support for JDK/JRE 1.6 (Java 6) has ended; JDK/JRE 1.7 (Java 7) is now required for CDAP Distributed or the CDAP SDK.
Cask Data Application Platform v2.8.0
General
- The HTTP RESTful API v2 is deprecated, replaced with the namespaced HTTP RESTful API v3.
- Added log rotation for CDAP programs running in YARN containers (CDAP-1295).
- Added the ability to submit to non-default YARN queues to provide resource guarantees for CDAP Master Services, CDAP Programs, and Explore Queries (CDAP-1417).
- Added the ability to prune invalid transactions (CDAP-1540).
- Added the ability to specify a custom logback file for CDAP programs (CDAP-1100).
- System HTTP services now bind to all interfaces (0.0.0.0), rather than 127.0.0.1.
New Features
- Command Line Interface (CLI)
- CLI can now directly connect to a CDAP instance of your choice at startup by using "cdap-cli.sh --uri <uri>".
- Support for runtime arguments, which can be listed by running "cdap-cli.sh --help".
- Table rendering can be configured using "cli render as <alt|csv>". The option "alt" is the default, with "csv" available for copy & pasting.
- Stream statistics can be computed using "get stream-stats <stream-id>".
- Datasets
- Added an ObjectMappedTable Dataset that maps object fields to table columns and that is also explorable.
- Added a PartitionedFileSet Dataset that allows addressing files by metadata and that is also explorable.
- Table Datasets now support a multi-get operation for batched reads.
- Allow an unchecked Dataset upgrade upon application deployment (CDAP-1574).
- Metrics
- Added new APIs for exploring available metrics, including drilling down into the context of emitted metrics
- Added the ability to explore (search) all metrics; previously, this was restricted to custom user metrics
- There are new APIs for querying metrics
- New capability to break down a metrics time series using the values of one or more tags in its context
- Namespaces
- Applications and Programs are now managed within namespaces.
- Application logs are available within namespaces.
- Metrics are now collected and queried within namespaces.
- Datasets can now be created and managed within namespaces.
- Streams are now namespaced for ingestion, fetching, and consuming by programs.
- Explore operations are now namespaced.
- Preferences
- Users can store preferences (a property map) at the instance, namespace, application, or program level.
- Spark
- Spark programs are now specified using a configurer-style API (CDAP-382).
- Workflows
- Users can schedule a Workflow based on increments of data being ingested into a Stream.
- Workflows can be stopped.
- The execution of a Workflow can be forked into parallelized branches.
- The runtime arguments for a Workflow can be scoped.
- Workers
- Added Worker, a new Program type that can be added to CDAP Applications, used to run background processes and (beta feature) can write to Streams through the WorkerContext.
- Upgrade and Data Migration Tool
- Added an automated upgrade tool which supports upgrading from 2.6.x to 2.8.0. (Note: Apps need to be both recompiled and re-deployed). Upgrade from 2.7.x to 2.8.0 is not currently supported. If you have a use case for it, please reach out to us at [email protected].
- Added a metric migration tool which migrates old metrics to the new 2.8 format.
Improvement
- Improved Flow performance and scalability with a new distributed queue implementation.
API Changes
- The endpoint (GET <base-url>/data/explore/datasets/<dataset-name>/schema) that retrieved the schema of a Dataset's underlying Hive table has been removed (CDAP-1603).
- Endpoints have been added to retrieve the CDAP version and the current configurations of CDAP and HBase.
Known Issues
- If the Hive Metastore is restarted while the CDAP Explore Service is running, the Explore Service remains alive, but becomes unusable. To correct, restart the CDAP Master, which will restart all services (CDAP-1007).
- User datasets with names starting with "system" can potentially cause conflicts (CDAP-1587).
- Scaling the number of metrics processor instances doesn't automatically distribute the processing load to the newer instances of the metrics processor. The CDAP Master needs to be restarted to effectively distribute the processing across all metrics processor instances (CDAP-1853).
- Creating a dataset in a non-existent namespace manifests in the RESTful API with an incorrect error message (CDAP-1864).
- Retrieving multiple metrics (by issuing an HTTP POST request with a JSON list as the request body that enumerates the name and attributes for each metric) is currently not supported in the Metrics HTTP RESTful API v3. Instead, use the v2 API. It will be supported in a future release.
- Typically, Datasets are bundled as part of Applications. When an Application is upgraded and redeployed, any changes in Datasets will not be redeployed. This is because Datasets can be shared across applications, and an incompatible schema change can break other applications that are using the Dataset. A workaround (CDAP-1253) is to allow unchecked Dataset upgrades. Upgrades cause the Dataset metadata, i.e. its specification including properties, to be updated. The Dataset runtime code is also updated. To prevent data loss, the existing data and the underlying HBase tables remain as-is.
You can allow unchecked Dataset upgrades by setting the configuration property dataset.unchecked.upgrade to true in cdap-site.xml. This will ensure that Datasets are upgraded when the Application is redeployed. When this configuration is set, the recommended process to deploy an upgraded Dataset is to first stop all Applications that are using the Dataset before deploying the new version of the Application. This lets all containers (Flows, Services, etc.) pick up the new Dataset changes. When Datasets are upgraded using dataset.unchecked.upgrade, no schema compatibility checks are performed by the system. Hence it is very important that the developer verifies backward-compatibility and makes sure that other Applications that are using the Dataset can work with the new changes.
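As a rough sketch of the configuration change described above, the following Python snippet sets dataset.unchecked.upgrade to true in the text of a cdap-site.xml file. It assumes the file uses the Hadoop-style configuration/property/name/value layout; the helper name set_property is made up for illustration and is not part of CDAP.

```python
import xml.etree.ElementTree as ET


def set_property(site_xml: str, name: str, value: str) -> str:
    """Return the XML text with property `name` set to `value`.

    Assumption: the document root is <configuration> and each setting is a
    <property> element holding a <name> and a <value> child.
    """
    root = ET.fromstring(site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Property already present: overwrite (or create) its value.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property not found: append a new <property> entry.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    original = "<configuration></configuration>"
    print(set_property(original, "dataset.unchecked.upgrade", "true"))
```

Calling the helper a second time with the same property name updates the existing entry rather than adding a duplicate.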
Cask Data Application Platform v2.6.2
New Features
- Added log rotation for CDAP programs running in YARN containers (CDAP-1295).
- Added the ability to submit to non-default YARN queues to provide resource guarantees for CDAP Master Services, CDAP Programs, and Explore Queries (CDAP-1417).
- Added the ability to prune invalid transactions (CDAP-1540).
- Added the ability to specify a custom logback file for CDAP programs (CDAP-1741).
Known Issues
- See also the Known Issues of version 2.6.1.
- CDAP works only with node.js versions 0.8.16 through 0.10.36.
- When the CDAP CLI starts up, it auto-connects to localhost. After a "connect <hostname>" command is issued from within the CLI, all operations will work except for Explore queries (the command "execute 'query'"), as the Explore Client doesn't pick up the change of hostname. A workaround is to start up the CLI with the environment variable CDAP_HOST set to the desired hostname, so that the CLI autoconnects to that host on startup. This has been fixed in an upcoming release (2.8.0) of CDAP.
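The supported node.js range listed above (0.8.16 through 0.10.36) can be checked before installation with a small script. This is a minimal sketch: the function names are hypothetical, and it assumes the version string looks like the output of "node --version", with an optional leading "v".

```python
MIN_NODE = (0, 8, 16)   # lowest supported version, per the note above
MAX_NODE = (0, 10, 36)  # highest supported version, per the note above


def parse_version(text):
    """Turn 'v0.10.36' or '0.10.36' into a tuple of ints such as (0, 10, 36)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def node_version_supported(version_text):
    """Return True if the version falls inside the supported range (inclusive)."""
    return MIN_NODE <= parse_version(version_text) <= MAX_NODE
```

Tuples compare element by element, so 0.10.x correctly sorts after 0.8.x, which a plain string comparison would get wrong.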
Cask Data Application Platform v2.7.1
API Changes
- The property security.auth.server.address has been deprecated and replaced with security.auth.server.bind.address CDAP-639, CDAP-1078.
New Features
- Spark
- Security
- CDAP Master now obtains and refreshes Kerberos tickets programmatically CDAP-1134.
- Datasets
- A new, experimental dataset type to support time-partitioned File sets has been added.
- Time-partitioned File sets can be queried with Impala on CDH distributions CDAP-926.
- Streams can be made queryable with Impala by deploying an adapter that periodically converts it into partitions of a time-partitioned File set CDAP-1129.
- Support for different levels of conflict detection: ROW, COLUMN, or NONE CDAP-1016.
- Removed support for @DisableTransaction CDAP-1279.
- Support for annotating a Stream with a schema CDAP-606.
- A new API for uploading entire files to a Stream has been added CDAP-411.
- Workflow
- Workflows are now specified using a configurer-style API CDAP-1207.
- Multiple instances of a Workflow can be run concurrently CDAP-513.
- Programs are no longer part of a Workflow; instead, they are added in the application and are referenced by a Workflow using their names CDAP-1116.
- Schedules are now at the application level and properties can be specified for Schedules; these properties will be passed to the scheduled program as runtime arguments CDAP-1148.
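The conflict-detection levels listed under Datasets above (ROW, COLUMN, NONE) determine how finely concurrent writes are compared. The toy model below only illustrates that idea and is not CDAP's transaction code: it assumes each transaction's writes are a list of (row, column) pairs, and the function names are invented for this sketch.

```python
def conflict_keys(writes, level):
    """Reduce a list of (row, column) writes to the keys compared at `level`."""
    if level == "ROW":
        # Any two writes to the same row collide, whatever the column.
        return {row for row, _column in writes}
    if level == "COLUMN":
        # Only writes to the same cell (row and column) collide.
        return set(writes)
    if level == "NONE":
        # Conflict detection is switched off: nothing ever collides.
        return set()
    raise ValueError("unknown conflict detection level: %r" % (level,))


def conflicts(writes_a, writes_b, level):
    """Return True if two transactions' writes overlap at the given level."""
    return bool(conflict_keys(writes_a, level) & conflict_keys(writes_b, level))


if __name__ == "__main__":
    tx1 = [("user1", "name")]
    tx2 = [("user1", "email")]
    for level in ("ROW", "COLUMN", "NONE"):
        print(level, conflicts(tx1, tx2, level))
```

In this model, two transactions that touch different columns of the same row conflict under ROW but not under COLUMN, which is the trade-off between stricter isolation and fewer aborted transactions.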
Known Issues
- See also the Known Issues of version 2.6.1.
- When upgrading an existing CDAP installation to 2.7.0, all metrics are reset.
Cask Data Application Platform v2.6.1
Release Notes
CDAP Bug Fixes
- Allow an unchecked Dataset upgrade upon application deployment CDAP-1253.
- Update the Hive Dataset table when a Dataset is updated CDAP-71.
- Use Hadoop configuration files bundled with the Explore Service CDAP-1250.
Known Issues
See also the Known Issues of version 2.6.0.
Typically, Datasets are bundled as part of Applications. When an Application is upgraded and redeployed, any changes in Datasets will not be redeployed. This is because Datasets can be shared across applications, and an incompatible schema change can break other applications that are using the Dataset. A workaround (CDAP-1253) is to allow unchecked Dataset upgrades. Upgrades cause the Dataset metadata, i.e. its specification including properties, to be updated. The Dataset runtime code is also updated. To prevent data loss, the existing data and the underlying HBase tables remain as is.
You can allow unchecked Dataset upgrades by setting the configuration property dataset.unchecked.upgrade to true in cdap-site.xml. This will ensure that Datasets are upgraded when the Application is redeployed. When this configuration is set, the recommended process to deploy an upgraded Dataset is to first stop all Applications that are using the Dataset before deploying the new version of the Application. This lets all containers (Flows, Services, etc.) pick up the new Dataset changes. When Datasets are upgraded using dataset.unchecked.upgrade, no schema compatibility checks are performed by the system. Hence it is very important that the developer verifies backward-compatibility and makes sure that other Applications that are using the Dataset can work with the new changes.
Cask Data Application Platform v2.6.0
API Changes
- API for specifying Services and MapReduce Jobs has been changed to use a "configurer"
style; this will require modification of user classes implementing either MapReduce
or Service as the interfaces have changed (CDAP-335).
New Features
General
- Health checks are now available for CDAP system services
(CDAP-663).
Applications
- Jar deployment now uses a chunked request and writes to a local temp file
(CDAP-91).
MapReduce
- MapReduce jobs can now read binary stream data
(CDAP-331).
Datasets
Spark
- Spark programs now emit system and custom user metrics (CDAP-346).
- Services can be called from Spark programs and their worker nodes (CDAP-348).
- Spark programs can now read from Streams (CDAP-403).
- Added Spark support to the CDAP CLI (Command-line Interface) (CDAP-425).
- Improved speed of Spark unit tests (CDAP-600).
- Spark Programs now display system metrics in the CDAP Console (CDAP-652).
Procedures
- Procedures have been deprecated in favor of Services
(CDAP-413).
Services
- Added an HTTP endpoint that returns the endpoints a particular Service exposes (CDAP-412).
- Added an HTTP endpoint that lists all Services (CDAP-469).
- Default metrics for Services have been added to the CDAP Console (CDAP-512).
- The annotations @QueryParam and @DefaultValue are now supported in custom Service handlers (CDAP-664).
Metrics
- System and User Metrics now support gauge metrics (CDAP-484).
- Metrics can be queried using a Program’s run-ID (CDAP-620).
Documentation
- A Quick Start Guide has been added to the CDAP Administration Manual (CDAP-695).
CDAP Bug Fixes
- Fixed a problem with readless increments not being used when they were enabled in a Dataset (CDAP-383).
- Fixed a problem with applications whose Spark or Scala user classes were not extended from either JavaSparkProgram or ScalaSparkProgram failing with a class loading error (CDAP-599).
- Fixed a problem with the CDAP upgrade tool not preserving, for tables with readless increments enabled, the coprocessor configuration during an upgrade (CDAP-1044).
- Fixed a problem with the readless increment implementation dropping increment cells when a region flush or compaction occurred (CDAP-1062).
Known Issues
- When running secure Hadoop clusters, metrics and debug logs from MapReduce programs are not available (CDAP-64 and CDAP-797).
- When upgrading a cluster from an earlier version of CDAP, warning messages may appear in the master log indicating that in-transit (emitted, but not yet processed) metrics system messages could not be decoded (Failed to decode message to MetricsRecord). This is because of a change in the format of emitted metrics, and can result in a small amount of metrics data points being lost (CDAP-745).
- Writing to datasets through Hive is not supported in CDH4.x (CDAP-988).
- A race condition resulting in a deadlock can occur when a TwillRunnable container shuts down while it still has Zookeeper events to process. This occasionally surfaces when running with OpenJDK or JDK7, though not with Oracle JDK6. It is caused by a change in the ThreadPoolExecutor implementation between Oracle JDK6 and OpenJDK/JDK7. Until Twill is updated in a future version of CDAP, a work-around is to kill the errant process. The YARN command to list all running applications and their app-ids is
yarn application -list -appStates RUNNING
The command to kill a process is
yarn application -kill <app-id>
All versions of CDAP running Twill version 0.4.0 with this configuration can exhibit this problem (TWILL-110).
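The two YARN commands in the work-around above can be combined in a script. The sketch below makes two assumptions: that application IDs follow the usual application_<digits>_<digits> pattern, and that the captured output of the list command is passed in as text. The helper names and the sample output are made up for illustration, and choosing which application is the errant one remains a manual decision.

```python
import re

# Assumed shape of a YARN application ID, e.g. application_1420000000000_0007.
APP_ID = re.compile(r"\bapplication_\d+_\d+\b")


def running_app_ids(list_output):
    """Extract application IDs from `yarn application -list` output.

    IDs are returned in order of first appearance, without duplicates.
    """
    seen = []
    for app_id in APP_ID.findall(list_output):
        if app_id not in seen:
            seen.append(app_id)
    return seen


def kill_command(app_id):
    """Build the argument list for killing one application."""
    return ["yarn", "application", "-kill", app_id]


if __name__ == "__main__":
    # Made-up sample; real output has more columns and header lines.
    sample = (
        "Application-Id\tApplication-Name\tState\n"
        "application_1420000000000_0007\texample.app\tRUNNING\n"
    )
    for app_id in running_app_ids(sample):
        print(" ".join(kill_command(app_id)))
```

The argument list returned by kill_command can be passed to subprocess.run once the errant application has been identified.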
Cask Data Application Platform v2.5.2
Release Notes
CDAP Bug Fixes
- Fixed a problem with a Coopr-provisioned secure cluster failing to start due to a classpath issue CDAP-478.
- Fixed a problem with the WISE app zip distribution not packaged correctly; a new version (0.2.1) has been released CDAP-533.
- Fixed a problem with the examples and tests incorrectly using the ByteBuffer.array method when reading a Stream event CDAP-549.
- Fixed a problem with the Authentication Server so that it can now communicate with an LDAP instance over SSL CDAP-556.
- Fixed a problem with the program class loader to allow applications to use a different version of a library than the one that the CDAP platform uses; for example, a different Kafka library CDAP-559.
- Fixed a problem with CDAP master not obtaining new delegation tokens after running for hbase.auth.key.update.interval milliseconds CDAP-562.
- Fixed a problem with the transaction not being rolled back when a user service handler throws an exception CDAP-607.
Other Changes
- Improved the CDAP documentation:
- Re-organized the documentation into three manuals—Developers’ Manual, Administration Manual, Reference Manual—and a set of examples, how-to guides and tutorials;
- Documents are now in smaller chapters, with numerous updates and revisions;
- Added a link for downloading an archive of the documentation for offline use;
- Added links to examples relevant to a particular component;
- Added suggested deployment architectures for Distributed CDAP installations;
- Added a glossary;
- Added navigation aids at the bottom of each page; and
- Tested and updated the Standalone CDAP examples and their documentation.
Known Issues
- Currently, applications that include Spark or Scala classes in user classes not extended from either JavaSparkProgram or ScalaSparkProgram (depending upon the language) fail with a class loading error. Spark or Scala classes should not be used outside of the Spark program. CDAP-599
- See Known Issues of the previous release, version 2.5.0.
Cask Data Application Platform v2.5.1
Release Notes
CDAP Bug Fixes
- Improved the documentation of the CDAP Authentication and Stream Clients, both Java and Python APIs.
- Fixed problems with the CDAP Command Line Interface (CLI):
- Did not work in non-interactive mode;
- Printed excessive debug log messages;
- Relative paths did not work as expected; and
- Failed to execute SQL queries.
- Removed dependencies on SNAPSHOT artifacts for netty-http and auth-clients.
- Corrected an error in the message printed by the startup script cdap.sh.
- Resolved a problem with the reading of the properties file by the CDAP Flume Client of CDAP Ingest library without first checking if authentication was enabled.
Other Changes
- The scripts send-query.sh, access-token.sh and access-token.bat have been replaced by the CDAP Command Line Interface, cdap-cli.sh.
- The CDAP Command Line Interface now uses and saves access tokens when connecting to a secure CDAP instance.
- The CDAP Java Stream Client now allows empty String events to be sent.
- The CDAP Python Authentication Client's configure() method now takes a dictionary rather than a filepath.
Known Issues
See Known Issues of the previous release, version 2.5.0.
Cask Data Application Platform v2.5.0
Release Notes
New Features
Ad-hoc querying
- Capability to write to Datasets using SQL
- Added a CDAP JDBC driver allowing connections from Java applications and third-party business intelligence tools
- Ability to perform ad-hoc queries from the CDAP Console:
- Execute a SQL query from the Console
- View list of active, completed queries
- Download query results
Datasets
- Datasets can be tested with TestBase outside of the context of an Application
- CDAP now checks Datasets for compatibility in a verification stage
- The Transaction engine uses server-side filtering for efficient transactional reads
- Dataset specifications can now be dynamically reconfigured through the use of RESTful endpoints
- The Bundle jar format is now used for Dataset libs
- Increments on Datasets are now read-less
Services
- Added simplified APIs for using Services from other programs such as MapReduce, Flows and Procedures
- Added an API for creating Services and handlers that can use Datasets transactionally
- Added a RESTful API to make requests to a Service via the Router
Security
- Added authorization logging
- Added Kerberos authentication to Zookeeper secret keys
- Added support for SSL
Spark Integration
- Supports running Spark programs as a part of CDAP applications in Standalone mode
- Supports running Spark programs written with Spark versions 1.0.1 or 1.1.0
- Supports Spark's MLlib and GraphX modules
- Includes three examples demonstrating CDAP Spark programs
- Adds display of Spark program logs and history in the CDAP Console
Streams
- Added a collection of applications, tools and APIs specifically for the ETL (Extract, Transform, and Load) of data
- Added support for asynchronously writing to Streams
Clients
- Added a Command-line Interface
- Added a Java Client Interface
Major CDAP Bug Fixes
- Fixed a problem with a HADOOP_HOME exception stacktrace when unit-testing an Application
- Fixed an issue with Hive creating directories in /tmp in the Standalone and unit-test frameworks
- Fixed a problem with type inconsistency of Service API calls, where numbers were showing up as strings
- Fixed an issue with the premature expiration of long-term Authentication Tokens
- Fixed an issue with the Dataset size metric showing data operations size instead of resource usage
Known Issues
- Metrics for MapReduce jobs aren't populated on secure Hadoop clusters
- The metric for the number of cores shown in the Resources view of the CDAP Console will be zero
unless YARN has been configured to enable virtual cores