CDAP 6.8.0
New Features
The Dataplex Batch Source and Dataplex Sink plugins are generally available (GA).
CDAP-19592: For Oracle (by Datastream) replication sources, added a purge policy for the GCS (Google Cloud Storage) bucket that the plugin creates for Datastream to write its output to.
CDAP-19584: Added support for monitoring CDAP pipelines using an external tool.
CDAP-18450: Added support for AND triggers. Previously, all triggers were OR triggers; now you can create both OR and AND triggers.
PLUGIN-871: Added support for BigQuery batch source pushdown.
Enhancements
CDAP-19678: Added the ability to specify Kubernetes affinity for CDAP services in the CDAP custom resource.
CDAP-19605: Logs from the Twill application master now appear in the pipeline logs.
CDAP-19591: In the Datastream replication source, added the GCS Bucket Location property, which specifies the location of the bucket that Datastream writes its output to.
CDAP-19590: In the Datastream replication source, added the list of Datastream regions to the Region property. You no longer need to manually enter the Datastream region.
CDAP-19589: For replication jobs with an Oracle (by Datastream) source, ensured data consistency when multiple CDC events are generated at the same timestamp, by ordering events reliably.
CDAP-19568: Significantly reduced the time it takes to start a pipeline (after provisioning).
CDAP-19555, CDAP-19554: Made the following improvements and changes for streaming pipelines with a single Kafka Consumer Streaming source and no Windower plugins:
The Kafka Consumer Streaming source now has native at-least-once support, so data is guaranteed to be processed at least once.
CDAP-19501: For Replication jobs, improved performance for Review Assessment.
CDAP-19475: Modified /app endpoints (GET and POST) in AppLifecycleHttpHandler to include the following information in the response:
"change": {
"author": "joe",
"creationTimeMillis": 1668540944833,
"latest": true
}
The new information is included in the response for the following endpoints (see the sample response after this list):
- Create an Application
- Update an Application
- List Applications
- Details of an Application
- Deploy an Artifact and Application
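For illustration, a trimmed application record as it might appear in a List Applications response. The change block matches the snippet above; the surrounding fields are typical of an application record and are shown only for context:

{
  "name": "myPipeline",
  "version": "1.0.0",
  "artifact": {
    "name": "cdap-data-pipeline",
    "version": "6.8.0",
    "scope": "SYSTEM"
  },
  "change": {
    "author": "joe",
    "creationTimeMillis": 1668540944833,
    "latest": true
  }
}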
CDAP-19365: Changed the Datastream replication source to identify each row by the primary key of the table. Previously, the plugin identified each row by the ROWID.
CDAP-19328: Plugins based on the Splitter Transformation now have access to the prepareRun and onRunFinish methods (see the sketch below).
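A minimal sketch of what this enables, assuming the submitter lifecycle signatures prepareRun(StageSubmitterContext) and onRunFinish(boolean, StageSubmitterContext) from io.cdap.cdap.etl.api; the class, routing field, and port names are hypothetical, so verify the exact interfaces against your CDAP version:

import io.cdap.cdap.api.data.format.StructuredRecord;
import io.cdap.cdap.etl.api.MultiOutputEmitter;
import io.cdap.cdap.etl.api.MultiOutputPipelineConfigurer;
import io.cdap.cdap.etl.api.SplitterTransform;
import io.cdap.cdap.etl.api.StageSubmitterContext;

public class ExampleSplitter extends SplitterTransform<StructuredRecord, StructuredRecord> {

  @Override
  public void configurePipeline(MultiOutputPipelineConfigurer configurer) {
    // Declare output ports and schemas here.
  }

  @Override
  public void prepareRun(StageSubmitterContext context) throws Exception {
    // Newly accessible in 6.8.0: runs once before the pipeline run starts,
    // for example to set up external resources.
  }

  @Override
  public void onRunFinish(boolean succeeded, StageSubmitterContext context) {
    // Newly accessible in 6.8.0: runs once after the run completes.
  }

  @Override
  public void transform(StructuredRecord record, MultiOutputEmitter<StructuredRecord> emitter) throws Exception {
    // Route each record to a port; the "size" field and port names are hypothetical.
    Number size = record.get("size");
    emitter.emit(size != null && size.intValue() > 100 ? "large" : "small", record);
  }
}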
CDAP-18430: The Lineage page has a new look-and-feel.
Bug Fixes
CDAP-20002: Removed the CDAP Tour from the Welcome page.
CDAP-19939: Fixed an issue in the BigQuery target replication plugin that caused replication jobs to fail when replicating datetime columns from sources with greater than microsecond precision, for example, the datetime2 data type in SQL Server.
CDAP-19970: Google Cloud Data Loss Prevention plugins (version 1.4.0) are available in the CDAP Hub version 6.8.0 with the following changes:
- For the Google Cloud Data Loss Prevention (DLP) PII Filter Transformation, fixed an issue where pipelines failed because the DLP client was not initialized.
- For all of the Google Cloud Data Loss Prevention (DLP) transformations, added relevant exception details when validation of the DLP inspection template fails, rather than throwing a generic IllegalArgumentException.
CDAP-19630: For custom Dataproc compute profiles, fixed an issue where the wrong GCS bucket was used to stage data. Now, CDAP uses the GCS bucket specified in the custom compute profile.
CDAP-19599: Fixed an issue in the BigQuery Replication Target plugin that caused replication jobs to fail when the BigQuery target table already existed. The new version of the plugin will automatically be used in new replication jobs. Due to CDAP-19622, if you want to use the new plugin version in existing jobs, recreate each replication job.
CDAP-19486: In the Wrangler transformation, fixed an issue where the pipeline didn’t fail when the Error Handling property was set to Fail Pipeline. This happened when an error was returned but no exception was thrown and there were 0 records in the output, for example, when one of the directives (such as parse-as-simple-date) failed because the input data was not in the correct format. This fix is behind a feature flag and is not enabled by default. If the feature flag is enabled, existing pipelines might fail when there are data issues, since the default Error Handling property is Fail Pipeline.
CDAP-19481: Fixed an issue that caused Replication Assessment to hang when the Oracle (by Datastream) GCS Bucket property was empty or had an invalid bucket name. Now, CDAP returns a 400 error code during assessment when the property is empty or has an invalid bucket name.
CDAP-19455: Added user error tags to Dataproc errors returned during cluster creation and job submission. Also added the ability to set a troubleshooting documentation URL for Dataproc API errors in the CDAP site configuration.
CDAP-19442: Fixed an issue that caused Replication jobs to fail when the source column name didn’t comply with BigQuery naming conventions. Now, if a source column name doesn’t comply with BigQuery naming conventions, CDAP replaces invalid characters with an underscore, prepends an underscore if the first character is a number, and truncates the name if it exceeds the maximum length (see the sketch below).
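The renaming rules amount to a small algorithm. A standalone sketch of them, not CDAP's actual code; the 300-character limit is an assumption based on BigQuery's documented column-name limit:

import java.util.regex.Pattern;

public class ColumnNameSanitizer {
  // Assumed maximum column name length; check the current BigQuery limit.
  private static final int MAX_LENGTH = 300;
  // BigQuery column names allow only letters, digits, and underscores.
  private static final Pattern INVALID = Pattern.compile("[^A-Za-z0-9_]");

  public static String sanitize(String name) {
    // Replace invalid characters with an underscore.
    String result = INVALID.matcher(name).replaceAll("_");
    // Prepend an underscore if the first character is a number.
    if (!result.isEmpty() && Character.isDigit(result.charAt(0))) {
      result = "_" + result;
    }
    // Truncate the name if it exceeds the maximum length.
    return result.length() > MAX_LENGTH ? result.substring(0, MAX_LENGTH) : result;
  }
}

For example, sanitize("2021 sales%") returns "_2021_sales_".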
CDAP-19266: In the File batch source, fixed an issue where Get Schema appeared only when Format was set to delimited. Now, Get Schema appears for all formats.
CDAP-18846: Fixed an issue with the output schema when connecting a Splitter transformation to a Joiner transformation.
CDAP-18302: Fixed an issue where Compute Profile creation failed without showing an error message in the CDAP UI. Now, CDAP shows an error message when a Compute Profile is missing required properties.
CDAP-17619: Fixed an issue that caused imports in the CDAP UI to fail for pipelines exported through the Pipeline Microservices.
CDAP-13130: Fixed an issue where you couldn’t keep an earlier version of a plugin when you exported a pipeline and then imported it into the same version of CDAP, even though the earlier version of the plugin was still deployed. Now, if you export a pipeline that uses an earlier version of a plugin, when you import the pipeline, you can choose to keep the earlier version or upgrade to the current version. For example, if you export a pipeline with a BigQuery source (version 0.20.0) and then import it into the same CDAP instance where version 0.21.0 is deployed, you can choose to keep version 0.20.0 or upgrade to version 0.21.0.
PLUGIN-1433: In the Oracle Batch Source, when the source data included fields with the Numeric data type (undefined precision and scale), CDAP set the precision to 38 and the scale to 0. If any values in such a field had a scale other than 0, CDAP truncated them, which could result in data loss. If the scale for a field was overridden in the plugin output schema, the pipeline failed.
Now, if an Oracle source has Numeric data type fields with undefined precision and scale, you must manually set the scale for these fields in the plugin output schema (see the example below). When you run the pipeline, it no longer fails, and the scale you set is used for the field. However, values with a scale greater than the scale defined in the plugin might still be truncated. CDAP writes warning messages to the pipeline log indicating the presence of Numbers with undefined precision and scale. For more information about setting precision and scale in a plugin, see Changing the precision and scale for decimal fields in the output schema.
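For illustration, a decimal field with an explicit scale in the plugin output schema might look like the following; CDAP represents decimals with an Avro-style logical type, and the field name, precision, and scale here are placeholders:

{
  "name": "amount",
  "type": {
    "type": "bytes",
    "logicalType": "decimal",
    "precision": 38,
    "scale": 9
  }
}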
PLUGIN-1325: In Wrangler, fixed an issue that caused the Wrangler UI to hang when a BigQuery table name contained characters besides alphanumeric characters and underscores (such as a dash). Now, Wrangler successfully imports BigQuery tables that comply with BigQuery naming conventions.
PLUGIN-826: In the HTTP batch source plugin, fixed an issue where validation failed when the URL property contained a macro and Pagination Type was set to Increment an index.
PLUGIN-1378: In the Dataplex Sink plugin, added a new property, Update Dataplex Metadata, which adds support for updating metadata in Dataplex for newly generated data.
PLUGIN-1374: Improved performance for batch pipelines with MySQL sinks.
PLUGIN-1333: Improved Kafka Producer Sink performance.
PLUGIN-664: In the Google Cloud Storage Delete Action plugin, added support for bulk deletion of files and folders. You can now use the wildcard character (*) in a path to represent any characters (see the examples below).
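For example, hypothetical paths using the wildcard (the bucket, folders, and file names are illustrative):

gs://my-bucket/tmp/*.csv
gs://my-bucket/logs/2022-*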
PLUGIN-641: In Wrangler, added the Average arithmetic function, which calculates the average of the selected columns.
In Wrangler, Numeric functions now support three or more columns.
Security Fixes
The following vulnerabilities were found in open source libraries:
- Arbitrary Code Execution
- Deserialization of Untrusted Data
- SQL Injection
- Information Exposure
- Hash Collision
- Remote Code Execution (RCE)
To address these vulnerabilities, the following libraries have security fixes:
- commons-collections:commons-collections (Deserialization of Untrusted Data). Upgraded to apply security fixes.
- commons-fileupload:commons-fileupload (Arbitrary Code Execution). Upgraded to apply security fixes.
- ch.qos.logback:logback-core (Arbitrary Code Execution). Upgraded to apply security fixes.
- org.apache.hive:hive-jdbc (SQL Injection). Excluded the org.apache.hive:hive-jdbc dependency.
- org.bouncycastle:bcprov-jdk16 (Hash Collision)
- com.fasterxml.jackson.core:jackson-databind (Deserialization of Untrusted Data). Upgraded to apply security fixes.
Deprecations
CDAP-19559: For streaming pipelines, the Pipeline configuration properties Checkpointing and Checkpoint directory are deprecated. Setting these properties will no longer have any effect.
CDAP automatically decides whether checkpointing or CDAP internal state tracking is enabled. To disable at-least-once processing in streaming pipelines, set the runtime argument cdap.streaming.atleastonce.enabled to false; this disables both Spark checkpointing and state tracking (see the example below).
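For example, runtime arguments are passed as a JSON map of strings, so a minimal map that disables at-least-once processing looks like this (the argument name and value come from this note; where you set runtime arguments, such as in the pipeline UI or a program start request, depends on your setup):

{
  "cdap.streaming.atleastonce.enabled": "false"
}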