These are the instructions for installing and using the DataStax Early Adopter Release using Docker images.
For the downloadable tarball version of the DSE EAP preview, go to the DataStax Labs website and download the DataStax EAP preview. Then follow the instructions in the README.md included in the download.
The use of the software described here is subject to the DataStax Labs Terms.
The EAP release included in this Labs package includes the following new features and enhancements:
- DataStax Graph Core Engine - New Graph for Cassandra engine
- DB-424 - Zero Copy Streaming
- DB-1960 - Incremental NodeSync
- DB-3289 - Opt-In Guardrails (number of columns per table, number of tables, number of materialized views and secondary indexes, and partition size)
- DB-3170 - Allow setting of pre-hashed passwords via CQL
- DB-74 - New TRUNCATE and Update Permissions
- DSP-17044 - Users can now supply TTL and WriteTime based on a column in a DataFrame
- DSP-18261 - Consolidate AOSS log files
- DB-413 - Implement encryption on the SSTable Partition Index
- DB-468 - Java 8 and 11 Support for DSE Core Only (C* Only)
- DB-2831 - Allow filtering using IN restrictions
- DSP-17586 - DSE Tools startup is faster
- DSP-13680, DSP-16873 - More reliable DSEFS startup and shutdown
- DSP-17823 - Spark 2.4 Upgrade
- DSP-15762 - Improve Spark Job Performance (by up to 60%) by Reducing Tombstones
- DSP-17431 - Remove Legacy Solr Join Syntax for Non Partition Key JOINS
- DB-2509 - Chunk cache heap overhead is too high
Please see the CHANGES-DSE.txt file located in the resources/cassandra folder on the DSE node for a full list of all changes included in this release.
Please review the graph/graph-docs directory for an overview of the features, behaviors, and functionality of the experimental and new graph engine.
Zero Copy Streaming improves the performance of streaming operations by up to 4X. It does so by avoiding any serialization during streaming, so that the entire streaming process becomes a network copy. Zero Copy Streaming is enabled by default.
Zero Copy Streaming works by streaming only the ranges of an sstable that need to be streamed, each as a separate sstable. The sstable metadata is streamed in its entirety and linked to every sstable produced on the destination node. This avoids the costly rebuilding of that metadata at the expense of additional disk usage (see zerocopy_max_unused_metadata_in_mb). All sstables and their components are also copied via zero-copy operations, greatly reducing GC pressure and improving overall speed.
This item introduces the following new cassandra.yaml properties. For more details, see the dse-db/conf/cassandra.yaml file:
- zerocopy_streaming_enabled -- Enabled by default
- zerocopy_max_sstables -- Determines the max number of sstables a single sstable can be split into to actually use zero-copy rather than legacy streaming.
- zerocopy_max_unused_metadata_in_mb -- Determines how many megabytes per sstable of excess metadata are allowed in order to actually use zero-copy rather than legacy streaming.
- stream_outbound_buffer_in_kb -- Buffer size for stream writes: each outbound streaming session buffers writes according to this size.
- stream_max_outbound_buffers_in_kb -- Max amount of pending data to be written before pausing outbound streaming: this value is shared among all outbound streaming sessions, in order to cap the overall memory used by all streaming processes (bootstrap, repair, etc.).
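As a rough illustration of how the two zerocopy_max_* limits gate the choice between zero-copy and legacy streaming, here is a minimal Python sketch. It is not the DSE implementation: the class, the function, the numeric values, and the inclusive treatment of the limits are all assumptions made for this example. Only the property names come from cassandra.yaml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZeroCopySettings:
    # Placeholder values for illustration only; NOT the shipped defaults.
    zerocopy_streaming_enabled: bool = True
    zerocopy_max_sstables: int = 100
    zerocopy_max_unused_metadata_in_mb: int = 50


def choose_streaming_mode(settings, split_sstables, unused_metadata_mb):
    """Return "zero-copy" or "legacy" for one source sstable.

    split_sstables: how many sstables the source would be split into.
    unused_metadata_mb: excess metadata kept on the destination, in MB.
    Assumption: a value equal to a limit is still allowed.
    """
    if not settings.zerocopy_streaming_enabled:
        return "legacy"
    if split_sstables > settings.zerocopy_max_sstables:
        return "legacy"
    if unused_metadata_mb > settings.zerocopy_max_unused_metadata_in_mb:
        return "legacy"
    return "zero-copy"
```

Under these assumed values, a source sstable split into 4 ranges with 10 MB of excess metadata would use zero-copy, while one split into 101 ranges would fall back to legacy streaming.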
NodeSync has a new incremental mode, which can be enabled on a per-table basis with: ALTER TABLE t WITH nodesync = { 'enabled': 'true', 'incremental': 'true'}
When enabled, new validations will not re-validate previously validated data, drastically lowering the work done by NodeSync (and thus its impact on the cluster). One current downside, however, is that if a node loses an sstable (for instance, an sstable gets corrupted and either needs to be entirely deleted, or scrub is not able to recover all of its data), then a manual user validation needs to be triggered to ensure the lost data is recovered.
This item will be enabled by default when the next version of DSE is GA. DataStax recommends using Incremental NodeSync over non-incremental NodeSync.
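To enable incremental mode on several tables, the ALTER TABLE statement shown above can be generated per table. The following Python sketch only builds the CQL text. The helper name and the identifier check are assumptions for this example, and running the statement (for instance through cqlsh) is left to the reader.

```python
import re

# Assumption: only simple unquoted CQL identifiers are accepted here.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def incremental_nodesync_cql(keyspace, table):
    """Build the ALTER TABLE statement that enables incremental NodeSync."""
    for name in (keyspace, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    return (
        f"ALTER TABLE {keyspace}.{table} WITH nodesync = "
        "{ 'enabled': 'true', 'incremental': 'true' };"
    )


# Hypothetical keyspace and table names.
for table in ("users", "orders"):
    print(incremental_nodesync_cql("shop", table))
```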
With this release of DSE, DataStax is introducing a new concept to help users avoid making mistakes and implementing known anti-patterns in Cassandra. We call these items Guardrails. In this release, the Guardrails are optional and are not enabled by default unless otherwise noted. Based on the experiences and feedback from DSE users, DataStax may enable these items by default in future releases.
Here is the list of Guardrails introduced in this release. Please refer to the cassandra.yaml file for more details on each guardrail:
- tombstone_warn_threshold: Default tombstone_warn_threshold is 1000
- tombstone_failure_threshold: Default tombstone_failure_threshold is 100000
- partition_size_warn_threshold_in_mb: Log a warning when compacting partitions larger than this value.
- batch_size_warn_threshold_in_kb: Log WARN on any multiple-partition batch size that exceeds this value. 64kb per batch by default.
- batch_size_fail_threshold_in_kb: Fail any multiple-partition batch that exceeds this value. The calculated default is 640kb (10x warn threshold).
- unlogged_batch_across_partitions_warn_threshold: Log WARN on any batches not of type LOGGED that span across more partitions than this limit.
- column_value_size_failure_threshold_in_kb: Failure threshold to prevent writing large column values into Cassandra.
- columns_per_table_failure_threshold: Failure threshold to prevent creating more columns per table than threshold.
- fields_per_udt_failure_threshold: Failure threshold to prevent creating more fields in user-defined-type than threshold.
- collection_size_warn_threshold_in_kb: Warning threshold to warn when encountering collection data larger than threshold.
- items_per_collection_warn_threshold: Warning threshold to warn when encountering more elements in a collection than threshold.
- read_before_write_list_operations_enabled: Whether read-before-write operations on lists are allowed, e.g. setting a list element by index or removing a list element.
- secondary_index_per_table_failure_threshold: Failure threshold to prevent creating more secondary indexes per table than threshold.
- materialized_view_per_table_failure_threshold: Failure threshold to prevent creating more materialized views per table than threshold.
- tables_warn_threshold: Warning threshold to warn when creating more tables than threshold.
- tables_failure_threshold: Failure threshold to prevent creating more tables than threshold.
- table_properties_disallowed: Prevents creating tables with the provided configurations.
- user_timestamps_enabled: Whether to allow user-provided timestamp in write request. Default is true.
- write_consistency_levels_disallowed: Prevents queries that use any of the provided consistency levels.
- page_size_failure_threshold_in_kb: Failure threshold to prevent requesting larger paging by bytes than threshold; also serves as a hard paging limit.
- in_select_cartesian_product_failure_threshold: Failure threshold to prevent IN queries from creating a cartesian product larger than threshold, e.g. "a in (1,2,...10) and b in (1,2...10)" results in a cartesian product of 100.
- partition_keys_in_select_failure_threshold: Failure threshold to prevent IN queries containing more partition keys than threshold.
- disk_usage_percentage_warn_threshold: Warning threshold to warn when local disk usage exceeds threshold. Valid values: (1, 100]
- disk_usage_percentage_failure_threshold: Failure threshold to reject write requests if replica disk usage exceeds threshold. Valid values: (1, 100]
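The arithmetic behind two of these guardrails can be sketched in a few lines of Python. This only illustrates the rules as described in the list above (the cartesian product of IN restrictions, and the 64kb warn / 10x fail batch thresholds). It is not DSE code, and the function names are assumptions made for this example.

```python
from math import prod


def in_cartesian_product_size(in_restrictions):
    """Number of value combinations produced by a set of IN restrictions.

    in_restrictions maps a column name to the values listed in its IN clause.
    """
    return prod(len(values) for values in in_restrictions.values())


def classify_batch(size_kb, warn_kb=64, fail_kb=None):
    """Return "ok", "warn" or "fail" for a multiple-partition batch size.

    The fail threshold defaults to 10x the warn threshold, as described
    for batch_size_fail_threshold_in_kb.
    """
    if fail_kb is None:
        fail_kb = 10 * warn_kb
    if size_kb > fail_kb:
        return "fail"
    if size_kb > warn_kb:
        return "warn"
    return "ok"


# "a in (1,2,...10) and b in (1,2...10)"
restrictions = {"a": range(1, 11), "b": range(1, 11)}
print(in_cartesian_product_size(restrictions))  # 100
print(classify_batch(100))                      # warn
print(classify_batch(641))                      # fail
```

A query would be rejected when the computed product exceeds in_select_cartesian_product_failure_threshold, so the example above fails with any threshold below 100.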
We want to hear your feedback! Go to the Labs section of the new DataStax Community forums: https://community.datastax.com/spaces/11/index.html
You can also reach out on the Labs forums for any help needed.