Rust implementation of the CVM counting algorithm

This library implements the algorithm described in

Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). Distinct Elements in Streams: An Algorithm for the (Text) Book. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34

The accompanying article in Quanta is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

Installation

Binaries and installation instructions are available for x64 Linux, Apple Silicon and Intel, and x64 Windows in releases

You can also build your own if you have Rust installed: cargo install cvmcount.

CLI Example

cvmcount -t file.txt -e 0.8 -d 0.1 -s 2900

-t --tokens: a valid path to a text file

-e --epsilon: how close you want your estimate to be to the true number of distinct tokens. A smaller ε means you require a more precise estimate. For example, ε = 0.05 means you want your estimate to be within 5 % of the actual value. An epsilon of 0.8 is a good starting point for most applications.

-d --delta: the level of certainty that the algorithm's estimate will fall within your desired accuracy range. A higher confidence (e.g. 99.9 %) means you're very sure the estimate will be accurate, while a lower confidence (e.g. 90 %) means there's a higher chance the estimate may be outside your desired range. A delta of 0.1 is a good starting point for most applications

-s --streamsize: this is used to determine buffer size and can be a loose approximation. The closer it is to the stream size, the more accurate the results

The --help option is available.

Note

If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!

Perf

Calculating the unique tokens in a 418K UTF-8 text file takes 18.6 ms ± 0.3 ms on an M2 Pro

Implementation Details

This library strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.github/workflows		.github/workflows
src		src
.gitignore		.gitignore
Cargo.toml		Cargo.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Rust implementation of the CVM counting algorithm

Installation

CLI Example

Note

Perf

Implementation Details

About

Releases

Packages

Languages

kuldeepmeel/cvmcount

Folders and files

Latest commit

History

Repository files navigation

Rust implementation of the CVM counting algorithm

Installation

CLI Example

Note

Perf

Implementation Details

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages