Skip to content
This repository has been archived by the owner on Jul 7, 2020. It is now read-only.

Recordinality implementation #101

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Recordinality implementation #101

wants to merge 4 commits into from

Conversation

positiveblue
Copy link
Contributor

  • I have implemented Recordinality, a stream algorithm. Recordinality class extends from ICardinality and Serialize. You should check the serialize process and the merge (return an exeption because Recordinality doesn't support it).
  • I have implemented some tests for the Recordinality class.
  • ReadMe modified, now with Recordinality paper information.

The main algorithm is working. Recordinality extends from ICardinality and Serialize but we need to performance some TODOs such as serialize, merge etc...

There are a new test class for Recordinality with three tests.
@tea-dragon
Copy link
Contributor

The algorithm and implementation look broadly correct, but there's some formatting and minor implementation details that would be nice to be addressed. eg. If merge isn't supported, it is fine to just throw an unsupported operation exception.

One note though:
Based on the paper, it seems like recordinality is pretty straightforwardly inferior to the existing hllp algorithm. Maybe if your values are naturally uniformly random longs, you could skip hashing them and thereby preserve a sample set, but that seems a bit outlandish.

I'm not opposed to including even if only for completionist or academic reasons (there seems to be a variety of suggested areas of further research), however, it seems appropriate document the class accordingly.

@tea-dragon
Copy link
Contributor

I don't mind handling such mundane things directly, so if you'd rather, I can just pull and then clean it up myself. Just lmk

@positiveblue
Copy link
Contributor Author

Hello Ian,

Thanks for taking a look at my code. I have written a basic version of Recordinality. The main reason has been that this pull request is to learn how the library work, which is the process to contribute, etc...

As you have said, it's possible to run the algorithm without hash the item before and it could be useful for strings for example. However, there are more Recordinality variations that could be interesting.For instance, Adaptive Recordinality (AR) has the feature that let them increase the memory usage if it's necessary. If you don't know which is te maximum cardinality you can run AR with only one position and let them increase if it's necesary.

I'm going to address the format and the merge 'errors' and after that will resubmit the pull request. After this, I will iterate more times adding some features (if it's right).

Thanks.

Merge function throw an UnsupportedOperationException.

Added an Ignore decorator to one test.
@positiveblue
Copy link
Contributor Author

Well,

I have reformated the code and added the UnsoprttedOperationException() to the merge function.
However I have some questions about the format tools: do you use some tool for check the format? I am not a Java programmer, now I am using IntelliJ but I don't know the ecosystem.

If there are some problems with this implementation let me know. As I told, I will implement the adaptive version in a near future.

Thanks.

@tea-dragon
Copy link
Contributor

We use intellij (long live it's glory!), so we mostly just have a formatting style we import/export to share. Deviating from it is fine, but it usually covers most minor issues in an unobjectionable manor. I'll make sure a recent version is available somewhere -- at that point you can just import settings from the jar and it should properly create an "AddThis" codestyle without interfering with other things.

@tea-dragon
Copy link
Contributor

As a quick kinda hint, generally the start of a file goes "copywriter/license headers", "package/imports", "class javadoc". I always have intellij fold (aka kinda hide) those first two components, so it works out fine in practice too.

@tea-dragon
Copy link
Contributor

(That said, today is kind of tightly scheduled -- think I'm already late for things -- so you might not see results from my promises until tomorrow)

@abramsm
Copy link
Contributor

abramsm commented Mar 24, 2017

Looks like all tests are passing. Can you take a look at an example file for formatting and headers? https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/ConcurrentStreamSummary.java The copyright should be at the top of the doc (before package)... thanks!

@positiveblue
Copy link
Contributor Author

I will recheck the headers this evening (PTZ).

@abramsm
Copy link
Contributor

abramsm commented Mar 28, 2017

Thanks, also suggest moving this to the same 'research' package mentioned for the HyperBitBit since it doesn't seem likely that this would be a preferable option to HLLP in production uses. Let me know if you feel differently.

@positiveblue
Copy link
Contributor Author

I agree with your suggestion.

Anyway, let me reimplement Recordinality and open some discussion on the mailing list because I think it has some "killer features" that could be useful.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants