
Understanding partial_fit results #37

Open
fabriceyhc opened this issue Nov 3, 2022 · 0 comments
fabriceyhc commented Nov 3, 2022

Hi there, I'm trying to use apricot to help find a diverse set of texts. When I use the fit method, everything works intuitively. However, when I start using the partial_fit method, the outputs do not appear to be correct. I suspect that I'm misunderstanding something about how the library works. In case I'm not, I've prepared a small demo of the issue with explanations of what I got vs. what I expected.

# environment setup (run in a shell, not in Python):
#   pip install textdiversity apricot-select --quiet
from textdiversity import POSSequenceDiversity
from apricot import FacilityLocationSelection

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def test_apricot(featurizer, texts, fit_type="full_fit", batch_size=2):
    selector = FacilityLocationSelection(
        n_samples=len(texts),
        metric='euclidean',
        optimizer='lazy')
    if fit_type == "full_fit":
        f, c = featurizer.extract_features(texts)
        Z = featurizer.calculate_similarities(f)
        selector.fit(Z)
    elif fit_type == "unbatched_partial":
        f, c = featurizer.extract_features(texts)
        Z = featurizer.calculate_similarities(f)
        selector.partial_fit(Z)
    elif fit_type == "batched_partial":
        # feed the selector one batch of texts at a time
        for batch in chunker(texts, batch_size):
            f, c = featurizer.extract_features(batch)
            Z = featurizer.calculate_similarities(f)
            selector.partial_fit(Z)
    print(f"{fit_type} ranking: {selector.ranking} | gain: {sum(selector.gains)}")

# test ====================================================

d = POSSequenceDiversity()

texts = ["This is a test.", 
         "This is also a test.", 
         "This is the real deal.", 
         "So is this one."]

test_apricot(d, texts, "full_fit") # > ranking: [0 3 1 2] | gain: 2.8888888888888893
test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
test_apricot(d, texts, "batched_partial") # > ranking: [2 3] | gain: 0.4444444444444444

texts = ["This is the real deal.",
         "So is this one.",
         "This is a test.", 
         "This is also a test."]

test_apricot(d, texts, "full_fit") # > ranking: [0 1 3 2] | gain: 2.8888888888888893
test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
test_apricot(d, texts, "batched_partial") # > ranking: [0 1] | gain: 0.5

- Full fit: makes intuitive sense. Texts with overlapping semantics are relegated to lower rankings, and so on.
- Unbatched partial: I expected a single `partial_fit` call on the full similarity matrix to behave the same as `fit`, but no matter how I permute the texts (e.g. reversed or any other order), I always get [0 1 2 3]. Since `partial_fit` returns the same ranking regardless of input order, either this is a bug or I'm misunderstanding the method. Please let me know.
- Batched partial: This one is responsive to changes in the order of the texts, but (a) it does not respect the `n_samples` parameter (I wanted to rank all four texts), and (b) its ranking does not agree with the full fit, which I trust the most but unfortunately cannot use due to the size of my dataset.
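For context, my intuition for `fit` comes from the standard greedy maximizer for the facility-location objective. Below is a minimal numpy sketch of that greedy algorithm (my own hypothetical reference, not apricot's actual implementation); note its ranking depends only on the similarity matrix, so permuting the inputs just relabels the selection, which matches the order-invariance I see from `fit`:

```python
import numpy as np

def greedy_facility_location(S, k):
    """Naive greedy maximizer for f(A) = sum_i max_{j in A} S[i, j],
    where S is an (n, n) similarity matrix. Hypothetical sketch only."""
    n = S.shape[0]
    best = np.zeros(n)          # current best similarity of each item to the selected set
    selected, gains = [], []
    for _ in range(k):
        # marginal gain of adding each candidate column j
        marginal = np.maximum(S, best[:, None]).sum(axis=0) - best.sum()
        marginal[selected] = -np.inf   # never re-select an item
        j = int(np.argmax(marginal))
        selected.append(j)
        gains.append(float(marginal[j]))
        best = np.maximum(best, S[:, j])
    return selected, gains
```

If `partial_fit` instead uses a streaming selection strategy (as I'd presume for out-of-memory data), that could explain why its ranking and gains differ from the greedy full fit, but I'd appreciate confirmation.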

Thanks for taking the time to read this and potentially help me out.
