Hi there, I'm trying to use apricot to help find a diverse set of texts. When I use the fit method, everything works intuitively. However, when I start using the partial_fit method, the outputs do not appear to be correct. I suspect that I'm misunderstanding something about how the library works. In case I'm not, I've prepared a small demo of the issue with explanations of what I got vs. what I expected.
from textdiversity import POSSequenceDiversity
from apricot import FacilityLocationSelection
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
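# (Comment added for clarity: chunker just yields consecutive slices, so for the
#  four texts below with batch_size=2 the "batched_partial" branch sees
#  texts[0:2] and texts[2:4] as two independent batches; the helper itself
#  carries nothing across batches.)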
def test_apricot(featurizer, texts, fit_type="full_fit", batch_size=2):
    selector = FacilityLocationSelection(
        n_samples=len(texts),
        metric='euclidean',
        optimizer='lazy')
    if fit_type == "full_fit":
        f, c = featurizer.extract_features(texts)
        Z = featurizer.calculate_similarities(f)
        selector.fit(Z)
    elif fit_type == "unbatched_partial":
        f, c = featurizer.extract_features(texts)
        Z = featurizer.calculate_similarities(f)
        selector.partial_fit(Z)
    elif fit_type == "batched_partial":
        for batch in chunker(texts, batch_size):
            f, c = featurizer.extract_features(batch)
            Z = featurizer.calculate_similarities(f)
            selector.partial_fit(Z)
    print(f"{fit_type} ranking: {selector.ranking} | gain: {sum(selector.gains)}")
# test ====================================================
d = POSSequenceDiversity()

texts = ["This is a test.",
         "This is also a test.",
         "This is the real deal.",
         "So is this one."]

test_apricot(d, texts, "full_fit")          # > ranking: [0 3 1 2] | gain: 2.8888888888888893
test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
test_apricot(d, texts, "batched_partial")   # > ranking: [2 3] | gain: 0.4444444444444444

texts = ["This is the real deal.",
         "So is this one.",
         "This is a test.",
         "This is also a test."]

test_apricot(d, texts, "full_fit")          # > ranking: [0 1 3 2] | gain: 2.8888888888888893
test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
test_apricot(d, texts, "batched_partial")   # > ranking: [0 1] | gain: 0.5
Full fit: makes intuitive sense. Texts with overlapping semantics get relegated to lower rankings, etc.

Unbatched partial: I would have expected the unbatched partial fit to behave the same as the full fit, but no matter what order I put the texts in (e.g. reversed or any other permutation), I always get [0 1 2 3]. Since the partial_fit method always returns the same ranking despite changes in the underlying order, this may indicate a bug, or that I don't understand it well enough. Please let me know.

Batched partial: This one is responsive to changes in the order of the texts, but a) it does not respect the n_samples parameter (I wanted to rank all the texts), and b) it does not appear to agree with the ranking from the full fit (which I trust the most, but unfortunately cannot use due to the size of my dataset).
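In case it helps isolate this from textdiversity, here is a stripped-down sketch of the unbatched comparison on random feature vectors (the random matrix is only a stand-in for my features, so treat it as a sketch rather than a faithful reproduction of my pipeline; the selector arguments mirror the snippet above):

import numpy as np
from apricot import FacilityLocationSelection

# stand-in for the featurizer output: 4 items, 5 features each
X = np.random.RandomState(0).uniform(size=(4, 5))

full = FacilityLocationSelection(n_samples=4, metric='euclidean', optimizer='lazy')
full.fit(X)
print("fit:        ", full.ranking, sum(full.gains))

partial = FacilityLocationSelection(n_samples=4, metric='euclidean', optimizer='lazy')
partial.partial_fit(X)  # single call over all rows, i.e. the "unbatched_partial" case
print("partial_fit:", partial.ranking, sum(partial.gains))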
Thanks for taking the time to read + potentially helping me out.