Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Modify list_collections client methods to return a list of collection… #3216

Merged
merged 1 commit into from
Dec 16, 2024

Conversation

itaismith
Copy link
Contributor

Description of changes

list_collections currently returns collection objects. Using these objects can lead to errors, since some collections may have been created with embedding functions that are not available at the time of calling to list_collections. This PR modifies list_collections to return only collection names, until we have EF persistence.

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Docstrings and reference updated.

Copy link

github-actions bot commented Dec 1, 2024

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@tazarov
Copy link
Contributor

tazarov commented Dec 2, 2024

@itaismith, this feels extremely (and possibly unnecessarily) breaking change. I'll do some research on this but I feel that there are tons of apps with the following snippet of code:

for collection in client.list_collections():
  # do something with collection.name or collection.id or collection.metadata (e.g. filter based on name or some metadata)
  # or any of the other methods count(), add(), delete() ...

It seems to me that people have gotten around the issue that this PR is trying to address e.g. use a collection from list_collection to add data an expecting the embedding function to work out of the box.
However I think that the burden of this temporary workaround will cause more damage and time wasted even if in practice it is several lines of code (list + get_collection()). Furthermore isn't introducing a breaking change just so that we can revert back to original behavior in due course thus making it double work for anyone depending on list_collections()?

@atroyn
Copy link
Contributor

atroyn commented Dec 2, 2024

@tazarov we should make this breaking change because the default behavior is incorrect. When we fix the behavior, we can then have users do the right thing.

We should prefer to break sooner than later, and support users to the extent we can in the meantime. @itaismith we might want to be clever here and make a wrapper for str where if you try to call any collection methods on the name we return we give you an informative error and point you at the migration guide.

@HammadB
Copy link
Collaborator

HammadB commented Dec 2, 2024 via email

@itaismith itaismith force-pushed the itai/modify-list-collections-output branch from 395a827 to ce7320e Compare December 9, 2024 18:26
@itaismith itaismith force-pushed the itai/modify-list-collections-output branch from ce7320e to 5e4e3bc Compare December 9, 2024 18:36
@itaismith itaismith force-pushed the itai/modify-list-collections-output branch from 5e4e3bc to 732735d Compare December 9, 2024 20:09
@itaismith itaismith force-pushed the itai/modify-list-collections-output branch from 732735d to 146c7a2 Compare December 9, 2024 21:10
@itaismith itaismith force-pushed the itai/modify-list-collections-output branch from 146c7a2 to 71178d2 Compare December 9, 2024 21:24
@itaismith itaismith force-pushed the itai/modify-list-collections-output branch from 71178d2 to 072c87b Compare December 9, 2024 21:44
@itaismith itaismith force-pushed the itai/modify-list-collections-output branch from 072c87b to b8fb8aa Compare December 9, 2024 22:21
@itaismith itaismith force-pushed the itai/modify-list-collections-output branch from b8fb8aa to 0cc7a70 Compare December 10, 2024 17:57
Copy link
Contributor

@atroyn atroyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to be sure we aren't shadowing str functionality inadvertently, otherwise looks good to me minus nits.

if name in collection_attributes_and_methods:
raise NotImplementedError(
f"In Chroma v0.6.0, list_collections only returns collection names. "
f"Use get_collection to access Collection.{name}. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should read Client.get_collection("{name}")

Collection.{name} is confusing and not something anywhere in our API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name here is probably a confusing argument name. It is the attribute a user would try to access on the collection name string. So:

collection = client.list_collections()[0]
collection.add(...)

The error would be:

Use get_collection to access Collection.add

Copy link
Contributor

@atroyn atroyn Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see now. I think this is confusing and should be made clearer Maybe:

"Use Client.get_collection({collection_name}) to call {method_name}"

f"See https://docs.trychroma.com/deployment/migration for more information."
)

raise AttributeError(f"'CollectionName' object has no attribute '{name}'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is a bit funky because str doesn't have a __getattr__ but does have __getattribute__ which is subtly different. I'm concerned that we would be shadowing str method calls accidentally;

  • can we verify that this doesn't do anything unexpected with e.g. .join
  • if it doesn't, can we comment on how this works at the level of method dispatch in each case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is funky indeed. I can switch to explicit function definitions.

In any case this is a detailed description of the dispatch order:

When a user will try to access an attribute on a CollectionName string, the __getattribute__ method of str is invoked first. If a valid str method or property is found, it will be used. Otherwise, the fallback __getattr__ defined here is invoked next. It will error if the requested attribute is a Collection method or property.

For example:

collection_name = client.list_collections()[0] # collection_name = "test"
collection_name.startsWith("t") # Evaluates to True.

__getattribute__ is invoked first, selecting startsWith from str.

collection_name.add(ids=[...], documents=[...]) # Raises the error defined in this class.

__getattribute__ is invoked first, not finding a match in str. __getattr__ from this class is invoked and raises an error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's funky because if this was just str, __getattr__ would never get called (and doesn't exist) - this is handled by __getattribute__ checking for it. I think that's fine but we should make some sort of comment.

chromadb/api/async_client.py Outdated Show resolved Hide resolved
@@ -89,7 +89,7 @@ def list_collections(self) -> None:
colls = self.client.list_collections()
assert len(colls) == len(self.model)
for c in colls:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe explicitly name this variable now to indicate it's a name not an actual collection

Copy link
Contributor

@atroyn atroyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending nit.

if item in collection_attributes_and_methods:
raise NotImplementedError(
f"In Chroma v0.6.0, list_collections only returns collection names. "
f"Use Client.get_collection(collection_name) to access {item}. "
Copy link
Contributor

@atroyn atroyn Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we know collection_name here and should put it in the error string.

@itaismith itaismith merged commit 328ba8f into main Dec 16, 2024
71 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants