-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH]: CIP - Expose Collection Dimensionality #1217
[ENH]: CIP - Expose Collection Dimensionality #1217
Conversation
A new CIP to add ability for clients to view collection's dimensionality
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
# Conflicts: # chromadb/api/fastapi.py # chromadb/server/fastapi/__init__.py # chromadb/test/test_api.py # clients/js/yarn.lock
@beggers, some infra tests are failing for the past couple of days - https://github.com/chroma-core/chroma/actions/runs/7070701287/job/19247658540?pr=1217, got any idea? |
I've seen a couple other similar flaky failures. I haven't investigated at all but I suspect we're running into the limits of what our Action runners can do. I'll look more deeply if it gets worse or I have a few free minutes. |
Can we add a docs PR? |
chromadb/api/__init__.py
Outdated
@@ -281,6 +281,22 @@ def _count(self, collection_id: UUID) -> int: | |||
""" | |||
pass | |||
|
|||
@abstractmethod | |||
def _dimensions(self, collection_id: UUID) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A concern is this API implicltly prevents us / make maintaining a future multivector world harder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we swap with info
then this concern is gone. For multi-vector, we can possibly show ranges of min/max vector dims in a col.
A few thoughts...
Overall I don't think we should merge this, but I would like to here the "pro" case b/c im not sure i fully understand it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved code-wise, but lets wait for @jeffchuber to chime in before we merge.
I am curious about his thoughts on the problem of this making a collection 1:1 to one embedding type. I am personally ok with this / my gut is that because multivector requires many other api changes anyways (add takes embeddings) we will need to do a api change for it and can do something holistic when it makes sense.
Hm @jeffchuber thats a good point. We also may want to show all indexes on a collection etc. I like the idea of a |
we can just then rename the existing |
Lets do that - @jeffchuber are you OK with that? |
I like .info() personally! |
can we address this question? |
chromadb/api/segment.py
Outdated
@@ -626,6 +626,11 @@ def _count(self, collection_id: UUID) -> int: | |||
return metadata_segment.count() | |||
|
|||
@trace_method("SegmentAPI._query", OpenTelemetryGranularity.OPERATION) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this trace is wrong FYI
I prefer doing something explicit, especially if we are renaming to info() (or similar) and plan to shove more information into this method in the future. Asking the client to fetch the embeddings when we store the dimensionality just feels hacky for no good reason . Also it makes future clients less implementation dependent. I like how you can code-gen a reasonable client right now. We could of course make the server do a get() but I think thats odd for no reason. |
@jeffchuber,I think if we fetch embeddings, we'll load the whole vector index, which seems like a side effect we might not want, given that the info is already available in SQLite. To put it into perspective, if you want to show stats on the UI, every time someone looks at their collections, we'll load all of them into memory. |
ok im game for adding it - makes sense, thanks for the perspective lets get the naming and interface right (likely a dict) here's a chatGPT dump
|
I wouldn't say I like |
In DB world I think --- EDIT |
Synthetic data ftw:
|
i like describe. i also like the |
Lets do describe then and make the output a TypedDict |
on it! |
@HammadB TAL, I think it's ready. |
Closing as dimensionality already exposed through collection model |
Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
pytest
for python,yarn test
for jsDocumentation Changes
Docs PR TBD