Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH]: CIP - Expose Collection Dimensionality #1217

Conversation

tazarov
Copy link
Contributor

@tazarov tazarov commented Oct 8, 2023

Description of changes

Summarize the changes made by this PR.

  • New functionality
    • A new CIP to add ability for clients to view collection's dimensionality

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js

Documentation Changes

Docs PR TBD

A new CIP to add ability for clients to view collection's dimensionality
@github-actions
Copy link

github-actions bot commented Oct 8, 2023

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readbility, Modularity, Intuitiveness)

@HammadB HammadB self-requested a review November 29, 2023 18:35
@tazarov tazarov marked this pull request as draft December 1, 2023 08:07
# Conflicts:
#	chromadb/api/fastapi.py
#	chromadb/server/fastapi/__init__.py
#	chromadb/test/test_api.py
#	clients/js/yarn.lock
@tazarov
Copy link
Contributor Author

tazarov commented Dec 2, 2023

@beggers, some infra tests are failing for the past couple of days - https://github.com/chroma-core/chroma/actions/runs/7070701287/job/19247658540?pr=1217, got any idea?

@tazarov tazarov marked this pull request as ready for review December 2, 2023 14:39
@beggers
Copy link
Contributor

beggers commented Dec 5, 2023

I've seen a couple other similar flaky failures. I haven't investigated at all but I suspect we're running into the limits of what our Action runners can do. I'll look more deeply if it gets worse or I have a few free minutes.

@tazarov
Copy link
Contributor Author

tazarov commented Dec 5, 2023

@beggers, did you rerun it? Now all checks pass. @HammadB, this is ready for review.

@HammadB
Copy link
Collaborator

HammadB commented Dec 6, 2023

Can we add a docs PR?

@@ -281,6 +281,22 @@ def _count(self, collection_id: UUID) -> int:
"""
pass

@abstractmethod
def _dimensions(self, collection_id: UUID) -> int:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A concern is this API implicltly prevents us / make maintaining a future multivector world harder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we swap with info then this concern is gone. For multi-vector, we can possibly show ranges of min/max vector dims in a col.

@jeffchuber
Copy link
Contributor

A few thoughts...

  1. a valid workaround is len(collection.get(limit=1)['embeddings'][0]) right? if we simply want to expose this in our clients... perhaps it is better to do this than to add a new end to end method?
  2. i dont like exposing dimensions as the method..... i think that it should be something more general like .describe or .info or .settings or.... something else.

Overall I don't think we should merge this, but I would like to here the "pro" case b/c im not sure i fully understand it

Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved code-wise, but lets wait for @jeffchuber to chime in before we merge.

I am curious about his thoughts on the problem of this making a collection 1:1 to one embedding type. I am personally ok with this / my gut is that because multivector requires many other api changes anyways (add takes embeddings) we will need to do a api change for it and can do something holistic when it makes sense.

@HammadB
Copy link
Collaborator

HammadB commented Dec 6, 2023

Hm @jeffchuber thats a good point. We also may want to show all indexes on a collection etc. I like the idea of a describe method or info method that dumps all that information, it also makes the api more flexible for the case I was mentioning

@tazarov
Copy link
Contributor Author

tazarov commented Dec 6, 2023

we can just then rename the existing dimensions to describe or info and add the dimensions as the first attributed to a dict of attributes.

@HammadB
Copy link
Collaborator

HammadB commented Dec 6, 2023

Lets do that - @jeffchuber are you OK with that?

@HammadB
Copy link
Collaborator

HammadB commented Dec 6, 2023

I like .info() personally!

@jeffchuber
Copy link
Contributor

  1. for renaming, let's enumerate 5-10 options (low effort, gives us confidence we are naming something well)

a valid workaround is len(collection.get(limit=1)['embeddings'][0]) right? if we simply want to expose this in our clients... perhaps it is better to do this than to add a new end to end method?

can we address this question?

@@ -626,6 +626,11 @@ def _count(self, collection_id: UUID) -> int:
return metadata_segment.count()

@trace_method("SegmentAPI._query", OpenTelemetryGranularity.OPERATION)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this trace is wrong FYI

@HammadB
Copy link
Collaborator

HammadB commented Dec 6, 2023

a valid workaround is len(collection.get(limit=1)['embeddings'][0]) right? if we simply want to expose this in our clients... perhaps it is better to do this than to add a new end to end method?

I prefer doing something explicit, especially if we are renaming to info() (or similar) and plan to shove more information into this method in the future. Asking the client to fetch the embeddings when we store the dimensionality just feels hacky for no good reason . Also it makes future clients less implementation dependent. I like how you can code-gen a reasonable client right now. We could of course make the server do a get() but I think thats odd for no reason.

@tazarov
Copy link
Contributor Author

tazarov commented Dec 6, 2023

a valid workaround is len(collection.get(limit=1)['embeddings'][0]) right? if we simply want to expose this in our clients... perhaps it is better to do this than to add a new end to end method?

can we address this question?

@jeffchuber,I think if we fetch embeddings, we'll load the whole vector index, which seems like a side effect we might not want, given that the info is already available in SQLite. To put it into perspective, if you want to show stats on the UI, every time someone looks at their collections, we'll load all of them into memory.

@jeffchuber
Copy link
Contributor

ok im game for adding it - makes sense, thanks for the perspective

lets get the naming and interface right (likely a dict)

here's a chatGPT dump

describe: This name suggests the function provides a detailed description of the object, potentially including its attributes, types, and other relevant information. It implies a comprehensive overview rather than just a quick summary.

get_info: This is a straightforward and clear name. It indicates that the function retrieves and returns information about the object. This name is quite versatile and can be used in a variety of contexts.

settings: This name might be a bit misleading unless the function specifically returns configuration settings of the object. If the function is intended to return general information, this name might not be as appropriate.

details: Similar to describe, this name suggests that the function returns detailed information about the object. It implies a thorough examination.

overview: This name implies a more summarized view of the object, possibly less detailed than what describe or details might suggest.

inspect: This name suggests a detailed examination of the object, possibly including internal states and hidden attributes.

summarize: This name implies that the function provides a concise summary of the object, focusing on key aspects rather than extensive details.

get_attributes: If the function specifically returns the attributes of the object, this name would be very clear and descriptive.

@tazarov
Copy link
Contributor Author

tazarov commented Dec 6, 2023

I wouldn't say I like settings as it implies you can change settings.

@tazarov
Copy link
Contributor Author

tazarov commented Dec 6, 2023

In DB world I think describe is the go-to word.

--- EDIT
and describe feels very pythonic as in describe collection

@tazarov
Copy link
Contributor Author

tazarov commented Dec 6, 2023

Synthetic data ftw:


Certainly! Let's score the method names on a scale of 1 to 10 based on their suitability in a database context for Python and Go, considering factors like clarity, convention, and intuitiveness:

describe:

Python: 9/10 (Widely recognized in similar contexts, very intuitive)
Go: 7/10 (Clear, but not as conventional in Go)
get_info:

Python: 8/10 (Pythonic and straightforward, but slightly generic)
Go: 8/10 (Clear and concise, fits Go's style)
details:

Python: 7/10 (Descriptive, but can be a bit vague)
Go: 8/10 (Direct and clear, good fit for Go)
inspect:

Python: 7/10 (Technical, suits Python's introspection but less common for DB)
Go: 6/10 (A bit less conventional, clear in purpose)
overview:

Python: 6/10 (Suggests summary, which might be misleading for detailed info)
Go: 5/10 (May not fully convey the depth of information provided)
summarize:

Python: 5/10 (Implies brevity, not ideal for detailed info)
Go: 4/10 (Might mislead about the function's depth)
settings:

Python: 3/10 (Misleading unless specifically about configuration settings)
Go: 3/10 (Not suitable unless for settings, could confuse)
get_attributes:

Python: 8/10 (Very clear if specifically about attributes, but narrow in scope)
Go: 7/10 (Explicit, but may be too specific if the method is more general)
These scores are based on how well the names align with each language's conventions and the expected functionality in a database context.

@jeffchuber
Copy link
Contributor

i like describe. i also like the midwit naming power of GPT 😆

@HammadB
Copy link
Collaborator

HammadB commented Dec 6, 2023

Lets do describe then and make the output a TypedDict

@tazarov
Copy link
Contributor Author

tazarov commented Dec 6, 2023

on it!

@tazarov
Copy link
Contributor Author

tazarov commented Dec 6, 2023

@HammadB TAL, I think it's ready.

@tazarov
Copy link
Contributor Author

tazarov commented Sep 17, 2024

Closing as dimensionality already exposed through collection model

@tazarov tazarov closed this Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants