[ENH]: CIP - Expose Collection Dimensionality #1217

tazarov · 2023-10-08T15:51:09Z

Description of changes

Summarize the changes made by this PR.

New functionality
- A new CIP to add ability for clients to view collection's dimensionality

Test plan

How are these changes tested?

Tests pass locally with pytest for python, yarn test for js

Documentation Changes

Docs PR TBD

A new CIP to add ability for clients to view collection's dimensionality

github-actions · 2023-10-08T15:51:25Z

# Conflicts: # chromadb/api/fastapi.py # chromadb/server/fastapi/__init__.py # chromadb/test/test_api.py # clients/js/yarn.lock

tazarov · 2023-12-02T14:30:26Z

@beggers, some infra tests are failing for the past couple of days - https://github.com/chroma-core/chroma/actions/runs/7070701287/job/19247658540?pr=1217, got any idea?

beggers · 2023-12-05T00:01:51Z

I've seen a couple other similar flaky failures. I haven't investigated at all but I suspect we're running into the limits of what our Action runners can do. I'll look more deeply if it gets worse or I have a few free minutes.

tazarov · 2023-12-05T13:45:40Z

@beggers, did you rerun it? Now all checks pass. @HammadB, this is ready for review.

HammadB · 2023-12-06T18:02:03Z

Can we add a docs PR?

HammadB · 2023-12-06T18:04:03Z

chromadb/api/__init__.py

@@ -281,6 +281,22 @@ def _count(self, collection_id: UUID) -> int:
        """
        pass

+    @abstractmethod
+    def _dimensions(self, collection_id: UUID) -> int:


A concern is this API implicltly prevents us / make maintaining a future multivector world harder

if we swap with info then this concern is gone. For multi-vector, we can possibly show ranges of min/max vector dims in a col.

jeffchuber · 2023-12-06T18:05:20Z

A few thoughts...

a valid workaround is len(collection.get(limit=1)['embeddings'][0]) right? if we simply want to expose this in our clients... perhaps it is better to do this than to add a new end to end method?
i dont like exposing dimensions as the method..... i think that it should be something more general like .describe or .info or .settings or.... something else.

Overall I don't think we should merge this, but I would like to here the "pro" case b/c im not sure i fully understand it

HammadB

Approved code-wise, but lets wait for @jeffchuber to chime in before we merge.

I am curious about his thoughts on the problem of this making a collection 1:1 to one embedding type. I am personally ok with this / my gut is that because multivector requires many other api changes anyways (add takes embeddings) we will need to do a api change for it and can do something holistic when it makes sense.

HammadB · 2023-12-06T18:07:12Z

Hm @jeffchuber thats a good point. We also may want to show all indexes on a collection etc. I like the idea of a describe method or info method that dumps all that information, it also makes the api more flexible for the case I was mentioning

tazarov · 2023-12-06T18:12:59Z

we can just then rename the existing dimensions to describe or info and add the dimensions as the first attributed to a dict of attributes.

HammadB · 2023-12-06T18:22:16Z

Lets do that - @jeffchuber are you OK with that?

HammadB · 2023-12-06T18:22:48Z

I like .info() personally!

jeffchuber · 2023-12-06T18:23:55Z

for renaming, let's enumerate 5-10 options (low effort, gives us confidence we are naming something well)

a valid workaround is len(collection.get(limit=1)['embeddings'][0]) right? if we simply want to expose this in our clients... perhaps it is better to do this than to add a new end to end method?

can we address this question?

HammadB · 2023-12-06T18:28:15Z

chromadb/api/segment.py

@@ -626,6 +626,11 @@ def _count(self, collection_id: UUID) -> int:
        return metadata_segment.count()

    @trace_method("SegmentAPI._query", OpenTelemetryGranularity.OPERATION)


this trace is wrong FYI

HammadB · 2023-12-06T18:29:21Z

a valid workaround is len(collection.get(limit=1)['embeddings'][0]) right? if we simply want to expose this in our clients... perhaps it is better to do this than to add a new end to end method?

I prefer doing something explicit, especially if we are renaming to info() (or similar) and plan to shove more information into this method in the future. Asking the client to fetch the embeddings when we store the dimensionality just feels hacky for no good reason . Also it makes future clients less implementation dependent. I like how you can code-gen a reasonable client right now. We could of course make the server do a get() but I think thats odd for no reason.

tazarov · 2023-12-06T18:31:11Z

a valid workaround is len(collection.get(limit=1)['embeddings'][0]) right? if we simply want to expose this in our clients... perhaps it is better to do this than to add a new end to end method?

can we address this question?

@jeffchuber,I think if we fetch embeddings, we'll load the whole vector index, which seems like a side effect we might not want, given that the info is already available in SQLite. To put it into perspective, if you want to show stats on the UI, every time someone looks at their collections, we'll load all of them into memory.

jeffchuber · 2023-12-06T18:33:27Z

ok im game for adding it - makes sense, thanks for the perspective

lets get the naming and interface right (likely a dict)

here's a chatGPT dump

describe: This name suggests the function provides a detailed description of the object, potentially including its attributes, types, and other relevant information. It implies a comprehensive overview rather than just a quick summary.

get_info: This is a straightforward and clear name. It indicates that the function retrieves and returns information about the object. This name is quite versatile and can be used in a variety of contexts.

settings: This name might be a bit misleading unless the function specifically returns configuration settings of the object. If the function is intended to return general information, this name might not be as appropriate.

details: Similar to describe, this name suggests that the function returns detailed information about the object. It implies a thorough examination.

overview: This name implies a more summarized view of the object, possibly less detailed than what describe or details might suggest.

inspect: This name suggests a detailed examination of the object, possibly including internal states and hidden attributes.

summarize: This name implies that the function provides a concise summary of the object, focusing on key aspects rather than extensive details.

get_attributes: If the function specifically returns the attributes of the object, this name would be very clear and descriptive.

tazarov · 2023-12-06T18:34:39Z

I wouldn't say I like settings as it implies you can change settings.

tazarov · 2023-12-06T18:35:18Z

In DB world I think describe is the go-to word.

--- EDIT
and describe feels very pythonic as in describe collection

tazarov · 2023-12-06T18:41:21Z

Synthetic data ftw:


Certainly! Let's score the method names on a scale of 1 to 10 based on their suitability in a database context for Python and Go, considering factors like clarity, convention, and intuitiveness:

describe:

Python: 9/10 (Widely recognized in similar contexts, very intuitive)
Go: 7/10 (Clear, but not as conventional in Go)
get_info:

Python: 8/10 (Pythonic and straightforward, but slightly generic)
Go: 8/10 (Clear and concise, fits Go's style)
details:

Python: 7/10 (Descriptive, but can be a bit vague)
Go: 8/10 (Direct and clear, good fit for Go)
inspect:

Python: 7/10 (Technical, suits Python's introspection but less common for DB)
Go: 6/10 (A bit less conventional, clear in purpose)
overview:

Python: 6/10 (Suggests summary, which might be misleading for detailed info)
Go: 5/10 (May not fully convey the depth of information provided)
summarize:

Python: 5/10 (Implies brevity, not ideal for detailed info)
Go: 4/10 (Might mislead about the function's depth)
settings:

Python: 3/10 (Misleading unless specifically about configuration settings)
Go: 3/10 (Not suitable unless for settings, could confuse)
get_attributes:

Python: 8/10 (Very clear if specifically about attributes, but narrow in scope)
Go: 7/10 (Explicit, but may be too specific if the method is more general)
These scores are based on how well the names align with each language's conventions and the expected functionality in a database context.

jeffchuber · 2023-12-06T18:46:40Z

i like describe. i also like the midwit naming power of GPT 😆

HammadB · 2023-12-06T18:47:25Z

Lets do describe then and make the output a TypedDict

tazarov · 2023-12-06T18:52:18Z

on it!

tazarov · 2023-12-06T20:06:31Z

@HammadB TAL, I think it's ready.

tazarov · 2024-09-17T05:15:36Z

Closing as dimensionality already exposed through collection model

feat: CIP - Expose Collection Dimensionality

c23952a

A new CIP to add ability for clients to view collection's dimensionality

tazarov and others added 2 commits October 22, 2023 19:44

Merge branch 'main' into feature/collection-dimensionality-exposed

63fd6df

fix: Formatting issues

4f9b59f

HammadB self-requested a review November 29, 2023 18:35

HammadB added the taz-sprint-1 label Nov 29, 2023

tazarov marked this pull request as draft December 1, 2023 08:07

tazarov added 2 commits December 2, 2023 15:49

Merge branch 'main' into feature/collection-dimensionality-exposed

028610f

# Conflicts: # chromadb/api/fastapi.py # chromadb/server/fastapi/__init__.py # chromadb/test/test_api.py # clients/js/yarn.lock

feat: Updated authz and tests to work with new API method

8d78aa9

tazarov marked this pull request as ready for review December 2, 2023 14:39

HammadB reviewed Dec 6, 2023

View reviewed changes

HammadB approved these changes Dec 6, 2023

View reviewed changes

HammadB reviewed Dec 6, 2023

View reviewed changes

feat: Changed dimensions to describe to encompass a wider range of uses.

1d7f6e1

HammadB added taz-sprint-2 and removed taz-sprint-1 labels Dec 8, 2023

tazarov closed this Sep 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH]: CIP - Expose Collection Dimensionality #1217

[ENH]: CIP - Expose Collection Dimensionality #1217

tazarov commented Oct 8, 2023

github-actions bot commented Oct 8, 2023

tazarov commented Dec 2, 2023

beggers commented Dec 5, 2023

tazarov commented Dec 5, 2023

HammadB commented Dec 6, 2023

HammadB Dec 6, 2023

tazarov Dec 6, 2023

jeffchuber commented Dec 6, 2023

HammadB left a comment

HammadB commented Dec 6, 2023

tazarov commented Dec 6, 2023

HammadB commented Dec 6, 2023

HammadB commented Dec 6, 2023

jeffchuber commented Dec 6, 2023

HammadB Dec 6, 2023

HammadB commented Dec 6, 2023 •

edited

Loading

tazarov commented Dec 6, 2023

jeffchuber commented Dec 6, 2023

tazarov commented Dec 6, 2023

tazarov commented Dec 6, 2023 •

edited

Loading

tazarov commented Dec 6, 2023

jeffchuber commented Dec 6, 2023

HammadB commented Dec 6, 2023

tazarov commented Dec 6, 2023

tazarov commented Dec 6, 2023

tazarov commented Sep 17, 2024

		@@ -626,6 +626,11 @@ def _count(self, collection_id: UUID) -> int:
		return metadata_segment.count()

		@trace_method("SegmentAPI._query", OpenTelemetryGranularity.OPERATION)

[ENH]: CIP - Expose Collection Dimensionality #1217

[ENH]: CIP - Expose Collection Dimensionality #1217

Conversation

tazarov commented Oct 8, 2023

Description of changes

Test plan

Documentation Changes

github-actions bot commented Oct 8, 2023

Reviewer Checklist

Testing, Bugs, Errors, Logs, Documentation

System Compatibility

Quality

tazarov commented Dec 2, 2023

beggers commented Dec 5, 2023

tazarov commented Dec 5, 2023

HammadB commented Dec 6, 2023

HammadB Dec 6, 2023

Choose a reason for hiding this comment

tazarov Dec 6, 2023

Choose a reason for hiding this comment

jeffchuber commented Dec 6, 2023

HammadB left a comment

Choose a reason for hiding this comment

HammadB commented Dec 6, 2023

tazarov commented Dec 6, 2023

HammadB commented Dec 6, 2023

HammadB commented Dec 6, 2023

jeffchuber commented Dec 6, 2023

HammadB Dec 6, 2023

Choose a reason for hiding this comment

HammadB commented Dec 6, 2023 • edited Loading

tazarov commented Dec 6, 2023

jeffchuber commented Dec 6, 2023

tazarov commented Dec 6, 2023

tazarov commented Dec 6, 2023 • edited Loading

tazarov commented Dec 6, 2023

jeffchuber commented Dec 6, 2023

HammadB commented Dec 6, 2023

tazarov commented Dec 6, 2023

tazarov commented Dec 6, 2023

tazarov commented Sep 17, 2024

HammadB commented Dec 6, 2023 •

edited

Loading

tazarov commented Dec 6, 2023 •

edited

Loading