Skip to content

Commit

Permalink
[ENH] OTel tracing throughout the codebase (chroma-core#1238)
Browse files Browse the repository at this point in the history
## Description of changes

This PR adds OpenTelemetry tracing to ~all major methods throughout our
codebase. It also adds configuration to specify where these traces
should be sent. I focused more on laying the groundwork for tracing than
on collecting all the data we need everywhere.

Default behavior is unchanged: no tracing, no printing.

*Summarize the changes made by this PR.*
 - New functionality
	 - OpenTelemetryClient with relevant config.
	 - Wrap most of our code in tracing.

The only major design decision I made was to fully separate
OpenTelemetry stuff and Posthog (product telemetry) stuff.

Justification:

It's tempting to combine OTel and product telemetry behind a single
internal interface. I don't think this coupling is worth it. Product
telemetry cares about a small and relatively static set of uses, whereas
tracing by nature should be very deep in our codebase. I see two ways to
couple them and problems with each:
- Have a unified Telemetry interface only for the events our product
telemetry cares about and use raw OTel for the rest. In other words, use
this telemetry interface only for `collection.add()`s,
`collection.delete()`s, etc. This seems weird to me: tracing code would
be implicit in some cases but explicit in others, making the codebase
less easily comprehensible. Also if an engineer later decides to add
product telemetry to a codepath that already has tracing, they need to
know to remove existing tracing. This increases the cognitive overhead
required to work on Chroma, reducing the readability and maintainability
of our codebase.
- Have a unified Telemetry interface which does everything. In this
case, it has the above behavior but also wraps all other OTel behavior
we want. This seems weird to me because we're basically writing a
wrapper around the complete set of OpenTelemetry behavior which only
modifies a small part of it. This increases our maintenance burden for
very little value.

Instead we have two well-encapsulated telemetry modules which we can
modify and use without worrying about the other telemetry. The OTel
module provides some lightweight helpers to make OTel a little easier to
use, but we can put raw OTel code throughout our codebase and it'll play
nicely.

## Test plan
*How are these changes tested?*

- [ ] Tests pass locally with `pytest` for python, `yarn test` for js

Manual testing:
- Set environment variables to export traces to Honeycomb at
 various granularities.
- Went through various basic Chroma flows and checked that traces show
up in Honeycomb as expected.

![Screenshot 2023-10-12 at 10 39 11
AM](https://github.com/chroma-core/chroma/assets/64657842/49c95054-ef7f-42b1-bb14-4b372edf9343)

## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
repository](https://github.com/chroma-core/docs)?*

Docs PR to land before this does.
  • Loading branch information
beggers authored Oct 18, 2023
1 parent 9d89b96 commit 99c8a99
Show file tree
Hide file tree
Showing 33 changed files with 766 additions and 195 deletions.
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ index_data
# Default configuration for persist_directory in chromadb/config.py
# Currently it's located in "./chroma/"
chroma/
chroma_test_data
chroma_test_data/
server.htpasswd

.venv
Expand Down
Binary file added chroma_data/chroma.sqlite3
Binary file not shown.
18 changes: 11 additions & 7 deletions chromadb/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,8 +35,8 @@
"QueryResult",
"GetResult",
]
from chromadb.telemetry.events import ClientStartEvent
from chromadb.telemetry import Telemetry
from chromadb.telemetry.product.events import ClientStartEvent
from chromadb.telemetry.product import ProductTelemetryClient


logger = logging.getLogger(__name__)
Expand All @@ -56,12 +56,14 @@
is_client = False
try:
from chromadb.is_thin_client import is_thin_client # type: ignore

is_client = is_thin_client
except ImportError:
is_client = False

if not is_client:
import sqlite3

if sqlite3.sqlite_version_info < (3, 35, 0):
if IN_COLAB:
# In Colab, hotswap to pysqlite-binary if it's too old
Expand All @@ -75,8 +77,11 @@
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")
else:
raise RuntimeError(
"\033[91mYour system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0.\033[0m\n"
"\033[94mPlease visit https://docs.trychroma.com/troubleshooting#sqlite to learn how to upgrade.\033[0m"
"\033[91mYour system has an unsupported version of sqlite3. Chroma \
requires sqlite3 >= 3.35.0.\033[0m\n"
"\033[94mPlease visit \
https://docs.trychroma.com/troubleshooting#sqlite to learn how \
to upgrade.\033[0m"
)


Expand Down Expand Up @@ -147,12 +152,11 @@ def Client(settings: Settings = __settings) -> API:

system = System(settings)

telemetry_client = system.instance(Telemetry)
product_telemetry_client = system.instance(ProductTelemetryClient)
api = system.instance(API)

system.start()

# Submit event for client start
telemetry_client.capture(ClientStartEvent())
product_telemetry_client.capture(ClientStartEvent())

return api
34 changes: 31 additions & 3 deletions chromadb/api/fastapi.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,12 @@
from chromadb.auth.providers import RequestsClientAuthProtocolAdapter
from chromadb.auth.registry import resolve_provider
from chromadb.config import Settings, System
from chromadb.telemetry import Telemetry
from chromadb.telemetry.opentelemetry import (
OpenTelemetryClient,
OpenTelemetryGranularity,
trace_method,
)
from chromadb.telemetry.product import ProductTelemetryClient
from urllib.parse import urlparse, urlunparse, quote

logger = logging.getLogger(__name__)
Expand All @@ -51,7 +56,8 @@ def _validate_host(host: str) -> None:
if "/" in host and (not host.startswith("http")):
raise ValueError(
"Invalid URL. "
"Seems that you are trying to pass URL as a host but without specifying the protocol. "
"Seems that you are trying to pass URL as a host but without \
specifying the protocol. "
"Please add http:// or https:// to the host."
)

Expand Down Expand Up @@ -92,7 +98,8 @@ def __init__(self, system: System):
system.settings.require("chroma_server_host")
system.settings.require("chroma_server_http_port")

self._telemetry_client = self.require(Telemetry)
self._opentelemetry_client = self.require(OpenTelemetryClient)
self._product_telemetry_client = self.require(ProductTelemetryClient)
self._settings = system.settings

self._api_url = FastAPI.resolve_url(
Expand Down Expand Up @@ -127,13 +134,15 @@ def __init__(self, system: System):
if self._header is not None:
self._session.headers.update(self._header)

@trace_method("FastAPI.heartbeat", OpenTelemetryGranularity.OPERATION)
@override
def heartbeat(self) -> int:
"""Returns the current server time in nanoseconds to check if the server is alive"""
resp = self._session.get(self._api_url)
raise_chroma_error(resp)
return int(resp.json()["nanosecond heartbeat"])

@trace_method("FastAPI.list_collections", OpenTelemetryGranularity.OPERATION)
@override
def list_collections(self) -> Sequence[Collection]:
"""Returns a list of all collections"""
Expand All @@ -146,6 +155,7 @@ def list_collections(self) -> Sequence[Collection]:

return collections

@trace_method("FastAPI.create_collection", OpenTelemetryGranularity.OPERATION)
@override
def create_collection(
self,
Expand All @@ -171,6 +181,7 @@ def create_collection(
metadata=resp_json["metadata"],
)

@trace_method("FastAPI.get_collection", OpenTelemetryGranularity.OPERATION)
@override
def get_collection(
self,
Expand All @@ -189,6 +200,9 @@ def get_collection(
metadata=resp_json["metadata"],
)

@trace_method(
"FastAPI.get_or_create_collection", OpenTelemetryGranularity.OPERATION
)
@override
def get_or_create_collection(
self,
Expand All @@ -200,6 +214,7 @@ def get_or_create_collection(
name, metadata, embedding_function, get_or_create=True
)

@trace_method("FastAPI._modify", OpenTelemetryGranularity.OPERATION)
@override
def _modify(
self,
Expand All @@ -214,12 +229,14 @@ def _modify(
)
raise_chroma_error(resp)

@trace_method("FastAPI.delete_collection", OpenTelemetryGranularity.OPERATION)
@override
def delete_collection(self, name: str) -> None:
"""Deletes a collection"""
resp = self._session.delete(self._api_url + "/collections/" + name)
raise_chroma_error(resp)

@trace_method("FastAPI._count", OpenTelemetryGranularity.OPERATION)
@override
def _count(self, collection_id: UUID) -> int:
"""Returns the number of embeddings in the database"""
Expand All @@ -229,6 +246,7 @@ def _count(self, collection_id: UUID) -> int:
raise_chroma_error(resp)
return cast(int, resp.json())

@trace_method("FastAPI._peek", OpenTelemetryGranularity.OPERATION)
@override
def _peek(self, collection_id: UUID, n: int = 10) -> GetResult:
return self._get(
Expand All @@ -237,6 +255,7 @@ def _peek(self, collection_id: UUID, n: int = 10) -> GetResult:
include=["embeddings", "documents", "metadatas"],
)

@trace_method("FastAPI._get", OpenTelemetryGranularity.OPERATION)
@override
def _get(
self,
Expand Down Expand Up @@ -279,6 +298,7 @@ def _get(
documents=body.get("documents", None),
)

@trace_method("FastAPI._delete", OpenTelemetryGranularity.OPERATION)
@override
def _delete(
self,
Expand All @@ -298,6 +318,7 @@ def _delete(
raise_chroma_error(resp)
return cast(IDs, resp.json())

@trace_method("FastAPI._submit_batch", OpenTelemetryGranularity.ALL)
def _submit_batch(
self,
batch: Tuple[
Expand All @@ -321,6 +342,7 @@ def _submit_batch(
)
return resp

@trace_method("FastAPI._add", OpenTelemetryGranularity.ALL)
@override
def _add(
self,
Expand All @@ -340,6 +362,7 @@ def _add(
raise_chroma_error(resp)
return True

@trace_method("FastAPI._update", OpenTelemetryGranularity.ALL)
@override
def _update(
self,
Expand All @@ -361,6 +384,7 @@ def _update(
resp.raise_for_status()
return True

@trace_method("FastAPI._upsert", OpenTelemetryGranularity.ALL)
@override
def _upsert(
self,
Expand All @@ -382,6 +406,7 @@ def _upsert(
resp.raise_for_status()
return True

@trace_method("FastAPI._query", OpenTelemetryGranularity.ALL)
@override
def _query(
self,
Expand Down Expand Up @@ -417,13 +442,15 @@ def _query(
documents=body.get("documents", None),
)

@trace_method("FastAPI.reset", OpenTelemetryGranularity.ALL)
@override
def reset(self) -> bool:
"""Resets the database"""
resp = self._session.post(self._api_url + "/reset")
raise_chroma_error(resp)
return cast(bool, resp.json())

@trace_method("FastAPI.get_version", OpenTelemetryGranularity.OPERATION)
@override
def get_version(self) -> str:
"""Returns the version of the server"""
Expand All @@ -437,6 +464,7 @@ def get_settings(self) -> Settings:
return self._settings

@property
@trace_method("FastAPI.max_batch_size", OpenTelemetryGranularity.OPERATION)
@override
def max_batch_size(self) -> int:
if self._max_batch_size == -1:
Expand Down
Loading

0 comments on commit 99c8a99

Please sign in to comment.