Commit

add: Pathway vector store and retriever as LangChain component (opea-project#342)

* nb

Signed-off-by: Berke <[email protected]>

* init changes

Signed-off-by: Berke <[email protected]>

* docker

Signed-off-by: Berke <[email protected]>

* example data

Signed-off-by: Berke <[email protected]>

* docs(readme): update, add commands

Signed-off-by: Berke <[email protected]>

* fix: formatting, data sources

Signed-off-by: Berke <[email protected]>

* docs(readme): update instructions, add comments

Signed-off-by: Berke <[email protected]>

* fix: rm unused parts

Signed-off-by: Berke <[email protected]>

* fix: image name, compose env vars

Signed-off-by: Berke <[email protected]>

* fix: rm unused part

Signed-off-by: Berke <[email protected]>

* fix: logging name

Signed-off-by: Berke <[email protected]>

* fix: env var

Signed-off-by: Berke <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Berke <[email protected]>

* fix: rename pw docker

Signed-off-by: Berke <[email protected]>

* docs(readme): update input sources

Signed-off-by: Berke <[email protected]>


* feat: mv vector store, naming, clarify instructions, improve ingestion components

Signed-off-by: Berke <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests: add pw retriever test
fix: update docker to include libmagic

Signed-off-by: Berke <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* implement suggestions from review, entrypoint, reqs, comments, https_proxy.

Signed-off-by: Berke <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: update docker tags in test and readme

Signed-off-by: Berke <[email protected]>

* tests: add separate pathway vectorstore test

Signed-off-by: Berke <[email protected]>

---------

Signed-off-by: Berke <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Sihan Chen <[email protected]>
3 people authored Aug 29, 2024
1 parent 6d4b668 commit 2c2322e
Showing 15 changed files with 618 additions and 0 deletions.
104 changes: 104 additions & 0 deletions comps/retrievers/langchain/pathway/README.md
@@ -0,0 +1,104 @@
# Retriever Microservice with Pathway

## 🚀Start Microservices

### With the Docker CLI

We suggest using `docker compose` to run this app; refer to the [`docker compose`](#with-the-docker-compose) section below.

If you prefer to run the services separately, follow the steps in this section.

#### (Optionally) Start the TEI (embedder) service separately

> Note that Docker Compose will start this service as well, so this step is optional.

```bash
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/retriever"
model=BAAI/bge-base-en-v1.5
revision=refs/pr/4
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"

# then run:
docker run -p 6060:80 -e http_proxy=$http_proxy -e https_proxy=$https_proxy --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 --model-id $model --revision $revision
```

Health check the embedding service with:

```bash
curl 127.0.0.1:6060/embed -X POST -d '{"inputs":"What is Deep Learning?"}' -H 'Content-Type: application/json'
```

If the model supports re-ranking, you can also use:

```bash
curl 127.0.0.1:6060/rerank -X POST -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' -H 'Content-Type: application/json'
```

#### Start Retriever Service

The retriever service queries the Pathway vector store on incoming requests.
Make sure that the Pathway vector store is already running; [see the Pathway vector store here](../../../vectorstores/langchain/pathway/README.md).

The retriever service needs the Pathway host and port variables to connect to the vector DB. Set the Pathway vector store environment variables:

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
```

```bash
# make sure you are in the root folder of the repo
docker build -t opea/retriever-pathway:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/retrievers/langchain/pathway/docker/Dockerfile .

docker run -p 7000:7000 -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy --network="host" opea/retriever-pathway:latest
```

### With the Docker compose

First, set the env variables:

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
export LANGCHAIN_PROJECT="opea/retriever"
model=BAAI/bge-base-en-v1.5
revision=refs/pr/4
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"
```

The text embeddings inference (TEI) service expects the `RETRIEVE_MODEL_ID` variable to be set.

```bash
export RETRIEVE_MODEL_ID=BAAI/bge-base-en-v1.5
```

Note that the following Docker Compose file sets `network_mode: host` for the retriever image to allow connecting to the local vector store.
This will start both the embedding and retriever services:

```bash
cd comps/retrievers/langchain/pathway/docker

docker compose -f docker_compose_retriever.yaml build
docker compose -f docker_compose_retriever.yaml up

# shut down the containers
docker compose -f docker_compose_retriever.yaml down
```

Make sure the retriever service is working as expected:

```bash
curl http://0.0.0.0:7000/v1/health_check -X GET -H 'Content-Type: application/json'
```

Send an example query:

```bash
exm_embeddings=$(python -c "import random; embedding = [random.uniform(-1, 1) for _ in range(768)]; print(embedding)")

curl http://0.0.0.0:7000/v1/retrieval -X POST -d "{\"text\":\"What is the revenue of Nike in 2023?\",\"embedding\":${exm_embeddings}}" -H 'Content-Type: application/json'
```
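For scripting, the same request body can be built in Python instead of shell interpolation. This is a sketch mirroring the `curl` call above: the embedding is random (as in the example), and the HTTP call itself is left to your client of choice.

```python
import json
import random

def build_retrieval_request(text: str, dim: int = 768) -> str:
    """Build the JSON body expected by the /v1/retrieval endpoint."""
    embedding = [random.uniform(-1, 1) for _ in range(dim)]
    return json.dumps({"text": text, "embedding": embedding})

body = build_retrieval_request("What is the revenue of Nike in 2023?")
print(len(json.loads(body)["embedding"]))  # 768
```

You can then POST `body` to `http://0.0.0.0:7000/v1/retrieval` with any HTTP client.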
30 changes: 30 additions & 0 deletions comps/retrievers/langchain/pathway/docker/Dockerfile
@@ -0,0 +1,30 @@

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM langchain/langchain:latest

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
libgl1-mesa-glx \
libjemalloc-dev \
vim

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

COPY comps /home/user/comps

USER user

RUN pip install --no-cache-dir --upgrade pip && \
if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
pip install --no-cache-dir -r /home/user/comps/retrievers/langchain/pathway/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/retrievers/langchain/pathway

ENTRYPOINT ["bash", "entrypoint.sh"]
@@ -0,0 +1,35 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3.8"

services:
tei_xeon_service:
image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.2
container_name: tei-xeon-server
ports:
- "6060:80"
volumes:
- "./data:/data"
shm_size: 1g
command: --model-id ${RETRIEVE_MODEL_ID}
retriever:
image: opea/retriever-pathway:latest
container_name: retriever-pathway-server
ports:
- "7000:7000"
ipc: host
network_mode: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
PATHWAY_HOST: ${PATHWAY_HOST}
PATHWAY_PORT: ${PATHWAY_PORT}
TEI_EMBEDDING_ENDPOINT: ${TEI_EMBEDDING_ENDPOINT}
LANGCHAIN_API_KEY: ${LANGCHAIN_API_KEY}
restart: unless-stopped

networks:
default:
driver: bridge
6 changes: 6 additions & 0 deletions comps/retrievers/langchain/pathway/entrypoint.sh
@@ -0,0 +1,6 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

pip --no-cache-dir install -r requirements-runtime.txt

python retriever_pathway.py
@@ -0,0 +1 @@
langsmith
12 changes: 12 additions & 0 deletions comps/retrievers/langchain/pathway/requirements.txt
@@ -0,0 +1,12 @@
docarray[full]
fastapi
frontend==0.0.3
huggingface_hub
langchain_community == 0.2.0
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
pathway
prometheus-fastapi-instrumentator
sentence_transformers
shortuuid
52 changes: 52 additions & 0 deletions comps/retrievers/langchain/pathway/retriever_pathway.py
@@ -0,0 +1,52 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os
import time

from langchain_community.vectorstores import PathwayVectorClient
from langsmith import traceable

from comps import (
EmbedDoc,
SearchedDoc,
ServiceType,
TextDoc,
opea_microservices,
register_microservice,
register_statistics,
statistics_dict,
)

host = os.getenv("PATHWAY_HOST", "127.0.0.1")
port = int(os.getenv("PATHWAY_PORT", 8666))

EMBED_MODEL = os.getenv("EMBED_MODEL", "BAAI/bge-base-en-v1.5")

tei_embedding_endpoint = os.getenv("TEI_EMBEDDING_ENDPOINT")


@register_microservice(
name="opea_service@retriever_pathway",
service_type=ServiceType.RETRIEVER,
endpoint="/v1/retrieval",
host="0.0.0.0",
port=7000,
)
@traceable(run_type="retriever")
@register_statistics(names=["opea_service@retriever_pathway"])
def retrieve(input: EmbedDoc) -> SearchedDoc:
start = time.time()
documents = pw_client.similarity_search(input.text, input.fetch_k)

docs = [TextDoc(text=r.page_content) for r in documents]

time_spent = time.time() - start
statistics_dict["opea_service@retriever_pathway"].append_latency(time_spent, None) # noqa: E501
return SearchedDoc(retrieved_docs=docs, initial_query=input.text)


if __name__ == "__main__":
# Create the vectorstore client
pw_client = PathwayVectorClient(host=host, port=port)
opea_microservices["opea_service@retriever_pathway"].start()
4 changes: 4 additions & 0 deletions comps/vectorstores/README.md
@@ -17,3 +17,7 @@ For details, please refer to this [readme](langchain/pgvector/README.md)
## Vectorstores Microservice with Pinecone

For details, please refer to this [readme](langchain/pinecone/README.md)

## Vectorstores Microservice with Pathway

For details, please refer to this [readme](langchain/pathway/README.md)
25 changes: 25 additions & 0 deletions comps/vectorstores/langchain/pathway/Dockerfile
@@ -0,0 +1,25 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM pathwaycom/pathway:0.13.2-slim

ENV DOCKER_BUILDKIT=1
ENV PYTHONUNBUFFERED=1

RUN apt-get update && apt-get install -y \
poppler-utils \
libreoffice \
libmagic-dev \
&& rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt /app/

RUN pip install --no-cache-dir -r requirements.txt

COPY vectorstore_pathway.py /app/


CMD ["python", "vectorstore_pathway.py"]

84 changes: 84 additions & 0 deletions comps/vectorstores/langchain/pathway/README.md
@@ -0,0 +1,84 @@
# Start the Pathway Vector DB Server

Set the environment variables for Pathway and the embedding model.

> Note: If you are using `TEI_EMBEDDING_ENDPOINT`, make sure the embedding service is already running.
> See the instructions [here](../../../retrievers/langchain/pathway/README.md).

```bash
export PATHWAY_HOST=0.0.0.0
export PATHWAY_PORT=8666
# TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" # uncomment if you want to use the hosted embedding service, example: "http://127.0.0.1:6060"
```

## Configuration

### Setting up the Pathway data sources

Pathway can listen to many sources simultaneously, such as local files, S3 folders, cloud storage, and any data stream. Whenever a new file is added or an existing file is modified, Pathway parses, chunks, and indexes the documents in real time.

See [pathway-io](https://pathway.com/developers/api-docs/pathway-io) for more information.

You can easily connect to the data inside the folder with the Pathway file system connector. The data will automatically be updated by Pathway whenever the content of the folder changes. In this example, we create a single data source that reads the files under the `./data` folder.

You can manage your data sources by configuring the `data_sources` in `vectorstore_pathway.py`.

```python
import pathway as pw

data = pw.io.fs.read(
"./data",
format="binary",
mode="streaming",
with_metadata=True,
) # This creates a Pathway connector that tracks
# all the files in the ./data directory

data_sources = [data]
```
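Multiple connectors can be listed in `data_sources`. As a sketch (the `./docs` folder name is hypothetical), a second file-system source read once in static mode could be added alongside the streaming one:

```python
import pathway as pw

# Streaming source: keeps watching ./data for new or modified files.
data = pw.io.fs.read(
    "./data",
    format="binary",
    mode="streaming",
    with_metadata=True,
)

# Hypothetical second source: read ./docs once at startup (static mode).
docs = pw.io.fs.read(
    "./docs",
    format="binary",
    mode="static",
    with_metadata=True,
)

data_sources = [data, docs]
```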

### Other configs (parser, splitter, and embedder)

The Pathway vector store handles the ingestion and processing of the documents.
You can configure the parser, the splitter, and the embedder.
Whenever a file is added or modified in one of the sources, Pathway will automatically ingest it.

By default, the `ParseUnstructured` parser, the `langchain.text_splitter.CharacterTextSplitter` splitter, and the `BAAI/bge-base-en-v1.5` embedder are used.

For more information, see the relevant Pathway docs:

- [Vector store docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/vectorstore)
- [parsers docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/parsers)
- [splitters docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/splitters)
- [embedders docs](https://pathway.com/developers/api-docs/pathway-xpacks-llm/embedders)
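To build intuition for what the splitter stage does, here is a simplified sliding-window chunker in plain Python. It is a stand-in for illustration only, not the actual `CharacterTextSplitter` implementation:

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 100)
print(len(chunks))  # 4 chunks, starting at offsets 0, 30, 60, 90
```

Overlapping chunks help a retriever return enough surrounding context even when a query matches text near a chunk boundary.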

## Building and running

Build the Docker image and run the Pathway vector store:

```bash
cd comps/vectorstores/langchain/pathway

docker build --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -t opea/vectorstore-pathway:latest .

# with a locally loaded model; you may add the `EMBED_MODEL` env variable to configure the model.
docker run -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy -v ./data:/app/data -p ${PATHWAY_PORT}:${PATHWAY_PORT} opea/vectorstore-pathway:latest

# with the hosted embedder (the `--network` argument is needed for the vector server to reach the embedding service)
docker run -e PATHWAY_HOST=${PATHWAY_HOST} -e PATHWAY_PORT=${PATHWAY_PORT} -e TEI_EMBEDDING_ENDPOINT=${TEI_EMBEDDING_ENDPOINT} -e http_proxy=$http_proxy -e https_proxy=$https_proxy -v ./data:/app/data -p ${PATHWAY_PORT}:${PATHWAY_PORT} --network="host" opea/vectorstore-pathway:latest
```

## Health check the vector store

Wait until the server finishes indexing the docs, then send the following request to check it:

```bash
curl -X 'POST' \
"http://$PATHWAY_HOST:$PATHWAY_PORT/v1/statistics" \
-H 'accept: */*' \
-H 'Content-Type: application/json'
```

This should respond with something like:

> `{"file_count": 1, "last_indexed": 1724325093, "last_modified": 1724317365}`
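When scripting the health check, the statistics response can be interpreted programmatically. The sketch below parses the JSON shown above; the readiness criterion is an assumption, so adjust it to your needs:

```python
import json

def is_ready(stats_json: str, min_files: int = 1) -> bool:
    """Consider the vector store ready once at least `min_files` docs are indexed."""
    stats = json.loads(stats_json)
    return stats.get("file_count", 0) >= min_files and stats.get("last_indexed") is not None

sample = '{"file_count": 1, "last_indexed": 1724325093, "last_modified": 1724317365}'
print(is_ready(sample))  # True
```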
Binary file not shown.
4 changes: 4 additions & 0 deletions comps/vectorstores/langchain/pathway/requirements.txt
@@ -0,0 +1,4 @@
langchain_openai
pathway[xpack-llm] >= 0.14.1
sentence_transformers
unstructured[all-docs] >= 0.10.28,<0.15