-
Notifications
You must be signed in to change notification settings - Fork 144
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add WebSearch Retriever Microservice (#209)
* add gateway Signed-off-by: Spycsh <[email protected]> * Add web retriever comp * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add entry * add persistent dir * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add correct delete * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix test * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update test_web_retrievers_langchain.sh * Update test_web_retrievers_langchain.sh * revert * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Spycsh <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: chen, suyue <[email protected]> Signed-off-by: Wu, Xiaochang <[email protected]>
- Loading branch information
Showing
10 changed files
with
315 additions
and
1 deletion.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,6 +34,7 @@ | |
CodeTransGateway, | ||
DocSumGateway, | ||
TranslationGateway, | ||
SearchQnAGateway, | ||
AudioQnAGateway, | ||
) | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
# Web Retriever Microservice | ||
|
||
The Web Retriever Microservice is designed to efficiently search web pages relevant to the prompt, save them into the VectorDB, and retrieve the matched documents with the highest similarity. The retrieved documents will be used as context in the prompt to LLMs. Different from the normal RAG process, a web retriever can leverage advanced search engines for more diverse demands, such as real-time news, verifiable sources, and diverse sources. | ||
|
||
# Start Microservice with Docker | ||
|
||
## Build Docker Image | ||
|
||
```bash | ||
cd ../../ | ||
docker build -t opea/web-retriever-chroma:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/web_retrievers/langchain/chroma/docker/Dockerfile . | ||
``` | ||
|
||
## Start TEI Service | ||
|
||
```bash | ||
model=BAAI/bge-base-en-v1.5 | ||
revision=refs/pr/4 | ||
volume=$PWD/data | ||
docker run -d -p 6060:80 -v $volume:/data -e http_proxy=$http_proxy -e https_proxy=$https_proxy --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 --model-id $model --revision $revision | ||
``` | ||
|
||
## Start Web Retriever Service | ||
|
||
```bash | ||
# set TEI endpoint | ||
export TEI_EMBEDDING_ENDPOINT="http://${your_ip}:6060" | ||
|
||
# set search engine env variables | ||
export GOOGLE_API_KEY=xxx | ||
export GOOGLE_CSE_ID=xxx | ||
``` | ||
|
||
```bash | ||
docker run -d --name="web-retriever-chroma-server" -p 7078:7077 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e TEI_EMBEDDING_ENDPOINT=$TEI_EMBEDDING_ENDPOINT opea/web-retriever-chroma:latest | ||
``` | ||
|
||
## Consume Web Retriever Service | ||
|
||
To consume the Web Retriever Microservice, you can generate a mock embedding vector of length 768 with Python. | ||
|
||
```bash | ||
# Test | ||
your_embedding=$(python -c "import random; embedding = [random.uniform(-1, 1) for _ in range(768)]; print(embedding)") | ||
|
||
curl http://${your_ip}:7077/v1/web_retrieval \ | ||
-X POST \ | ||
-d "{\"text\":\"What is OPEA?\",\"embedding\":${your_embedding}}" \ | ||
-H 'Content-Type: application/json' | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
|
||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
FROM langchain/langchain:latest | ||
|
||
RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \ | ||
libgl1-mesa-glx \ | ||
libjemalloc-dev \ | ||
vim | ||
|
||
COPY comps /home/user/comps | ||
|
||
RUN pip install --no-cache-dir --upgrade pip && \ | ||
pip install --no-cache-dir -r /home/user/comps/web_retrievers/langchain/chroma/requirements.txt | ||
|
||
ENV PYTHONPATH=$PYTHONPATH:/home/user | ||
|
||
WORKDIR /home/user/comps/web_retrievers/langchain/chroma | ||
|
||
ENTRYPOINT ["python", "retriever_chroma.py"] |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
bs4 | ||
chromadb | ||
docarray[full] | ||
fastapi | ||
google-api-python-client>=2.100.0 | ||
html2text | ||
langchain-huggingface | ||
langchain_community | ||
opentelemetry-api | ||
opentelemetry-exporter-otlp | ||
opentelemetry-sdk | ||
prometheus-fastapi-instrumentator | ||
sentence_transformers | ||
shortuuid |
121 changes: 121 additions & 0 deletions
121
comps/web_retrievers/langchain/chroma/retriever_chroma.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
import os | ||
import time | ||
|
||
from langchain.text_splitter import RecursiveCharacterTextSplitter | ||
from langchain_community.document_loaders import AsyncHtmlLoader | ||
from langchain_community.document_transformers import Html2TextTransformer | ||
from langchain_community.utilities import GoogleSearchAPIWrapper | ||
from langchain_community.vectorstores import Chroma | ||
from langchain_huggingface import HuggingFaceEndpointEmbeddings | ||
|
||
from comps import ( | ||
EmbedDoc768, | ||
SearchedDoc, | ||
ServiceType, | ||
TextDoc, | ||
opea_microservices, | ||
register_microservice, | ||
register_statistics, | ||
statistics_dict, | ||
) | ||
|
||
tei_embedding_endpoint = os.getenv("TEI_EMBEDDING_ENDPOINT") | ||
|
||
|
||
def get_urls(query, num_search_result=1): | ||
result = search.results(query, num_search_result) | ||
return result | ||
|
||
|
||
def retrieve_htmls(all_urls): | ||
loader = AsyncHtmlLoader(all_urls, ignore_load_errors=True, trust_env=True) | ||
docs = loader.load() | ||
return docs | ||
|
||
|
||
def parse_htmls(docs): | ||
print("Indexing new urls...") | ||
|
||
html2text = Html2TextTransformer() | ||
docs = list(html2text.transform_documents(docs)) | ||
docs = text_splitter.split_documents(docs) | ||
|
||
return docs | ||
|
||
|
||
def dump_docs(docs): | ||
vector_db.add_documents(docs) | ||
|
||
|
||
@register_microservice( | ||
name="opea_service@web_retriever_chroma", | ||
service_type=ServiceType.RETRIEVER, | ||
endpoint="/v1/web_retrieval", | ||
host="0.0.0.0", | ||
port=7077, | ||
) | ||
@register_statistics(names=["opea_service@web_retriever_chroma", "opea_service@search"]) | ||
def web_retrieve(input: EmbedDoc768) -> SearchedDoc: | ||
start = time.time() | ||
query = input.text | ||
embedding = input.embedding | ||
|
||
# Google Search the results, parse the htmls | ||
search_results = get_urls(query) | ||
urls_to_look = [] | ||
for res in search_results: | ||
if res.get("link", None): | ||
urls_to_look.append(res["link"]) | ||
urls = list(set(urls_to_look)) | ||
print(f"urls: {urls}") | ||
docs = retrieve_htmls(urls) | ||
docs = parse_htmls(docs) | ||
print(docs) | ||
# Remove duplicated docs | ||
unique_documents_dict = {(doc.page_content, tuple(sorted(doc.metadata.items()))): doc for doc in docs} | ||
unique_documents = list(unique_documents_dict.values()) | ||
statistics_dict["opea_service@search"].append_latency(time.time() - start, None) | ||
|
||
# dump to vector_db | ||
dump_docs(unique_documents) | ||
|
||
# Do the retrieval | ||
search_res = vector_db.similarity_search_by_vector(embedding=embedding) | ||
|
||
searched_docs = [] | ||
|
||
for r in search_res: | ||
# include the metadata into the retrieved docs content | ||
description_str = f"\n description: {r.metadata['description']} \n" if "description" in r.metadata else "" | ||
title_str = f"\n title: {r.metadata['title']} \n" if "title" in r.metadata else "" | ||
source_str = f"\n source: {r.metadata['source']} \n" if "source" in r.metadata else "" | ||
text_with_meta = f"{r.page_content} {description_str} {title_str} {source_str}" | ||
searched_docs.append(TextDoc(text=text_with_meta)) | ||
|
||
result = SearchedDoc(retrieved_docs=searched_docs, initial_query=query) | ||
statistics_dict["opea_service@web_retriever_chroma"].append_latency(time.time() - start, None) | ||
|
||
# For Now history is banned | ||
if vector_db.get()["ids"]: | ||
vector_db.delete(vector_db.get()["ids"]) | ||
return result | ||
|
||
|
||
if __name__ == "__main__": | ||
# Create vectorstore | ||
tei_embedding_endpoint = os.getenv("TEI_EMBEDDING_ENDPOINT") | ||
# vectordb_persistent_directory = os.getenv("VECTORDB_PERSISTENT_DIR", "/home/user/chroma_db_oai") | ||
vector_db = Chroma( | ||
embedding_function=HuggingFaceEndpointEmbeddings(model=tei_embedding_endpoint), | ||
# persist_directory=vectordb_persistent_directory | ||
) | ||
|
||
google_api_key = os.environ.get("GOOGLE_API_KEY") | ||
google_cse_id = os.environ.get("GOOGLE_CSE_ID") | ||
search = GoogleSearchAPIWrapper(google_api_key=google_api_key, google_cse_id=google_cse_id, k=10) | ||
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50) | ||
|
||
opea_microservices["opea_service@web_retriever_chroma"].start() |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
#!/bin/bash | ||
# Copyright (C) 2024 Intel Corporation | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
set -xe | ||
|
||
WORKPATH=$(dirname "$PWD") | ||
ip_address=$(hostname -I | awk '{print $1}') | ||
function build_docker_images() { | ||
cd $WORKPATH | ||
docker build --no-cache -t opea/web-retriever-chroma:comps --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/web_retrievers/langchain/chroma/docker/Dockerfile . | ||
} | ||
|
||
function start_service() { | ||
|
||
# tei endpoint | ||
tei_endpoint=5018 | ||
model="BAAI/bge-base-en-v1.5" | ||
docker run -d --name="test-comps-web-retriever-tei-endpoint" -p $tei_endpoint:80 -v ./data:/data --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 --model-id $model | ||
export TEI_EMBEDDING_ENDPOINT="http://${ip_address}:${tei_endpoint}" | ||
|
||
# chroma web retriever | ||
retriever_port=5019 | ||
unset http_proxy | ||
docker run -d --name="test-comps-web-retriever-chroma-server" -p ${retriever_port}:7077 --ipc=host -e GOOGLE_API_KEY=$GOOGLE_API_KEY -e GOOGLE_CSE_ID=$GOOGLE_CSE_ID -e TEI_EMBEDDING_ENDPOINT=$TEI_EMBEDDING_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/web-retriever-chroma:comps | ||
|
||
sleep 3m | ||
} | ||
|
||
function validate_microservice() { | ||
retriever_port=5019 | ||
export PATH="${HOME}/miniforge3/bin:$PATH" | ||
test_embedding=$(python -c "import random; embedding = [random.uniform(-1, 1) for _ in range(768)]; print(embedding)") | ||
http_proxy='' curl http://${ip_address}:$retriever_port/v1/web_retrieval \ | ||
-X POST \ | ||
-d "{\"text\":\"What is OPEA?\",\"embedding\":${test_embedding}}" \ | ||
-H 'Content-Type: application/json' | ||
docker logs test-comps-web-retriever-tei-endpoint | ||
} | ||
|
||
function stop_docker() { | ||
cid_retrievers=$(docker ps -aq --filter "name=test-comps-web*") | ||
if [[ ! -z "$cid_retrievers" ]]; then | ||
docker stop $cid_retrievers && docker rm $cid_retrievers && sleep 1s | ||
fi | ||
} | ||
|
||
function main() { | ||
|
||
stop_docker | ||
|
||
build_docker_images | ||
start_service | ||
|
||
validate_microservice | ||
|
||
stop_docker | ||
echo y | docker system prune | ||
|
||
} | ||
|
||
main |