[FEA]: Create Sherlock example for VDB Upload #1298

Closed
4 tasks done
mdemoret-nv opened this issue Oct 22, 2023 · 1 comment
Labels
feature request New feature or request


mdemoret-nv commented Oct 22, 2023

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

High

Please provide a clear description of problem this feature solves

As part of the Sherlock work, an example showing how to use Morpheus to upload documents to a Vector Database (VDB) is needed.

Describe your ideal solution

Purpose

The purpose of this example is to illustrate how a user could build a pipeline which will take a set of documents, split those documents into chunks, calculate the embedding vector for each chunk, and upload those chunks with the embedding to a VDB.
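The chunking step can be sketched as follows. This is a minimal illustration, not the Morpheus implementation: the function name `split_into_chunks`, the word-based splitting, and the `chunk_size` and `overlap` parameters are all assumptions made for the example.

```python
def split_into_chunks(text: str, chunk_size: int = 5, overlap: int = 2) -> list[str]:
    """Split text into overlapping chunks of whitespace-separated words.

    Hypothetical helper: a real pipeline would more likely split on tokens
    or characters, sized to the embedding model's input limit.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("a b c d e f g h", chunk_size=4, overlap=1):
        print(chunk)
```

Overlapping chunks keep some shared context across chunk boundaries, at the cost of embedding and storing some words more than once.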

Scenario

This example shows a single implementation, but the pipeline and components could be used in many scenarios with different requirements. At a high level, the following illustrates the customization points for this pipeline and the specific choices made for this example:

  • Source documents
    • This pipeline could support any type of document which can be converted into text. This includes PDFs, web pages, structured documents and even images (with OCR).
    • For this example, we will use RSS feeds and a web scraper as the source for our documents. This was chosen because it simulates a real-world cybersecurity scenario (cybersecurity RSS feeds could be used to build a repository of knowledge for a security chatbot) and does not require any dataset or API keys to function.
  • Embedding model
    • This pipeline can support any type of embedding model that can convert text into a vector of floats.
    • We have tested this pipeline with several different models available on Hugging Face, including paraphrase-multilingual-mpnet-base-v2, e5-large-v2 and all-mpnet-base-v2.
    • For the example, we will use all-MiniLM-L6-v2 since it is a small, quick model with a small embedding dimension of 384.
  • Vector DB Service
    • Any vector database can be used to store the resulting embedding and corresponding metadata.
    • It would be trivial to update the example to use Chroma or FAISS if needed.
    • For the example, we will use Milvus since we have been working closely with them on GPU-accelerated indices.

Implementation

This example will be composed of three components, each set up as a separate click command.

Export model component

This command exports the embedding model into a Triton model repository so that it can be loaded by Triton. Any BERT-based model hosted on Hugging Face can be exported. The command downloads the model, adds some layers at the end for average pooling and normalization, and exports the result using the PyTorch -> ONNX exporter. The model can then be imported by Triton and optimized with the built-in ONNX -> TRT converter.
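To make the pooling and normalization step concrete, here is a small pure-Python sketch of the arithmetic. It assumes the pooling averages only non-padding tokens (using an attention mask) and that the normalization is L2; the issue states neither detail, and the exported model would implement these as layers in the network rather than as Python functions. The names `mean_pool` and `l2_normalize` are hypothetical.

```python
import math


def mean_pool(token_embeddings: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors whose mask entry is 1 (assumed: 0 marks padding).

    Expects at least one token vector; all vectors must share one dimension.
    """
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                sums[i] += value
    count = max(count, 1)  # avoid dividing by zero when every token is masked
    return [total / count for total in sums]


def l2_normalize(vector: list[float], eps: float = 1e-12) -> list[float]:
    """Scale the vector to unit L2 length (eps guards against a zero vector)."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / max(norm, eps) for value in vector]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
    mask = [1, 1, 0]
    print(l2_normalize(mean_pool(tokens, mask)))
```

Normalizing to unit length means a dot product between two embeddings equals their cosine similarity, which is convenient for vector search.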

By default, the pipelines will use the all-MiniLM-L6-v2 model, which has already been exported and saved into the repo using Git LFS. This model is preferred because it is small (only 90 MB when exported) and fast.

Morpheus pipeline

The Morpheus pipeline is built using the following components:

  1. Ingest the RSS documents using our RSSSourceStage
  2. Convert the URLs into text using a custom WebScraperStage
    1. This stage downloads the HTML, then uses the BeautifulSoup library to extract the text. Other options exist but are much slower.
  3. The embedding is calculated using stages from the SID workflow
    1. The PreprocessNLPStage calculates the tokens for each chunk
    2. The TritonInferenceStage determines the embedding using the all-MiniLM-L6-v2 model
  4. Finally, the embedding and documents are uploaded to the VectorDB using the WriteToVectorDBStage
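The text-extraction part of step 2 can be illustrated with a short sketch. Note the substitution: the stage described above uses BeautifulSoup, while this example uses Python's standard-library `html.parser` so that it runs without extra dependencies. It is not the WebScraperStage code, and the names `_TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            stripped = data.strip()
            if stripped:
                self._parts.append(stripped)

    def text(self) -> str:
        return " ".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print(html_to_text(page))
```

The extracted text would then be split into chunks and passed on to the tokenization and inference stages.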

LangChain pipeline (Optional)

As a performance comparison, we should provide the equivalent pipeline using only LangChain so that the comparison is apples-to-apples. A few notes about the existing LangChain command currently in the prototype:

  • The LangChain library has an RSSLoader, but it is not available in the 0.0.190 release. This release is the latest we can use from Conda because the next release requires pandas 2.0+, which conflicts with the requirements of cuDF.
  • Out of the box, the RSSLoader uses a much more involved web scraper, which is much slower. To perform a true apples-to-apples comparison, it would need to use the BeautifulSoup parser.
  • The LangChain pipelines can be very slow, so getting true performance metrics can be difficult. When using a ConfluenceLoader, we were able to see ~17x performance improvements over LangChain.

Completion Criteria

The following items need to be satisfied to consider this issue complete:

  • A README.md containing information on the following:
    • Background information about the problem at hand
    • Information about the specific implementation and reasoning behind design decisions (use content from this issue)
    • Step-by-step instructions for the following:
      • How to export a different model from Hugging Face (use e5-large-v2)
      • How to run the Morpheus pipeline
        • Including instructions on starting a Milvus service
        • Including instructions on starting the Triton service
      • How to run the LangChain pipeline (optional)
    • The README.md should be linked in the developer docs
  • A functioning export model command which satisfies the following:
    • Should run without error using all default arguments
    • Export a Triton model which can be loaded without modification by Triton
    • Should work for the following models: paraphrase-multilingual-mpnet-base-v2, e5-large-v2 and all-mpnet-base-v2
    • Have logging which can be increased to provide debugging details
  • A functioning Morpheus pipeline command which satisfies the following:
    • Should run without error using all default arguments
    • Correctly calculate embedding for supplied documents
    • Provide information about the success or failure of the pipeline, including the number of uploaded documents, throughput and total runtime
  • (Optional) A functioning LangChain pipeline command which satisfies the following:
    • Should run without error using all default arguments
    • Provide similar results to the Morpheus pipeline
  • Tests should be added which include the following:
    • Test successfully exporting a model
    • Test successfully running the Morpheus pipeline
    • (Optional) Test successfully running the LangChain pipeline
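
One way to satisfy the reporting criterion for the Morpheus pipeline command (success or failure, number of uploaded documents, throughput and total runtime) is a small summary object like the sketch below. The `RunSummary` class, its field names and its output format are all hypothetical; the issue does not prescribe how this information should be reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical end-of-run report for a pipeline command."""

    succeeded: bool
    uploaded_documents: int
    runtime_seconds: float

    @property
    def throughput(self) -> float:
        """Documents uploaded per second (0.0 if no time was recorded)."""
        if self.runtime_seconds <= 0:
            return 0.0
        return self.uploaded_documents / self.runtime_seconds

    def format(self) -> str:
        status = "succeeded" if self.succeeded else "failed"
        return (
            f"Pipeline {status}: {self.uploaded_documents} documents uploaded "
            f"in {self.runtime_seconds:.2f} s ({self.throughput:.1f} documents/s)"
        )


if __name__ == "__main__":
    print(RunSummary(succeeded=True, uploaded_documents=500, runtime_seconds=20.0).format())
```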

Dependent Issues

The following issues should be resolved before this can be completed:

Tasks

  1. feature request sherlock
  2. feature request sherlock
    bsuryadevara
  3. 0 of 4
    sherlock
    cwharris
  4. feature request sherlock
    bsuryadevara

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
@mdemoret-nv
Contributor Author

Closing since it was completed in 23.11

Projects
Status: Done