Information discovery using generative artificial intelligence (AI). This service allows for configuring multiple topics for users, so they can send queries and get intelligent AI-generated responses.
Provides an API for asking about specific pre-configured topics.
Most topics will augment queries with relevant information from a knowledge library for that topic. Augmented queries will then be sent to a foundation large language model (LLM) for a response.
This is intended to primarily support a Retrieval Augmented Generation (RAG) architecture, where there is a knowledge library related to a topic.
The API itself is configurable per topic, so if a RAG architecture doesn't make sense for all topics, there is flexibility to support others.
In RAG, upon receiving a query, additional information is retrieved from a knowledge library, its relevancy to the user's query is assessed, and the prompt sent to a foundation LLM is augmented with that additional context from the knowledge library (alongside a configured system prompt that guides the LLM on how it should interpret the context and respond).
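To make the flow concrete, here's a minimal sketch using an in-memory Chroma collection (one of the supported knowledge libraries listed below). This is illustrative only; the documents, collection name, and prompt assembly are assumptions, not the service's actual topic-chain code:

```python
import chromadb

# Illustrative sketch only -- not the service's actual topic-chain code.
client = chromadb.Client()  # in-memory vector database
collection = client.create_collection("default_topic")

# Knowledge library: store documents. Chroma's default embedding function is
# used for brevity; the service uses Google or OpenAI embeddings instead.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Study ABC contains COVID-19 serology results for 10,000 participants.",
        "Study XYZ contains imaging data for a lung cancer cohort.",
    ],
)

# Retrieval: find documents relevant to the user's query.
query = "Do you have COVID data?"
results = collection.query(query_texts=[query], n_results=2)
context = "\n".join(results["documents"][0])

# Augmentation: combine the configured system prompt, retrieved context, and
# the query. Sending this augmented prompt to a foundation LLM is the final
# step (omitted here).
system_prompt = "You are acting as a search assistant for a researcher."
prompt = f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {query}"
```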
Knowledge Library:
- ✅ Chroma in-memory vector database
- ❔ Google Vertex AI Vector Search
- ❔ AWS Aurora Postgres with pgvector
- ❔ Others
Knowledge Library Embeddings:
- ✅ Google Embeddings
- ✅ OpenAI Embeddings
AI Model Support:
- ✅ Google Models (configurable, default model: `chat-bison`. IMPORTANT NOTE: Please use `gemini-1.5-flash` instead; `chat-bison` is Google's PaLM 2 model, which was decommissioned by Google in October 2024, so you should prefer Gemini models such as `gemini-1.5-flash`.)
  - See their docs for more model options
- ✅ OpenAI's Models (configurable, default model: `gpt-3.5-turbo`)
- ❔ AWS Models
- ❔ Open Source Models
- ❔ Trained/tuned model(s)
- ❔ Others
Note: Our use of `langchain` makes adding new models and even architectures beyond RAG possible. Developers should look at the code in the `gen3discoveryai/topic_chains` folder.
Gen3 builds on other open source libraries, specifications, and tools when we can, and we tend to lean towards the best tools in the community or research space as it evolves (especially in cases where we're on the bleeding edge like this).
In the case of generative AI and LLMs, there is a lot of excellent work out there. We are building this on the shoulders of giants for many of the knowledge libraries and foundation model interactions. We're using `langchain` and `chromadb`, among others. We've even contributed back to open source tools like `chromadb` to improve its ability to operate in a FIPS-compliant environment. ❤️
This documented setup relies on both our Google Vertex AI support and OpenAI support.
OpenAI is NOT intended for production use in Gen3 (due to FedRAMP requirements).
Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of a valid credentials JSON file (likely a service account key). See the Google Cloud Platform docs for more info.
The credentials will need IAM permissions in a Google Project with Google Vertex AI enabled (which requires the setup of a billing account). The IAM permissions required are captured in Google's predefined role: Vertex AI User.
Create an OpenAI API account and get an OpenAI API key (you have to attach a credit card).
NOTE: You should set a reasonable spend limit; the default is large.
Topics are a combination of system prompt, description, what topic chain to use, and additional metadata (usually used by the topic chain). This service can support multiple topics at once with different topic chains.
For example, you can have a "Gen3 Documentation" topic and a "Data Commons Datasets" topic at the same time.
The configuration is done via a `.env` file, which allows environment variable overrides if you don't want to use the actual file. Here's an example `.env` file you can copy and modify:
########## Secrets ##########
OPENAI_API_KEY=REDACTED
GOOGLE_APPLICATION_CREDENTIALS=/home/user/creds.json
########## Topic Configuration ##########
# you must have `default`; you can add others as a comma-separated string
TOPICS=default,anothertopic,gen3docs
# default setup. These will be used both for the actual default topic AND as the value for subsequent topics
# when a configuration is not provided. e.g. if you don't provide FOOBAR_SYSTEM_PROMPT then the DEFAULT_SYSTEM_PROMPT
# will be used
DEFAULT_SYSTEM_PROMPT=You are acting as a search assistant for a researcher who will be asking you questions about data available in a particular system. If you believe the question is not relevant to data in the system, do not answer. The researcher is likely trying to find data of interest for a particular reason or with specific criteria. You answer and recommend datasets that may be of interest to that researcher based on the context you're provided. If you are using any particular context to answer, you should cite that and tell the user where they can find more information. The user may not be able to see the documents you are citing, so provide the relevant information in your response. If you don't know the answer, just say that you don't know, don't try to make up an answer. If you don't believe what the user is looking for is available in the system based on the context, say so instead of trying to explain how to go somewhere else.
DEFAULT_RAW_METADATA=model_name:chat-bison,embedding_model_name:textembedding-gecko@003,model_temperature:0.3,max_output_tokens:512,num_similar_docs_to_find:7,similarity_score_threshold:0.6
DEFAULT_DESCRIPTION=Ask about available datasets, powered by public dataset metadata like study descriptions
# Additional topic configurations
ANOTHERTOPIC_DESCRIPTION=Ask about available datasets, powered by public dataset metadata like study descriptions
ANOTHERTOPIC_RAW_METADATA=model_name:gpt-3.5-turbo,model_temperature:0.45,num_similar_docs_to_find:6,similarity_score_threshold:0.75
ANOTHERTOPIC_SYSTEM_PROMPT=You answer questions about datasets that are available in the system. You'll be given relevant dataset descriptions for every dataset that's been ingested into the system. You are acting as a search assistant for a biomedical researcher (who will be asking you questions). The researcher is likely trying to find datasets of interest for a particular research question. You should recommend datasets that may be of interest to that researcher.
ANOTHERTOPIC_CHAIN_NAME=TopicChainOpenAiQuestionAnswerRAG
GEN3DOCS_SYSTEM_PROMPT=You will be given relevant context from all the public documentation surrounding an open source software called Gen3. You are acting as an assistant to a new Gen3 developer, who is going to ask a question. Try to answer their question based on the context, but know that some of the context may be out of date. Let the developer know where they can get more information if relevant and cite portions of the context.
GEN3DOCS_RAW_METADATA=model_name:chat-bison,embedding_model_name:textembedding-gecko-multilingual@001,model_temperature:0.5,max_output_tokens:512,num_similar_docs_to_find:7,similarity_score_threshold:0.5
GEN3DOCS_DESCRIPTION=Ask about Gen3, powered by public markdown files in the UChicago Center for Translational Data Science's GitHub
########## Debugging and Logging Configurations ##########
# DEBUG makes the logging go from INFO to DEBUG
DEBUG=False
# VERBOSE_LLM_LOGS makes the logging for the chains much more verbose (useful for testing issues in the chain, but
# pretty noisy for any other time)
VERBOSE_LLM_LOGS=False
# DEBUG_SKIP_AUTH will COMPLETELY SKIP AUTHORIZATION for debugging purposes
DEBUG_SKIP_AUTH=False
The topic configurations are flexible and support arbitrary new names (`{{TOPIC NAME}}_SYSTEM_PROMPT`, etc.). See `gen3discoveryai/config.py` for details.
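Each `*_RAW_METADATA` value above is a comma-separated string of `key:value` pairs. As a minimal illustration (this parsing snippet is an assumption for clarity, not the service's actual implementation), such a string breaks down like this:

```python
# Illustrative only: how a RAW_METADATA string maps to individual settings.
raw_metadata = (
    "model_name:gemini-1.5-flash,model_temperature:0.3,max_output_tokens:512,"
    "num_similar_docs_to_find:7,similarity_score_threshold:0.6"
)
settings = dict(pair.split(":", 1) for pair in raw_metadata.split(","))
print(settings["model_name"])                  # gemini-1.5-flash
print(settings["similarity_score_threshold"])  # 0.6
```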
In order to utilize the topic chains effectively, you likely need to store some data in the knowledge library. You can write your own script or use the one described below, which currently supports loading from arbitrary TSVs or Markdown files in a directory.
IMPORTANT: Make sure when using the `/bin` scripts that the `.env` service configuration is set up and appropriately loaded (e.g. execute the script from a directory where there is a `.env` config). The `/bin` scripts REQUIRE loading the configuration in order to both load the available topics and properly embed and load into the vectorstore.
Here's the knowledge load script, which takes a single required argument: a directory where the TSVs are. See other options with `--help`:
poetry run python ./bin/load_into_knowledge_store.py tsvs --help
NOTE: This expects that filenames for a specific topic start with that topic name. You can have multiple files per topic, but they need to start with the topic name. You can also have nested directories; this will search recursively.
An example `/tsvs` directory:
- default.tsv
- bdc/
    - bdc1.tsv
    - bdc2.tsv
Example run:
poetry run python ./bin/load_into_knowledge_store.py tsvs ./tsvs
If you're using this for Gen3 Metadata, you can easily download public metadata from Gen3 to a TSV and use that as input (see our Gen3 SDK Metadata functionality for details).
There's an example script that downloads all the public Markdown files from our GitHub org. You can reference the `bin/download_files_from_github.py` example script if interested.
Once you have Markdown files in a directory, you just need to use the `./bin/load_into_knowledge_store.py` utility and supply the directory and topic.
See available options:
poetry run python ./bin/load_into_knowledge_store.py markdown --help
NOTE: Unlike TSVs, loading from a Markdown directory requires specifying a single topic for all files in that directory.
Example run:
poetry run python ./bin/load_into_knowledge_store.py markdown --topic anothertopic ./bin/library
If loading from TSVs or Markdown doesn't work easily for you, you should be able to modify the `./bin/load_into_knowledge_store.py` script to your needs by using a different langchain document loader. The base `TopicChain` class includes a `store_knowledge` method which expects a list of `langchain` documents. This is the default output of `langchain.text_splitter.TokenTextSplitter`. Langchain has numerous document loaders that can already be fed into the splitter, so check out the langchain documentation.
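For example, a minimal sketch of that pattern might look like the following; the loader choice, file path, splitter settings, and the `topic_chain` object are illustrative assumptions rather than the script's actual code:

```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import TokenTextSplitter

# Load raw content with any langchain document loader (TextLoader is just an example)
documents = TextLoader("./my_data/notes.txt").load()

# Split into langchain documents, which is the format store_knowledge expects
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=64)
split_documents = splitter.split_documents(documents)

# Hand the split documents to the topic chain for embedding and storage in the
# vectorstore. `topic_chain` is a hypothetical, already-configured TopicChain instance.
# topic_chain.store_knowledge(split_documents)
```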
Install and run service locally:
poetry install
poetry run python run.py
Hit the API:
curl --location 'http://0.0.0.0:8089/ask/' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--data '{"query": "Do you have COVID data?"}'
You can change the port in `run.py` as needed.
Ask for configured topics:
curl --location 'http://0.0.0.0:8089/topics/' \
--header 'Accept: application/json'
Relies on Gen3's Policy Engine.
- For the `/topics` endpoints, requires `read` on `/gen3_discovery_ai/topics`
- For the `/ask` endpoint, requires `read` on `/gen3_discovery_ai/ask/{topic}`
- For the `/_version` endpoint, requires `read` on `/gen3_discovery_ai/service_info/version`
- For the `/_status` endpoint, requires `read` on `/gen3_discovery_ai/service_info/status`
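As a minimal sketch of what an authorized client call might look like (the bearer-token handling here is an assumption; the token's user needs `read` on the relevant `/gen3_discovery_ai/ask/{topic}` resource):

```python
import requests

# Illustrative only: query the default topic with a Gen3 access token attached.
access_token = "REDACTED"  # assumption: a valid Gen3 access token for an authorized user

response = requests.post(
    "http://0.0.0.0:8089/ask/",
    headers={
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    },
    json={"query": "Do you have COVID data?"},
)
print(response.status_code, response.json())
```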
You can `poetry run python run.py` after install to run the app locally.
For testing, you can run `poetry run pytest`.
The default `pytest` options specified in the `pyproject.toml` additionally:
- run coverage and will error if it falls below the threshold
- profile using `pytest-profiling`, which outputs into `/prof`
This quick `clean.sh` script is used to run `isort` and `black` over everything if you don't integrate those with your editor/IDE.
NOTE: This requires the beginning of the setup for using Super Linter locally. You must have the global linter configs in `~/.gen3/.github/.github/linters`. See Gen3's linter setup docs.
`clean.sh` also runs just `pylint` to check Python code for lint.
Here's how you can run it:
./clean.sh
NOTE: GitHub's Super Linter runs more than just `pylint`, so it's worth setting that up locally to run before pushing large changes. See Gen3's linter setup docs for full instructions. Then you can run `pylint` more frequently as you develop.
To build:
docker build -t gen3discoveryai:latest .
To run:
docker run --name gen3discoveryai \
--env-file "./.env" \
-v "$GOOGLE_APPLICATION_CREDENTIALS":"$GOOGLE_APPLICATION_CREDENTIALS" \
-p 8089:8089 \
gen3discoveryai:latest
To exec into a bash shell in running container:
docker exec -it gen3discoveryai bash
To kill and remove running container:
docker kill gen3discoveryai
docker rm gen3discoveryai