Public Preview Refresh Add MLIndex and DataIndex examples and documentation. (#2624)

* Public Preview Refresh Add MLIndex and DataIndex examples and documentation.
* Rename chat-with-index internal code to src and apply various black formatting fixes.
* Rename pup_refresh to code_first.
* Remove artifacts produced by local examples.
* Address comments.

---------

Co-authored-by: Lucas Pickup <[email protected]>
Azure · Sep 8, 2023 · 06786a0
1 parent fbfe7fc
commit 06786a0
Showing 47 changed files with 1,912 additions and 0 deletions.
diff --git a/sdk/python/generative-ai/rag/code_first/README.md b/sdk/python/generative-ai/rag/code_first/README.md
@@ -0,0 +1,87 @@
+# AzureML MLIndex Asset creation
+
+MLIndex assets in AzureML represent a model used to generate embeddings from text and an index which can be searched using embedding vectors.
+Read more about their structure [here](./docs/mlindex.md).
+
+## Pre-requisites
+
+0. Install `azure-ai-ml` and `azureml-rag`:
+    - `pip install 'azure-ai-ml==1.10.0a20230825006' --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/`
+    - `pip install -U 'azureml-rag[document_parsing,faiss,cognitive_search]>=0.2.0'`
+1. You have unstructured data.
+    - In one of [AzureML's supported data sources](https://learn.microsoft.com/azure/machine-learning/concept-data?view=azureml-api-2): Blob, ADLSgen2, OneLake, S3, Git
+    - In any of these supported file formats: md, txt, py, pdf, ppt(x), doc(x)
+2. You have an embedding model.
+    - [Create an Azure OpenAI service + connection](https://learn.microsoft.com/azure/machine-learning/prompt-flow/concept-connections?view=azureml-api-2)
+    - Use a HuggingFace `sentence-transformers` model (works without extra setup; using the resulting MLIndex in PromptFlow requires a [Custom Runtime](https://promptflow.azurewebsites.net/how-to-guides/how-to-customize-environment-runtime.html))
+3. You have an index to ingest data into.
+    - [Create an Azure Cognitive Search service + connection](https://learn.microsoft.com/azure/machine-learning/prompt-flow/concept-connections?view=azureml-api-2)
+    - Use a Faiss index (works without extra setup)
+
+## Let's Ingest and Index
+
+A DataIndex job is configured using the `azure-ai-ml` Python SDK/CLI, either directly in code or with a YAML file.
+
+### SDK
+
+The examples are runnable as Python scripts, assuming the pre-requisites have been acquired and configured in the script.  
+Opening them in VS Code lets you execute each block below a `# %%` comment like a Jupyter notebook cell.
+
+#### Cloud Creation
+
+##### Process this documentation using Azure OpenAI and Azure Cognitive Search
+
+- [local_docs_to_acs_mlindex.py](./data_index_job/local_docs_to_acs_mlindex.py)
+
+##### Index data from S3 using OneLake
+
+- [s3_to_acs_mlindex.py](./data_index_job/s3_to_acs_mlindex.py)
+- [scheduled_s3_to_asc_mlindex.py](./data_index_job/scheduled_s3_to_asc_mlindex.py)
+
+##### Ingest Azure Search docs from GitHub into a Faiss Index
+
+- [cog_search_docs_faiss_mlindex.py](./data_index_job/cog_search_docs_faiss_mlindex.py)
+
+#### Local Creation
+
+##### Process this documentation using Azure OpenAI and Azure Cognitive Search
+
+- [local_docs_to_acs_aoai_mlindex.py](./mlindex_local/local_docs_to_acs_aoai_mlindex.py)
+
+##### Process this documentation using SentenceTransformers and Faiss
+
+- [local_docs_to_faiss_mlindex.py](./mlindex_local/local_docs_to_faiss_mlindex.py)
+- [local_docs_to_faiss_mlindex_with_promptflow.py](./mlindex_local/local_docs_to_faiss_mlindex_with_promptflow.py)
+    - Learn more about [Promptflow here](https://microsoft.github.io/promptflow/)
+
+##### Use Langchain Documents to create an Index
+
+- [langchain_docs_to_mlindex.py](./mlindex_local/langchain_docs_to_mlindex.py)
+
+## Using the MLIndex asset
+
+More information about how to use MLIndex in various places [here]().
+
+## Appendix
+
+### Which Embeddings Model to use?
+
+There are currently two supported Embedding options: OpenAI's `text-embedding-ada-002` embedding model or HuggingFace embedding models. Here are some factors that might influence your decision:
+
+#### OpenAI
+
+OpenAI has [great documentation](https://platform.openai.com/docs/guides/embeddings) on their Embeddings model `text-embedding-ada-002`. It can handle up to 8191 tokens and can be accessed using [Azure OpenAI](https://learn.microsoft.com/azure/cognitive-services/openai/concepts/models#embeddings-models) or OpenAI directly.
+If you have an existing Azure OpenAI instance, you can connect it to AzureML; if you don't, AzureML provisions a default one for you called `Default_AzureOpenAI`.
+The main limitation when using `text-embedding-ada-002` is the cost and quota available for the model. Otherwise, it provides high-quality embeddings across a wide array of text domains while being simple to use.
+
+#### HuggingFace
+
+HuggingFace hosts many different models capable of embedding text into single-dimensional vectors. The [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) ranks the performance of embedding models on a few axes. Not all ranked models can be run locally (e.g. `text-embedding-ada-002` is on the list), though many can, and they range from small to large. When embedding with HuggingFace, the model is loaded locally for inference, which may affect your choice of compute resources.
+
+**NOTE:** The default PromptFlow Runtime does not come with HuggingFace model dependencies installed, so indexes created using HuggingFace embeddings will not work in PromptFlow by default. **Pick OpenAI if you want to use PromptFlow.**
+
+### Setting up OneLake and S3
+
+[Create a lakehouse with OneLake](https://learn.microsoft.com/fabric/onelake/create-lakehouse-onelake)
+
+[Set up a shortcut to S3](https://learn.microsoft.com/fabric/onelake/create-s3-shortcut)
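
The README's opening description of an MLIndex (an embedding model plus an index that is searched with embedding vectors) can be illustrated with a toy, dependency-free sketch. Everything here is hypothetical: `embed` is a bag-of-words stand-in for a real embedding model, and `ToyIndex` is not part of `azureml-rag`; it only mimics the embed-then-rank idea.

```python
import math
from collections import Counter

# Hypothetical stand-in for an embedding model: a bag-of-words count vector
# over a fixed vocabulary. A real MLIndex would call Azure OpenAI or a
# HuggingFace model instead.
VOCAB = ["faiss", "index", "search", "embedding", "s3", "onelake"]


def embed(text: str) -> list[float]:
    """Map text to a fixed-length vector of vocabulary word counts."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Keeps (vector, text) pairs and ranks them by cosine similarity."""

    def __init__(self) -> None:
        self._rows: list[tuple[list[float], str]] = []

    def add(self, text: str) -> None:
        self._rows.append((embed(text), text))

    def search(self, query: str, k: int = 1) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(
            self._rows, key=lambda row: cosine(query_vec, row[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]


index = ToyIndex()
index.add("faiss index stored on disk")
index.add("s3 data exposed through onelake")
index.add("embedding model for search")
print(index.search("onelake s3 shortcut", k=1))
```

The same embedding function is used for both documents and queries, which is why an MLIndex records the embedding model alongside the index itself.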
diff --git a/sdk/python/generative-ai/rag/code_first/data_index_job/cog_search_docs_faiss_mlindex.py b/sdk/python/generative-ai/rag/code_first/data_index_job/cog_search_docs_faiss_mlindex.py
@@ -0,0 +1,173 @@
+# %%[markdown]
+# # Azure Cognitive Search docs from GitHub to Faiss Index
+
+# %% Prerequisites
+# %pip install 'azure-ai-ml==1.10.0a20230825006' --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/
+# %pip install 'azureml-rag[faiss]>=0.2.0'
+# %pip install 'promptflow[azure]' promptflow-tools promptflow-vectordb
+
+# %% Authenticate to your AzureML Workspace, download a `config.json` from the top right hand corner menu of the Workspace.
+from azure.ai.ml import MLClient
+from azure.identity import DefaultAzureCredential
+
+ml_client = MLClient.from_config(
+    credential=DefaultAzureCredential(), path="config.json"
+)
+
+# %% Create DataIndex configuration
+from azureml.rag.dataindex.entities import (
+    Data,
+    DataIndex,
+    IndexSource,
+    CitationRegex,
+    Embedding,
+    IndexStore,
+)
+
+asset_name = "azure_search_docs_aoai_faiss"
+
+data_index = DataIndex(
+    name=asset_name,
+    description="Azure Cognitive Search docs embedded with text-embedding-ada-002 and indexed in a Faiss Index.",
+    source=IndexSource(
+        input_data=Data(
+            type="uri_folder",
+            path="<This will be replaced later>",
+        ),
+        input_glob="articles/search/**/*",
+        citation_url="https://learn.microsoft.com/en-us/azure",
+        # Remove articles from the final citation url and remove the file extension so url points to hosted docs, not GitHub.
+        citation_url_replacement_regex=CitationRegex(
+            match_pattern="(.*)/articles/(.*)(\\.[^.]+)$", replacement_pattern="\\1/\\2"
+        ),
+    ),
+    embedding=Embedding(
+        model="text-embedding-ada-002",
+        connection="azureml-rag-oai",
+        cache_path=f"azureml://datastores/workspaceblobstore/paths/embeddings_cache/{asset_name}",
+    ),
+    index=IndexStore(type="faiss"),
+    # name is replaced with a unique value each time the job is run
+    path=f"azureml://datastores/workspaceblobstore/paths/indexes/{asset_name}/{{name}}",
+)
+
+# %% Use git_clone Component to clone Azure Search docs from GitHub
+ml_registry = MLClient(credential=ml_client._credential, registry_name="azureml")
+
+git_clone_component = ml_registry.components.get("llm_rag_git_clone", label="latest")
+
+# %% Clone Git Repo and use as input to index_job
+from azure.ai.ml.dsl import pipeline
+from azureml.rag.dataindex.data_index import index_data
+
+
+@pipeline(default_compute="serverless")
+def git_to_faiss(
+    git_url,
+    branch_name="",
+    git_connection_id="",
+):
+    git_clone = git_clone_component(git_repository=git_url, branch_name=branch_name)
+    git_clone.environment_variables[
+        "AZUREML_WORKSPACE_CONNECTION_ID_GIT"
+    ] = git_connection_id
+
+    index_job = index_data(
+        description=data_index.description,
+        data_index=data_index,
+        input_data_override=git_clone.outputs.output_data,
+        ml_client=ml_client,
+    )
+
+    return index_job.outputs
+
+
+# %%
+git_index_job = git_to_faiss("https://github.com/MicrosoftDocs/azure-docs.git")
+# Force the repo to be re-cloned on every run to get the latest docs; comment this out to reuse the first clone.
+git_index_job.settings.force_rerun = True
+
+# %% Submit the DataIndex Job
+git_index_run = ml_client.jobs.create_or_update(
+    git_index_job,
+    experiment_name=asset_name,
+)
+git_index_run
+
+# %% Wait for it to finish
+ml_client.jobs.stream(git_index_run.name)
+
+# %% Check the created asset: it is a folder on storage containing an MLIndex yaml file
+mlindex_docs_index_asset = ml_client.data.get(asset_name, label="latest")
+mlindex_docs_index_asset
+
+# %% Try it out with langchain by loading the MLIndex asset using the azureml-rag SDK
+from azureml.rag.mlindex import MLIndex
+
+mlindex = MLIndex(mlindex_docs_index_asset)
+
+index = mlindex.as_langchain_vectorstore()
+docs = index.similarity_search("How can I enable Semantic Search on my Index?", k=5)
+docs
+
+# %% Take a look at those chunked docs
+import json
+
+for doc in docs:
+    print(json.dumps({"content": doc.page_content, **doc.metadata}, indent=2))
+
+# %% Try it out with Promptflow
+
+import promptflow
+
+pf = promptflow.PFClient()
+
+# %% List all the available connections
+for c in pf.connections.list():
+    print(c.name + " (" + c.type + ")")
+
+# %% Load index qna flow
+from pathlib import Path
+
+flow_path = Path.cwd().parent / "flows" / "bring_your_own_data_chat_qna"
+mlindex_path = mlindex_docs_index_asset.path
+
+# %% Put MLIndex uri into Vector DB Lookup tool inputs in [bring_your_own_data_chat_qna/flow.dag.yaml](../flows/bring_your_own_data_chat_qna/flow.dag.yaml)
+import re
+
+with open(flow_path / "flow.dag.yaml", "r") as f:
+    flow_yaml = f.read()
+    flow_yaml = re.sub(  # flags must be a keyword: the 4th positional parameter is `count`
+        r"path: (.*)# Index uri", f"path: {mlindex_path} # Index uri", flow_yaml, flags=re.M
+    )
+with open(flow_path / "flow.dag.yaml", "w") as f:
+    f.write(flow_yaml)
+
+# %% Run qna flow
+output = pf.flows.test(
+    flow_path,
+    inputs={
+        "chat_history": [],
+        "chat_input": "How recently was Vector Search support added to Azure Cognitive Search?",
+    },
+)
+
+chat_output = output["chat_output"]
+for part in chat_output:
+    print(part, end="")
+
+# %% Run qna flow with multiple inputs
+data_path = Path.cwd().parent / "flows" / "data" / "azure_search_docs_questions.jsonl"
+
+column_mapping = {
+    "chat_history": "${data.chat_history}",
+    "chat_input": "${data.chat_input}",
+    "chat_output": "${data.chat_output}",
+}
+run = pf.run(flow=flow_path, data=data_path, column_mapping=column_mapping)
+pf.stream(run)
+
+print(f"{run}")
+
+
+# %%
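
Two regular-expression steps in the script above can be tried in isolation. The sketch below reuses the `CitationRegex` patterns and the `# Index uri` substitution pattern from the script; the input URL and YAML snippet are made up, and it is an assumption that the raw citation URL is `citation_url` joined with the file's relative path. The helper names `to_hosted_url` and `set_index_uri` are hypothetical.

```python
import re

# Pattern and replacement copied from the CitationRegex in the script.
MATCH_PATTERN = r"(.*)/articles/(.*)(\.[^.]+)$"
REPLACEMENT_PATTERN = r"\1/\2"


def to_hosted_url(citation_url: str) -> str:
    """Drop the 'articles' path segment and the trailing file extension."""
    return re.sub(MATCH_PATTERN, REPLACEMENT_PATTERN, citation_url)


def set_index_uri(flow_yaml: str, mlindex_uri: str) -> str:
    """Rewrite the `path:` entry that carries the '# Index uri' marker."""
    # A function replacement keeps backslashes in the URI from being read as
    # group references. Flags are passed by keyword because the fourth
    # positional parameter of re.sub is `count`.
    return re.sub(
        r"path: (.*)# Index uri",
        lambda _match: f"path: {mlindex_uri} # Index uri",
        flow_yaml,
        flags=re.M,
    )


# Hypothetical inputs, for illustration only.
raw_url = (
    "https://learn.microsoft.com/en-us/azure"
    "/articles/search/search-what-is-azure-search.md"
)
print(to_hosted_url(raw_url))

sample_yaml = "inputs:\n  path: <placeholder> # Index uri\n  top_k: 2\n"
print(set_index_uri(sample_yaml, "azureml://datastores/example/paths/indexes/demo"))
```

Because both `(.*)` groups are greedy, the first group captures everything up to the last `/articles/` segment and the final group captures only the last extension, so dots in the host name are left alone.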
diff --git a/sdk/python/generative-ai/rag/code_first/data_index_job/local_docs_to_acs_mlindex.py b/sdk/python/generative-ai/rag/code_first/data_index_job/local_docs_to_acs_mlindex.py
@@ -0,0 +1,45 @@
+# %%[markdown]
+# # Local Documents to Azure Cognitive Search Index
+
+# %% Prerequisites
+# %pip install 'azure-ai-ml==1.10.0a20230825006' --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/
+# %pip install 'azureml-rag[cognitive_search]>=0.2.0'
+
+# %% Authenticate to your AzureML Workspace, download a `config.json` from the top right hand corner menu of the Workspace.
+from azure.ai.ml import MLClient, load_data
+from azure.identity import DefaultAzureCredential
+
+ml_client = MLClient.from_config(
+    credential=DefaultAzureCredential(), path="config.json"
+)
+
+# %% Load DataIndex configuration from file
+data_index = load_data("local_docs_to_acs_mlindex.yaml")
+print(data_index)
+
+# %% Submit the DataIndex Job
+index_job = ml_client.data.index_data(data_index=data_index)
+
+# %% Wait for it to finish
+ml_client.jobs.stream(index_job.name)
+
+# %% Check the created asset: it is a folder on storage containing an MLIndex yaml file
+mlindex_docs_index_asset = ml_client.data.get(data_index.name, label="latest")
+mlindex_docs_index_asset
+
+# %% Try it out with langchain by loading the MLIndex asset using the azureml-rag SDK
+from azureml.rag.mlindex import MLIndex
+
+mlindex = MLIndex(mlindex_docs_index_asset)
+
+index = mlindex.as_langchain_vectorstore()
+docs = index.similarity_search("What is an MLIndex?", k=5)
+docs
+
+# %% Take a look at those chunked docs
+import json
+
+for doc in docs:
+    print(json.dumps({"content": doc.page_content, **doc.metadata}, indent=2))
+
+# %% Try it out with Promptflow
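
The "Take a look at those chunked docs" cell can be tried without a workspace by swapping in a stand-in for the returned documents. This is only a sketch: `FakeDocument` is hypothetical and mimics just the two attributes the cell reads (`page_content` and `metadata`), and the metadata keys are invented.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the two attributes the script reads."""

    page_content: str
    metadata: dict = field(default_factory=dict)


docs = [
    FakeDocument(
        page_content="MLIndex assets represent an embedding model and an index.",
        metadata={"source": "README.md", "chunk": 0},  # made-up metadata keys
    )
]

for doc in docs:
    # Same expression as in the script: the chunk text is merged with its metadata.
    record = {"content": doc.page_content, **doc.metadata}
    print(json.dumps(record, indent=2))
```

Since later keys win in a dict literal, a metadata entry that happened to be named `content` would replace the chunk text in the printed record.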
diff --git a/sdk/python/generative-ai/rag/code_first/data_index_job/local_docs_to_acs_mlindex.yaml b/sdk/python/generative-ai/rag/code_first/data_index_job/local_docs_to_acs_mlindex.yaml
@@ -0,0 +1,23 @@
+$schema: http://azureml/sdk-2-0/DataIndex.json
+type: uri_folder
+name: mlindex_docs_aoai_acs
+description: MLIndex docs and examples embedded with text-embedding-ada-002 and indexed in Azure Cognitive Search.
+
+source:
+    input_data:
+      type: uri_folder
+      path: ../
+    chunk_size: 200
+    citation_url: 'https://github.com/Azure/azureml-examples/tree/main/sdk/python/generative-ai/rag/code_first'
+
+embedding:
+    model: azure_open_ai://deployment/text-embedding-ada-002/model/text-embedding-ada-002
+    connection: azureml-rag-oai
+    cache_path: azureml://datastores/workspaceblobstore/paths/embeddings_cache/mlindex_docs_aoai_acs
+
+index:
+    type: acs
+    connection: azureml:azureml-rag-acs
+    name: mlindex_docs_aoai
+
+path: azureml://datastores/workspaceblobstore/paths/indexes/mlindex_docs_aoai_acs/{name}
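
The `{name}` segment in the output `path` is a placeholder that, per the comment in `cog_search_docs_faiss_mlindex.py`, is replaced with a unique value on each run. In the Python version the same placeholder is written as `{{name}}` inside an f-string. The sketch below shows that escaping; the substitution step at the end is an assumption for illustration (`job_name` is made up, and `str.replace` only stands in for whatever the service actually does).

```python
asset_name = "mlindex_docs_aoai_acs"

# In an f-string, doubled braces produce literal braces, so `{name}` survives
# as a placeholder while `{asset_name}` is interpolated immediately.
path_template = (
    f"azureml://datastores/workspaceblobstore/paths/indexes/{asset_name}/{{name}}"
)
print(path_template)

# Hypothetical substitution step with a made-up job name.
job_name = "example_job_0001"
resolved_path = path_template.replace("{name}", job_name)
print(resolved_path)
```

Writing a single `{name}` in the f-string would instead raise a `NameError` (or silently interpolate an unrelated variable called `name`), which is why the Python script doubles the braces while the YAML file does not.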