Artifact cleanup #1341

Merged · 92 commits · Nov 13, 2024
Commits
11e9c9c
Add source documents for verb tests
natoverse Oct 29, 2024
8acc4c2
Remove entity_type erroneous column
natoverse Oct 29, 2024
7bccb21
Add new test data
natoverse Oct 29, 2024
5edc1bb
Remove source/target degree columns
natoverse Oct 29, 2024
cea8d27
Remove top_level_node_id
natoverse Oct 29, 2024
dc2aca3
Remove chunk column configs
natoverse Oct 29, 2024
b9364df
Rename "chunk" to "text"
natoverse Oct 29, 2024
9e4d87b
Rename "chunk" to "text" in base
natoverse Oct 30, 2024
b3089f6
Re-map document input to use base text units
natoverse Oct 30, 2024
92e4673
Merge branch 'main' into artifact-cleanup
natoverse Oct 31, 2024
39d14dc
Revert base text units as final documents dep
natoverse Oct 31, 2024
9e7cd95
Update test data
natoverse Oct 31, 2024
07f4ff9
Split/rename node source_id
natoverse Oct 31, 2024
8787d6c
Drop node size (dup of degree)
natoverse Oct 31, 2024
9906b4b
Drop document_ids from covariates
natoverse Oct 31, 2024
f6d1687
Remove unused document_ids from models
natoverse Oct 31, 2024
54d25af
Remove n_tokens from covariate table
natoverse Oct 31, 2024
61586c3
Fix missed document_ids delete
natoverse Oct 31, 2024
9eaa389
Wire base text units to final documents
natoverse Oct 31, 2024
1b6f0b0
Rename relationship rank as combined_degree
natoverse Oct 31, 2024
2deac6d
Add rank as first-class property to Relationship
natoverse Oct 31, 2024
faf4ad6
Remove split_text operation
natoverse Oct 31, 2024
81f3c1c
Fix relationships test parquet
natoverse Oct 31, 2024
3286ceb
Update test parquets
natoverse Oct 31, 2024
45283b1
Add entity ids to community table
natoverse Oct 31, 2024
86e29a6
Merge branch 'artifact-cleanup' of https://github.com/microsoft/graph…
natoverse Oct 31, 2024
37f9beb
Remove stored graph embedding columns
natoverse Oct 31, 2024
d29b8f4
Format
natoverse Oct 31, 2024
d0e94a5
Semver
natoverse Oct 31, 2024
977b47b
Fix JSON typo
natoverse Oct 31, 2024
faadc6d
Spelling
natoverse Oct 31, 2024
cb5b3e3
Rename lancedb
natoverse Oct 31, 2024
4139f8b
Sort lancedb
natoverse Oct 31, 2024
8a4f3d2
Fix unit test
natoverse Oct 31, 2024
e45fb1e
Merge branch 'main' into artifact-cleanup
natoverse Nov 2, 2024
1f972d9
Fix test to account for changing period
natoverse Nov 2, 2024
b6fb8c6
Update tests for separate embeddings
natoverse Nov 2, 2024
15ce567
Format
natoverse Nov 2, 2024
0e3f3f0
Better assertion printing
natoverse Nov 2, 2024
70f0fce
Fix unit test for windows
natoverse Nov 4, 2024
ec2cadb
Rename document.raw_content -> document.text
natoverse Nov 4, 2024
f8562df
Remove read_documents function
natoverse Nov 4, 2024
204788d
Remove unused document summary from model
natoverse Nov 4, 2024
1de1bc0
Remove unused imports
natoverse Nov 4, 2024
b044b62
Format
natoverse Nov 4, 2024
d50578c
Merge branch 'main' into artifact-cleanup
natoverse Nov 5, 2024
33c5d20
Merge branch 'main' into artifact-cleanup
natoverse Nov 5, 2024
f166d7d
Merge branch 'main' into artifact-cleanup
natoverse Nov 5, 2024
70ac842
Add new snapshots to default init
natoverse Nov 5, 2024
3f50541
Merge branch 'main' into artifact-cleanup
natoverse Nov 5, 2024
f206eb7
Use util to construct embeddings collection name
natoverse Nov 5, 2024
7f3d2e2
Merge branch 'main' into artifact-cleanup
natoverse Nov 6, 2024
6f61137
Merge branch 'main' into artifact-cleanup
natoverse Nov 6, 2024
59e044b
Align inc index model with branch changes
natoverse Nov 7, 2024
1c30f2c
Merge branch 'main' into artifact-cleanup
natoverse Nov 7, 2024
d9c229f
Merge branch 'main' into artifact-cleanup
natoverse Nov 8, 2024
568b978
Update data and tests for int ids
natoverse Nov 8, 2024
d59291a
Clean up embedding locs
natoverse Nov 8, 2024
f115901
Switch entity "name" to "title" for consistency
natoverse Nov 11, 2024
10a3090
Merge branch 'main' into artifact-cleanup
natoverse Nov 11, 2024
ffa09af
Fix short_id -> human_readable_id defaults
natoverse Nov 11, 2024
5fc32ae
Format
natoverse Nov 11, 2024
8886faf
Rework community IDs
natoverse Nov 11, 2024
3af37a3
Fix community size compute
natoverse Nov 11, 2024
bae6e52
Fix unit tests
natoverse Nov 11, 2024
0d9084c
Fix report read
natoverse Nov 11, 2024
054735a
Pare down nodes table output
natoverse Nov 11, 2024
7f2dea9
Fix unit test
natoverse Nov 11, 2024
75925c9
Merge branch 'main' into artifact-cleanup
natoverse Nov 12, 2024
8a5f3ca
Fix merge
natoverse Nov 12, 2024
465fb69
Fix community loading
natoverse Nov 12, 2024
b879e45
Format
natoverse Nov 12, 2024
914d59c
Fix community id report extraction
natoverse Nov 12, 2024
dd08fe6
Update tests
natoverse Nov 12, 2024
f97caf1
Consistent short IDs and ordering
natoverse Nov 12, 2024
d60a408
Update ordering and tests
natoverse Nov 12, 2024
09f7179
Update incremental for new nodes model
natoverse Nov 12, 2024
ef5700b
Guard document columns loc
natoverse Nov 12, 2024
965e47c
Match column ordering
natoverse Nov 12, 2024
effd06b
Fix document guard
natoverse Nov 12, 2024
6f5ab84
Update smoke tests
natoverse Nov 12, 2024
4cdaf71
Fill NA on community extract
natoverse Nov 12, 2024
a8a94d8
Logging for smoke test debug
natoverse Nov 12, 2024
daacd1b
Add parquet schema details doc
natoverse Nov 12, 2024
9083028
Fix community hierarchy guard
natoverse Nov 12, 2024
be47260
Use better empty hierarchy guard
natoverse Nov 13, 2024
8539700
Back-compat shims
natoverse Nov 13, 2024
7c3493a
Semver
natoverse Nov 13, 2024
82d961c
Fix warning
natoverse Nov 13, 2024
9534ab0
Format
natoverse Nov 13, 2024
aec3d9f
Remove default fallback
natoverse Nov 13, 2024
96956fb
Reuse key
natoverse Nov 13, 2024
4 changes: 4 additions & 0 deletions .semversioner/next-release/minor-20241113010525824646.json
@@ -0,0 +1,4 @@
{
"type": "minor",
"description": "Data model changes."
}
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20241031230557819462.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Cleanup of artifact outputs/schemas."
}
4 changes: 2 additions & 2 deletions docs/config/env_vars.md
@@ -9,8 +9,8 @@ If the embedding target is `all`, and you want to only embed a subset of these f
### Embedded Fields

- `text_unit.text`
- `document.raw_content`
- `entity.name`
- `document.text`
- `entity.title`
- `entity.description`
- `relationship.description`
- `community.title`
2 changes: 1 addition & 1 deletion docs/examples_notebooks/drift_search.ipynb
@@ -204,7 +204,7 @@
"# load description embeddings to an in-memory lancedb vectorstore\n",
"# to connect to a remote db, specify url and port values.\n",
"description_embedding_store = LanceDBVectorStore(\n",
" collection_name=\"entity_description_embeddings\",\n",
" collection_name=\"default-entity-description\",\n",
")\n",
"description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
"entity_description_embeddings = store_entity_semantic_embeddings(\n",
2 changes: 1 addition & 1 deletion docs/examples_notebooks/local_search.ipynb
@@ -108,7 +108,7 @@
"# load description embeddings to an in-memory lancedb vectorstore\n",
"# to connect to a remote db, specify url and port values.\n",
"description_embedding_store = LanceDBVectorStore(\n",
" collection_name=\"entity.description\",\n",
" collection_name=\"default-entity-description\",\n",
")\n",
"description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
"entity_description_embeddings = store_entity_semantic_embeddings(\n",
3 changes: 2 additions & 1 deletion docs/index/default_dataflow.md
@@ -9,7 +9,8 @@ The knowledge model is a specification for data outputs that conform to our data
- `Entity` - An entity extracted from a TextUnit. These represent people, places, events, or some other entity-model that you provide.
- `Relationship` - A relationship between two entities. These are generated from the covariates.
- `Covariate` - Extracted claim information, which contains statements about entities which may be time-bound.
- `Community Report` - Once entities are generated, we perform hierarchical community detection on them and generate reports for each community in this hierarchy.
- `Community` - Once the graph of entities and relationships is built, we perform hierarchical community detection on them to create a clustering structure.
- `Community Report` - The contents of each community are summarized into a generated report, useful for human reading and downstream search.
- `Node` - This table contains layout information for rendered graph-views of the Entities and Documents which have been embedded and clustered.

## The Default Configuration Workflow
89 changes: 89 additions & 0 deletions docs/index/outputs.md
@@ -0,0 +1,89 @@
# Outputs

The default pipeline produces a series of output tables that align with the [conceptual knowledge model](../index/default_dataflow.md). This page describes the detailed output table schemas. By default we write these tables out as parquet files on disk.

## Shared fields
All tables have two identifier fields:
- id: str - Generated UUID, ensuring global uniqueness.
- human_readable_id: int - This is an incremented short ID created per-run. For example, we use this short ID with generated summaries that print citations so they are easy to cross-reference visually.

## create_final_communities
This is a list of the final communities generated by Leiden. Communities are strictly hierarchical, subdividing into children as the cluster affinity is narrowed.
- community: int - Leiden-generated cluster ID for the community. Note that these increment with depth, so they are unique through all levels of the community hierarchy. For this table, human_readable_id is a copy of the community ID rather than a plain increment.
- level: int - Depth of the community in the hierarchy.
- title: str - Friendly name of the community.
- entity_ids - List of entities that are members of the community.
- relationship_ids - List of relationships that are wholly within the community (source and target are both in the community).
- text_unit_ids - List of text units represented within the community.
- period - Date of ingest, used for incremental update merges.
- size - Size of the community (entity count), used for incremental update merges.

## create_final_community_reports
This is the list of summarized reports for each community.
- community: int - Short ID of the community this report applies to.
- level: int - Level of the community this report applies to.
- title: str - LM-generated title for the report.
- summary: str - LM-generated summary of the report.
- full_content: str - LM-generated full report.
- rank: float - LM-derived relevance ranking of the report, based on member entity salience.
- rank_explanation: str - LM-derived explanation of the rank.
- findings: dict - LM-derived list of the top 5-10 insights from the community. Contains `summary` and `explanation` values.
- full_content_json - Full JSON output as returned by the LM. Most fields are extracted into columns, but this JSON is sent for query summarization so we leave it to allow for prompt tuning to add fields/content by end users.
- period - Date of ingest, used for incremental update merges.
- size - Size of the community (entity count), used for incremental update merges.

## create_final_covariates
(Optional) If claim extraction is turned on, this is a list of the extracted covariates. Note that claims are typically oriented around identifying malicious behavior such as fraud, so they are not useful for all datasets.
- covariate_type: str - This is always "claim" with our default covariates.
- type: str - Nature of the claim type.
- description: str - LM-generated description of the behavior.
- subject_id: str - Name of the source entity (that is performing the claimed behavior).
- object_id: str - Name of the target entity (that the claimed behavior is performed on).
- status: str [TRUE, FALSE, SUSPECTED] - LM-derived assessment of the correctness of the claim.
- start_date: str (ISO8601) - LM-derived start of the claimed activity.
- end_date: str (ISO8601) - LM-derived end of the claimed activity.
- source_text: str - Short string of text containing the claimed behavior.
- text_unit_id: str - ID of the text unit the claim text was extracted from.

## create_final_documents
List of document content after import.
- title: str - Filename, unless otherwise configured during CSV import.
- text: str - Full text of the document.
- text_unit_ids: str[] - List of text units (chunks) that were parsed from the document.
- attributes: dict (optional) - If specified during CSV import, this is a dict of attributes for the document.

## create_final_entities
List of all entities found in the data by the LM.
- title: str - Name of the entity.
- type: str - Type of the entity. By default this will be "organization", "person", "geo", or "event" unless configured differently or auto-tuning is used.
- description: str - Textual description of the entity. Entities may be found in many text units, so this is an LM-derived summary of all descriptions.
- text_unit_ids: str[] - List of the text units containing the entity.

## create_final_nodes
This is graph-related information for the entities. It contains only information relevant to the graph, such as community membership. There is an entry for each entity at every community level it is found within, so you may see "duplicate" entities.

Note that the ID fields match those in create_final_entities and can be used for joining if additional information about a node is required.
- title: str - Name of the referenced entity. Duplicated from create_final_entities for convenient cross-referencing.
- community: int - Leiden community the node is found within. Entities are not always assigned a community (they may not be close enough to any), so they may have an ID of -1.
- level: int - Level of the community the entity is in.
- degree: int - Node degree (connectedness) in the graph.
- x: float - X position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0.
- y: float - Y position of the node for visual layouts. If graph embeddings and UMAP are not turned on, this will be 0.

## create_final_relationships
List of all entity-to-entity relationships found in the data by the LM. This is also the _edge list_ for the graph.
- source: str - Name of the source entity.
- target: str - Name of the target entity.
- description: str - LM-derived description of the relationship. Also see note for entity descriptions.
- weight: float - Weight of the edge in the graph. This is summed from an LM-derived "strength" measure for each relationship instance.
- combined_degree: int - Sum of source and target node degrees.
- text_unit_ids: str[] - List of text units the relationship was found within.

## create_final_text_units
List of all text chunks parsed from the input documents.
- text: str - Raw full text of the chunk.
- n_tokens: int - Number of tokens in the chunk. This should normally match the `chunk_size` config parameter, except for the last chunk which is often shorter.
- document_ids: str[] - List of document IDs the chunk came from. This is normally only 1 due to our default groupby, but for very short text documents (e.g., microblogs) it can be configured so text units span multiple documents.
- entity_ids: str[] - List of entities found in the text unit.
- relationship_ids: str[] - List of relationships found in the text unit.
- covariate_ids: str[] - Optional list of covariates found in the text unit.
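As a rough illustration of consuming these tables — e.g., the join between create_final_nodes and create_final_entities noted above — here is a sketch using tiny in-memory stand-ins for the parquet files (all values invented; in practice you would `pd.read_parquet` the files from your output directory):

```python
import pandas as pd

# Stand-ins for create_final_entities / create_final_nodes
# (in practice: pd.read_parquet("output/create_final_entities.parquet"), etc.).
entities = pd.DataFrame({
    "id": ["e1", "e2"],
    "title": ["ACME", "Bob"],
    "type": ["organization", "person"],
    "description": ["A company.", "A person."],
})
nodes = pd.DataFrame({
    "id": ["e1", "e1", "e2"],  # one row per entity per community level
    "title": ["ACME", "ACME", "Bob"],
    "community": [0, 3, 0],
    "level": [0, 1, 0],
    "degree": [2, 2, 1],
})

# The shared `id` field links layout rows back to full entity details.
graph = nodes.merge(entities[["id", "type", "description"]], on="id", how="left")
```

Because nodes repeat per community level, the join simply duplicates the entity attributes onto each layout row.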
@@ -299,7 +299,7 @@
"entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)\n",
"\n",
"description_embedding_store = LanceDBVectorStore(\n",
" collection_name=\"entity.description\",\n",
" collection_name=\"default-entity-description\",\n",
")\n",
"description_embedding_store.connect(db_uri=LANCEDB_URI)\n",
"entity_description_embeddings = store_entity_semantic_embeddings(\n",
Expand Down
34 changes: 20 additions & 14 deletions graphrag/api/query.py
@@ -25,6 +25,10 @@
from pydantic import validate_call

from graphrag.config import GraphRagConfig
from graphrag.index.config.embeddings import (
community_full_content_embedding,
entity_description_embedding,
)
from graphrag.logging import PrintProgressReporter
from graphrag.query.factories import (
get_drift_search_engine,
@@ -42,6 +46,7 @@
)
from graphrag.query.structured_search.base import SearchResult # noqa: TCH001
from graphrag.utils.cli import redact
from graphrag.utils.embeddings import create_collection_name
from graphrag.vector_stores import VectorStoreFactory, VectorStoreType
from graphrag.vector_stores.base import BaseVectorStore

@@ -228,7 +233,7 @@ async def local_search(

description_embedding_store = _get_embedding_store(
config_args=vector_store_args, # type: ignore
container_suffix="entity-description",
embedding_name=entity_description_embedding,
)

_entities = read_indexer_entities(nodes, entities, community_level)
@@ -302,7 +307,7 @@ async def local_search_streaming(

description_embedding_store = _get_embedding_store(
config_args=vector_store_args, # type: ignore
container_suffix="entity-description",
embedding_name=entity_description_embedding,
)

_entities = read_indexer_entities(nodes, entities, community_level)
@@ -385,12 +390,12 @@ async def drift_search(

description_embedding_store = _get_embedding_store(
config_args=vector_store_args, # type: ignore
container_suffix="entity-description",
embedding_name=entity_description_embedding,
)

full_content_embedding_store = _get_embedding_store(
config_args=vector_store_args, # type: ignore
container_suffix="community-full_content",
embedding_name=community_full_content_embedding,
)

_entities = read_indexer_entities(nodes, entities, community_level)
@@ -450,7 +455,10 @@ def _patch_vector_store(
}
description_embedding_store = LanceDBVectorStore(
db_uri=config.embeddings.vector_store["db_uri"],
collection_name="default-entity-description",
collection_name=create_collection_name(
config.embeddings.vector_store["container_name"],
entity_description_embedding,
),
overwrite=config.embeddings.vector_store["overwrite"],
)
description_embedding_store.connect(
@@ -469,11 +477,7 @@
from graphrag.vector_stores.lancedb import LanceDBVectorStore

community_reports = with_reports
collection_name = (
config.embeddings.vector_store.get("container_name", "default")
if config.embeddings.vector_store
else "default"
)
container_name = config.embeddings.vector_store["container_name"]
# Store report embeddings
_reports = read_indexer_reports(
community_reports,
@@ -485,7 +489,9 @@

full_content_embedding_store = LanceDBVectorStore(
db_uri=config.embeddings.vector_store["db_uri"],
collection_name=f"{collection_name}-community-full_content",
collection_name=create_collection_name(
container_name, community_full_content_embedding
),
overwrite=config.embeddings.vector_store["overwrite"],
)
full_content_embedding_store.connect(
@@ -501,12 +507,12 @@

def _get_embedding_store(
config_args: dict,
container_suffix: str,
embedding_name: str,
) -> BaseVectorStore:
"""Get the embedding description store."""
vector_store_type = config_args["type"]
collection_name = (
f"{config_args.get('container_name', 'default')}-{container_suffix}"
collection_name = create_collection_name(
config_args.get("container_name", "default"), embedding_name
)
embedding_store = VectorStoreFactory.get_vector_store(
vector_store_type=vector_store_type,
8 changes: 4 additions & 4 deletions graphrag/index/config/__init__.py
@@ -16,9 +16,9 @@
community_full_content_embedding,
community_summary_embedding,
community_title_embedding,
document_raw_content_embedding,
document_text_embedding,
entity_description_embedding,
entity_name_embedding,
entity_title_embedding,
relationship_description_embedding,
required_embeddings,
text_unit_text_embedding,
@@ -82,9 +82,9 @@
"community_full_content_embedding",
"community_summary_embedding",
"community_title_embedding",
"document_raw_content_embedding",
"document_text_embedding",
"entity_description_embedding",
"entity_name_embedding",
"entity_title_embedding",
"relationship_description_embedding",
"required_embeddings",
"text_unit_text_embedding",
8 changes: 4 additions & 4 deletions graphrag/index/config/embeddings.py
@@ -3,20 +3,20 @@

"""A module containing embeddings values."""

entity_name_embedding = "entity.name"
entity_title_embedding = "entity.title"
entity_description_embedding = "entity.description"
relationship_description_embedding = "relationship.description"
document_raw_content_embedding = "document.raw_content"
document_text_embedding = "document.text"
community_title_embedding = "community.title"
community_summary_embedding = "community.summary"
community_full_content_embedding = "community.full_content"
text_unit_text_embedding = "text_unit.text"

all_embeddings: set[str] = {
entity_name_embedding,
entity_title_embedding,
entity_description_embedding,
relationship_description_embedding,
document_raw_content_embedding,
document_text_embedding,
community_title_embedding,
community_summary_embedding,
community_full_content_embedding,
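The renamed embedding keys above combine with a container name to form vector store collection names such as `default-entity-description` (as seen in the updated notebooks). A sketch of that convention — the real helper is `create_collection_name` in `graphrag.utils.embeddings`; this reimplementation is an assumption inferred from the old f-string and the new collection names:

```python
def create_collection_name(container_name: str, embedding_name: str) -> str:
    """Sketch: build a collection name like 'default-entity-description'.

    `embedding_name` is a dotted key such as 'entity.description'; dots are
    assumed to become dashes so the container prefix and embedding id form
    one flat name.
    """
    return f"{container_name}-{embedding_name.replace('.', '-')}"

# e.g. the default container with the entity description embedding:
name = create_collection_name("default", "entity.description")
```

This keeps the naming logic in one place instead of scattering `container_suffix` strings across call sites, which is what the query.py changes above accomplish.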
6 changes: 2 additions & 4 deletions graphrag/index/flows/create_base_entity_graph.py
@@ -30,8 +30,6 @@ async def create_base_entity_graph(
callbacks: VerbCallbacks,
cache: PipelineCache,
storage: PipelineStorage,
text_column: str,
id_column: str,
clustering_strategy: dict[str, Any],
extraction_strategy: dict[str, Any] | None = None,
extraction_num_threads: int = 4,
Expand All @@ -52,8 +50,8 @@ async def create_base_entity_graph(
text_units,
callbacks,
cache,
text_column=text_column,
id_column=id_column,
text_column="text",
id_column="id",
strategy=extraction_strategy,
async_mode=extraction_async_mode,
entity_types=entity_types,
19 changes: 7 additions & 12 deletions graphrag/index/flows/create_base_text_units.py
@@ -24,8 +24,6 @@ async def create_base_text_units(
documents: pd.DataFrame,
callbacks: VerbCallbacks,
storage: PipelineStorage,
chunk_column_name: str,
n_tokens_column_name: str,
chunk_by_columns: list[str],
chunk_strategy: dict[str, Any] | None = None,
snapshot_transient_enabled: bool = False,
@@ -65,21 +63,18 @@
chunked = chunked.explode("chunks")
chunked.rename(
columns={
"chunks": chunk_column_name,
"chunks": "chunk",
},
inplace=True,
)
chunked["chunk_id"] = chunked.apply(
lambda row: gen_md5_hash(row, [chunk_column_name]), axis=1
chunked["id"] = chunked.apply(lambda row: gen_md5_hash(row, ["chunk"]), axis=1)
chunked[["document_ids", "chunk", "n_tokens"]] = pd.DataFrame(
chunked["chunk"].tolist(), index=chunked.index
)
chunked[["document_ids", chunk_column_name, n_tokens_column_name]] = pd.DataFrame(
chunked[chunk_column_name].tolist(), index=chunked.index
)
chunked["id"] = chunked["chunk_id"]
# rename for downstream consumption
chunked.rename(columns={"chunk": "text"}, inplace=True)

output = cast(
pd.DataFrame, chunked[chunked[chunk_column_name].notna()].reset_index(drop=True)
)
output = cast(pd.DataFrame, chunked[chunked["text"].notna()].reset_index(drop=True))

if snapshot_transient_enabled:
await snapshot(
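The hard-coded chunk handling in this diff can be sketched end-to-end. This is an assumption-laden stand-in: the real `gen_md5_hash` lives in graphrag's utilities and is only approximated here, and the tuple layout `(document_ids, text, n_tokens)` is inferred from the column split above:

```python
import hashlib

import pandas as pd


def gen_md5_hash(row: pd.Series, columns: list[str]) -> str:
    # Stand-in for graphrag's gen_md5_hash: hash the named column values.
    joined = "".join(str(row[col]) for col in columns)
    return hashlib.md5(joined.encode()).hexdigest()


# Each aggregate row holds a list of (document_ids, chunk_text, n_tokens) tuples.
chunked = pd.DataFrame({
    "chunks": [[(["doc1"], "hello world", 2), (["doc1"], "more text", 2)]],
})
chunked = chunked.explode("chunks").reset_index(drop=True)
chunked = chunked.rename(columns={"chunks": "chunk"})

# Deterministic id per chunk, computed before the tuple is split out.
chunked["id"] = chunked.apply(lambda row: gen_md5_hash(row, ["chunk"]), axis=1)
chunked[["document_ids", "chunk", "n_tokens"]] = pd.DataFrame(
    chunked["chunk"].tolist(), index=chunked.index
)

# Rename for downstream consumption and drop empty chunks.
chunked = chunked.rename(columns={"chunk": "text"})
output = chunked[chunked["text"].notna()].reset_index(drop=True)
```

The point of the change is that `"chunk"`, `"text"`, and `"id"` are now fixed names rather than configurable `chunk_column_name`/`n_tokens_column_name` parameters, which is what lets the schema docs above describe a stable output.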