Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New workflow to generate embeddings in a single workflow #1296

Merged
merged 56 commits into from
Nov 1, 2024
Merged

Conversation

gaudyb
Copy link
Contributor

@gaudyb gaudyb commented Oct 18, 2024

Description

New workflow to separate text embedding generation in a new single workflow to generate new parquet files with just id/embedding columns.

Proposed Changes

New workflow called: create_final_embeddings.
Remove all embeddings configuration in all other workflows.
Standardize text embed configurations for all embeddings.
New standard for parquet embeddings files: <original_table_name>_<source_column>_embeddings.parquet with just id/embedding columns.

@gaudyb gaudyb requested review from a team as code owners October 18, 2024 20:34
graphrag/config/enums.py Show resolved Hide resolved
graphrag/index/flows/create_final_embeddings.py Outdated Show resolved Hide resolved
graphrag/index/flows/create_final_embeddings.py Outdated Show resolved Hide resolved
graphrag/index/run/profiling.py Outdated Show resolved Hide resolved
graphrag/index/workflows/v1/create_final_entities.py Outdated Show resolved Hide resolved
graphrag/index/workflows/v1/create_final_relationships.py Outdated Show resolved Hide resolved
@natoverse
Copy link
Collaborator

Please remember to rename the new workflow to remove "create_final" so it is more clear that it doesn't have a direct output.

@natoverse
Copy link
Collaborator

Please add a test under test_verbs that confirms all selected embeddings tables are created. You can test this by providing a storage instance in the test, and then confirming that all requested tables exist. You can also read any of the tables and assert that they have mock embeddings values.

Copy link
Contributor

@AlonsoGuevara AlonsoGuevara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall comment, I don't like that we just remove the embeddings smoke tests. Can you please add tests to check for these outputs?

@gaudyb gaudyb requested a review from natoverse October 31, 2024 18:02
graphrag/config/defaults.py Outdated Show resolved Hide resolved
graphrag/index/create_pipeline_config.py Outdated Show resolved Hide resolved
graphrag/index/flows/generate_text_embeddings.py Outdated Show resolved Hide resolved
graphrag/index/workflows/v1/create_final_entities.py Outdated Show resolved Hide resolved
graphrag/index/workflows/v1/create_final_relationships.py Outdated Show resolved Hide resolved
natoverse
natoverse previously approved these changes Oct 31, 2024
@natoverse natoverse merged commit 17658c5 into main Nov 1, 2024
15 checks passed
@natoverse natoverse deleted the new_workflow branch November 1, 2024 22:01
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Column(s) ['description_embedding'] do not exist"

if embed_column not in input.columns:
msg = f"Column {embed_column} not found in input dataframe with columns {input.columns}"
raise ValueError(msg)
title = title_column or embed_column

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While trying to index, embed_text fails due to this change.
title is re-initialised in the while loop below leading to a key error. Using some other variable name will fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants