-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New workflow to generate embeddings in a single workflow #1296
Conversation
Please remember to rename the new workflow to remove "create_final" so it is more clear that it doesn't have a direct output. |
Please add a test under test_verbs that confirms all selected embeddings tables are created. You can test this by providing a storage instance in the test, and then confirming that all requested tables exist. You can also read any of the tables and assert that they have mock embeddings values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall comment, I don't like that we just remove the embeddings smoke tests. Can you please add tests to check for these outputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Column(s) ['description_embedding'] do not exist"
if embed_column not in input.columns: | ||
msg = f"Column {embed_column} not found in input dataframe with columns {input.columns}" | ||
raise ValueError(msg) | ||
title = title_column or embed_column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While trying to index, embed_text fails due to this change.
title
is re-initialised in the while loop below leading to a key error. Using some other variable name will fix it.
Description
New workflow to separate text embedding generation in a new single workflow to generate new parquet files with just id/embedding columns.
Proposed Changes
New workflow called: create_final_embeddings.
Remove all embeddings configuration in all other workflows.
Standardize text embed configurations for all embeddings.
New standard for parquet embeddings files:
<original_table_name>_<source_column>_embeddings.parquet
with just id/embedding columns.