Automatic metadata generation using genAI #1599

dlpzx · 2024-10-01T10:12:29Z

Problem statement

Is your feature request related to a problem? Please describe.
Current metadata creation processes in data.all are manual and time-consuming, leading to incomplete, inconsistent, and outdated metadata. Inconsistency in metadata across datasets makes it difficult to understand and compare the information. Incomplete metadata reduces the value and usability of the data, while outdated metadata can hinder the ability to properly utilize the datasets. Additionally, the quality of manual metadata can vary significantly from dataset to dataset, depending on the data producer's expertise and available time and resources. Crucially, the burden of this undifferentiated heavy lifting falls on data producers, who must spend valuable time and resources on manual metadata creation instead of focusing on their core business problems.

The automated metadata recommendation feature can address these challenges. By leveraging GenAI techniques, the metadata recommendation process can be streamlined, standardized, and kept up to date. The feature targets the inconsistent, incomplete, and outdated metadata that results from manual approaches. It aims to improve metadata quality and consistency across data.all, while freeing producers to focus on their core competencies.

User Stories

Describe the solution you'd like

US1.

As a Data Producer, I want automated metadata recommendation for data.all datasets, including but not limited to dataset description, tags, topics, table description and column description, so that I can ensure datasets are discoverable and well-documented without manual effort.

Acceptance Criteria

When a data producer creates or imports a new data.all dataset, the user can automatically generate and display relevant metadata, including dataset description, topics, tags, table descriptions and column descriptions.
Data producers can use automated metadata recommendation on existing datasets (backward compatibility) and republish the dataset without any extra steps.

US2.

As a data producer, I want the ability to run the automated metadata recommendation feature on demand, so that I can keep the data catalog information up-to-date as my data assets evolve.

Acceptance Criteria:

Data.all provides a one-click interface for data producers to initiate on-demand automated metadata recommendation and updating for a selected data.all dataset.
The updated metadata is reflected in the data.all catalog, allowing for user review and acceptance of the changes before they are persisted.

US3.

As a Data Producer, I want the ability to review, edit, and annotate automatically recommended metadata, so that I can ensure its accuracy and relevance while leveraging the automated process.

Acceptance Criteria:

Users can review and manually edit the AI-generated metadata before accepting it, with the accepted changes reflected in the metadata view of the data.all datasets.

US4.

As a Data Consumer, I want to use advanced search and filtering options based on enriched metadata to find relevant datasets quickly and efficiently.

Acceptance Criteria:

The interface allows for searching and filtering based on various metadata attributes

US5.

As a data.all developer and maintainer, I want the automated metadata recommendation feature to be secure and respect data governance access permissions.

Acceptance Criteria:

The automated metadata recommendation employs a least privilege model to limit permissions and access and complies with data.all's security posture.
The automated metadata recommendation is available only to authenticated data.all users and has the same data access permissions as the user. Only the data owner can generate or update metadata.

US6.

As a data.all developer and maintainer, I want the automated metadata recommendation feature to be configurable, scalable, reliable, and seamlessly integrated into the data.all platform, so that I can ensure a smooth and efficient user experience for all data.all users.

Acceptance Criteria:

The automated metadata recommendation is modularized and can be turned on and off.
The automated metadata recommendation can support a high load of requests and efficiently manages calls to models.
The automated metadata recommendation is seamlessly integrated into the data.all user interface in the dataset view without significant changes in user experience

US7.

As a data.all developer and maintainer, I want to be able to configure rate limits for the automated metadata recommendation feature so that I can prevent overuse and ensure responsible access to the feature.

Acceptance Criteria:

Maintainers can set thresholds for daily use metrics like number of times the automated metadata recommendation can be executed per user.
Once a user hits the configured threshold, the feature notifies the user that the usage limit has been reached and prevents that user from regenerating metadata until the next day.
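
As an illustration of the per-user daily threshold described in US7, the following is a minimal sketch and not the data.all implementation. The class name `DailyUsageLimiter`, its methods, and the in-memory counter are all assumptions made for this example; a real deployment would need persistent storage shared across API instances.

```python
from collections import defaultdict
from datetime import date


class DailyUsageLimiter:
    """Hypothetical per-user daily counter (illustrative only)."""

    def __init__(self, max_calls_per_day: int):
        # Threshold that a maintainer would configure.
        self.max_calls_per_day = max_calls_per_day
        # Maps (username, day) -> number of generation calls made on that day.
        self._calls = defaultdict(int)

    def try_acquire(self, username: str, today: date) -> bool:
        """Record one call and return True, or return False if the limit is reached."""
        key = (username, today)
        if self._calls[key] >= self.max_calls_per_day:
            return False
        self._calls[key] += 1
        return True

    def remaining(self, username: str, today: date) -> int:
        """Number of calls the user can still make on the given day."""
        return self.max_calls_per_day - self._calls[(username, today)]
```

Because the counter is keyed by calendar day, a user who is blocked today is allowed again on the next day without any reset job. When `try_acquire` returns `False`, the caller would show the usage-limit notification instead of invoking the model.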

US8.

As a data.all developer and maintainer, I want the automated metadata recommendation feature to clearly display a disclaimer about the limitations and confidentiality of the responses, so that I understand the context and boundaries of the AI-generated information.

Acceptance Criteria:

The automated metadata recommendation feature UI always presents a disclaimer that cannot be easily missed by the user and states: “Carefully review this AI-generated response for accuracy ......”

US9.

As a data.all developer and maintainer, I want the automated metadata recommendation feature to provide feedback functionality so that users can easily indicate if the response was helpful or not, which can then be used to improve the quality of future responses.

Acceptance Criteria:

The automated metadata recommendation feature includes a thumbs up/down widget that users can click to provide feedback on the response which is captured and used to refine and improve the automated metadata recommendation responses over time.
Users receive a confirmation message after providing feedback, assuring them that their input will be used to enhance the feature.

Scope

1/ Metadata Generation:

Implement a 'Generate Metadata' or 'AI Icon' action button that data producers can access after creating a new dataset or importing an existing dataset into data.all.
When the "Generate Metadata" action is triggered, the system should automatically generate the metadata including Dataset description, Table descriptions, Column descriptions etc
Allow data producers to select specific tables and/or folders within a dataset for which they want to generate metadata, or generate it for all tables and all folders by default.
Ensure a seamless user experience by eliminating the need for users to manually fill in metadata and avoiding any duplication or overriding of the metadata that is generated automatically by the feature.

2/ Metadata Review and Acceptance:

After the automated metadata generation, display the recommended metadata in an interface for the data producer to review the AI-generated metadata, make edits, and annotate the information to ensure accuracy and relevance.
Implement "Accept Recommendation", "Edit Recommendation" and "Reject Recommendation" actions, allowing the data producer to control which metadata is persisted in the data.all catalog for their dataset.

3/ Backward Compatibility for Existing Datasets:

Extend the "Generate Metadata" functionality to support data producers' existing datasets in data.all.
Provide a way for data producers to trigger the automated metadata generation for their existing datasets, ensuring backward compatibility and enabling them to update the metadata for existing data assets.

4/ On-demand Metadata Refresh:

Offer a user-friendly action for data producers to initiate on-demand automated metadata recommendation when an existing table schema changes or when new tables are added, to ensure completeness and correctness.
This process must still follow the Metadata Review and Acceptance workflow described in 2/.

5/ Metadata-driven Search and Filtering:

Leverage the accepted metadata to enhance the data.all search and filtering capabilities, enabling data consumers to quickly discover relevant datasets based on the enriched information.

Out of Scope

Bring Your Own Model: The automated metadata recommendation feature will not support the ability for users to bring their own language models.
Fine Tuning: This feature does not include fine-tuning an LLM to obtain a customized model. data.all is deployed in the customer's environment, so there is no data on user-executed requests, and fine-tuning requires a significant amount of data to align the model to a particular domain or task.
Role Management: The automated metadata recommendation feature will assume the role of a generic data producer persona and will not be customized for different user personas.

Guardrails

Transparency and Disclosure: This feature is in an experimental stage. The metadata provided should be considered as a starting point, and users are encouraged to "trust but verify" the information, as there may be limitations or uncertainties in the responses.
Truthfulness and Integrity: The feature aims to provide truthful and complete metadata information to the best of its abilities. However, it is possible that the dataset summaries, column names, or descriptions may not be entirely accurate. Users should review the metadata carefully and report any issues or discrepancies.
Clear and Informative Error Messages: If the model encounters any issues or is unable to provide the requested metadata, it will return a clear error message instead of generating an incorrect response.
Human Review and Acceptance: After the model generates the metadata response, the user will be prompted to review the information. The user must explicitly accept the metadata before it can be used or saved. This human-in-the-loop approach ensures that the metadata is verified and approved before being utilized.
Cost: Usage will be restricted to a specific metric per day per user to promote responsible use. The choice of model will be determined through a frugal evaluation of functionality and cost. Estimated usage costs will be published to allow customers to make informed decisions.

Describe alternatives you've considered
See design below

Additional context
This feature will first be implemented as an MVP and then reworked to make it production-ready.

P.S. Please Don't attach files. Add code snippets directly in the message body instead.

### Feature - Feature ### Detail - Automated metadata generation using gen AI. MVP phase ### Related #1599 By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. --------- Co-authored-by: dlpzx <[email protected]>

dlpzx · 2024-10-06T16:31:08Z

Design

User Experience

Table/Folder metadata generation

User navigates to data.all Dataset view and clicks on “Generate Metadata” button. When clicking on the button a modal window opens (generateMetadataSelectorWindow). In this window the user can select either the Dataset OR a selection of tables and folders for which to generate the metadata. The user checks a selection of tables and folders and presses “Generate Metadata” at the bottom of the modal window.
A new modal window appears where the user can see the current metadata values and the suggested new values (we might need to create a second modal for a detail view per table/folder). The user can edit the suggested metadata directly. In addition, they can click on the following buttons:
- Reject suggestions - it closes the window and it keeps the old metadata values
- Re-generate metadata - it creates new suggestions
- Save - saves the current suggested metadata (directly generated or generated+edited)
The suggested metadata includes:
- table labels*, descriptions, tags, topics + column descriptions
- folder labels*, descriptions, tags, topics
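
To make the outcome of the review buttons concrete, here is a small hedged sketch of how "Reject suggestions" and "Save" could resolve to the metadata that ends up stored. The function name `resolve_review`, the action strings, and the dictionary shape are hypothetical. Keeping current values for fields that have no suggestion is also an assumption. "Re-generate metadata" is not modelled because it produces a new suggestion rather than a final value.

```python
from typing import Any, Dict, Optional

Metadata = Dict[str, Any]


def resolve_review(
    current: Metadata,
    suggested: Metadata,
    action: str,
    edits: Optional[Metadata] = None,
) -> Metadata:
    """Return the metadata to keep after the user reviews a suggestion.

    action is "reject" (keep the old values) or "save" (keep the suggestion,
    with any manual edits applied on top). Both names are illustrative.
    """
    if action == "reject":
        # Closing the window leaves the old metadata untouched.
        return dict(current)
    if action == "save":
        # Assumption: fields without a suggestion keep their current value;
        # user edits take precedence over generated values.
        return {**current, **suggested, **(edits or {})}
    raise ValueError(f"Unknown review action: {action!r}")
```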

Dataset metadata generation

User navigates to data.all Dataset view and clicks on “Generate Metadata” button. When clicking on the button a modal window opens. In this window the user can select either the Dataset OR a selection of tables and folders for which to generate the metadata. The user selects the Dataset and presses “Generate Metadata” at the bottom of the modal window.
The API evaluates whether the folders and tables metadata is enough to generate metadata for the dataset.
If no:
- It re-opens the generateMetadataSelectorWindow with a warning message on top that tells the user that before creating metadata for the Dataset, we need to create metadata for the tables and folders.
If yes:
- A new modal window appears where the user can see the current metadata values and the suggested new values. The user can edit the suggested metadata directly. In addition, they can click on the following buttons:
  - Reject suggestions - it closes the window and it keeps the old metadata values
  - Re-generate metadata - it creates new suggestions
  - Save - saves the current suggested metadata (directly generated or generated+edited)
- The suggested metadata includes:
  - dataset labels*, description, tags, topics
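
The check for whether table and folder metadata is "enough" is not specified in detail above. One possible rule, shown purely as an assumption, is to require a non-empty description on every table and folder before dataset-level generation. The function name `children_missing_metadata` and the input shape are hypothetical.

```python
from typing import Any, Dict, List


def children_missing_metadata(children: List[Dict[str, Any]]) -> List[str]:
    """Return the names of tables/folders that have no usable description.

    Assumed rule: a child is "ready" when its description is a non-blank string.
    An empty result means dataset-level metadata can be generated.
    """
    missing = []
    for child in children:
        description = (child.get("description") or "").strip()
        if not description:
            missing.append(child["name"])
    return missing
```

With a rule like this, a non-empty result would correspond to the "If no" branch: the selector window is re-opened with a warning that lists the tables and folders that still need metadata.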

There will be a limit on the number of Generate Metadata API calls per day or per day/team. If the limit is exceeded, a descriptive error message will appear in the top banner.

Data analysis

For this use-case it is relevant to describe the different types of data and metadata that would serve as input to the generation of metadata. Depending on the data there will be different genAI workflows.

Data.all S3 Datasets: (S3 Bucket + Glue database)

Tables - Glue tables containing structured data. There is technical information stored in the Glue Catalog, including: column labels, column descriptions...
Folders - S3 Prefix that could contain any type of file.

Data.all Redshift Datasets [v.2.7.0] : We need to keep it in mind for the design, but the feature won’t be implementing metadata in Redshift in its first release.

Tables - Redshift tables containing structured data.

Data scenarios

For column metadata generation (column name and column description):

| Scenario | Input data for genAI | Comments |
|---|---|---|
| Glue tables with meaningful column names and description | Use the column description to verify if the name is good and vice versa | |
| Glue tables with no column descriptions and cryptic names | Random selection of 100 items of the table (like current preview) + metadata in RDS | |
For Table and Folder metadata generation:

| Scenario | Input data for genAI | Comments |
|---|---|---|
| Tables with meaningful metadata | Metadata in RDS | |
| Tables with poor metadata | Select randomized items of the table (like current preview) + metadata in RDS | |
| Folders containing files | Read file names and extensions to produce a summary | |
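
For folders, reading file names and extensions could be reduced to a compact summary before it is passed to the model. The following is a minimal sketch under that assumption; `summarize_folder` is a hypothetical helper that takes a list of S3 object keys that has already been retrieved.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Any, Dict, List


def summarize_folder(keys: List[str]) -> Dict[str, Any]:
    """Count files per (lower-cased) extension for a list of S3 object keys."""
    extensions: Counter = Counter()
    for key in keys:
        if key.endswith("/"):
            continue  # skip zero-byte "directory" placeholder keys
        suffix = PurePosixPath(key).suffix.lower()
        extensions[suffix or "(none)"] += 1
    return {"file_count": sum(extensions.values()), "extensions": dict(extensions)}
```

Note that only the last suffix is used, so a key ending in `.tar.gz` is counted under `.gz`.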

For Dataset metadata generation

| Scenario | Input data for genAI | Comments |
|---|---|---|
| Folders and Tables with meaningful metadata | Summary of table and folder descriptions | |
| Folders and Tables with poor metadata | Generate metadata for tables and folders and then generate metadata for Dataset | |

High Level Design

1. The user logs into data.all.
2. In the DatasetView, the user clicks the “generateMetadata” button.
3. A modal window, MetadataSelectionModal, opens, in which the user can select either the Dataset or a list of tables and folders. Finally, the user clicks the “generate” button.
4. An API call is made to the handler, which fetches data from RDS about the dataset, table, or folder. The handler invokes Bedrock, sending the RDS metadata.
5. The Bedrock LLM decides whether the metadata is enough to generate the metadata. If yes, it executes step 11, generates the description and other metadata, and returns the response.
6. If no, it returns “NotEnoughData”.
7. In the UI, the user is presented with a button to read sample data to generate metadata. Clicking it calls the readSampleData API.
8. The readSampleData API reads a random sample of data using Athena (tables) or S3 (folders).
9. The sample data is returned to the UI view, which shows a loading state while the sample data is being read.
10. Once it has returned, either the UI automatically or the user proactively triggers the generate metadata API call again, this time sending the sample data.
11. The API handler calls Bedrock again, this time with sample data. In Bedrock, the LLM generates a description and other metadata (label, tags...).
12. The metadata is returned to the UI.
13. The metadata is displayed in the UI modal window MetadataReviewModal, where the user can reject the suggestion, regenerate the metadata (triggers the flow again with a maximum of 3 retries), edit the metadata, and finally save the metadata.
14. When the user clicks save metadata, the save metadata API is triggered.
15. The API handler stores the new metadata in RDS.
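
The generate flow above (first model call with RDS metadata, “NotEnoughData” fallback, sample read, second model call) can be sketched as a single function. This is a simplification: in the design the UI drives the second call, whereas here both calls are collapsed into one place for readability. `invoke_model` and `read_sample_data` are hypothetical callables standing in for the Bedrock invocation and the readSampleData API; no real Bedrock or Athena API is used.

```python
from typing import Any, Callable, Dict, List, Optional, Union

NOT_ENOUGH_DATA = "NotEnoughData"

Metadata = Dict[str, Any]
Sample = List[Dict[str, Any]]
ModelResponse = Union[str, Metadata]


def generate_metadata(
    rds_metadata: Metadata,
    invoke_model: Callable[[Metadata, Optional[Sample]], ModelResponse],
    read_sample_data: Callable[[], Sample],
) -> Metadata:
    """Try with catalog metadata only, then fall back to sample data once."""
    # First attempt: only the metadata already stored in RDS is sent.
    response = invoke_model(rds_metadata, None)
    if response != NOT_ENOUGH_DATA:
        return response

    # Fallback: read a random sample and call the model again with it.
    sample = read_sample_data()
    response = invoke_model(rds_metadata, sample)
    if response == NOT_ENOUGH_DATA:
        # Surface a clear error instead of returning a guessed description.
        raise ValueError("Not enough data to generate metadata, even with sample data")
    return response
```

Sample data is only read when the first call reports that the stored metadata is insufficient, so well-documented tables never trigger a data read.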

dlpzx mentioned this issue Oct 1, 2024

Automated metadata generation using genAI MVP #1598

Merged

dlpzx self-assigned this Oct 4, 2024

dlpzx added this to v2.8.0 Oct 4, 2024

github-project-automation bot moved this to Nominated in v2.8.0 Oct 4, 2024

dlpzx added type: newfeature New feature request priority: medium effort: medium effort: large priority: high and removed priority: medium labels Oct 4, 2024

dlpzx linked a pull request Oct 29, 2024 that will close this issue

Automated metadata generation using genAI #1670

Draft

dlpzx removed this from v2.8.0 Oct 29, 2024

dlpzx added this to v2.7.0 Oct 29, 2024

github-project-automation bot moved this to Nominated in v2.7.0 Oct 29, 2024

dlpzx moved this from Nominated to In progress in v2.7.0 Oct 29, 2024
