Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrating GPQA Dataset #47

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Integrating GPQA Dataset #47

wants to merge 11 commits into from

Conversation

vshrivas
Copy link
Collaborator

@vshrivas vshrivas commented Nov 6, 2024

No description provided.

@vshrivas
Copy link
Collaborator Author

vshrivas commented Nov 6, 2024 via email

eureka_ml_insights/configs/model_configs.py Outdated Show resolved Hide resolved
tests/pipeline_tests.py Show resolved Hide resolved
tests/data_utils_tests/shuffle_columns_tests.py Outdated Show resolved Hide resolved
eureka_ml_insights/data_utils/transform.py Outdated Show resolved Hide resolved


@dataclass
class ColumnMatchMap(MultiColumnTransform):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like this should be a MultiColumnTransform. Do you intend to support multiple columns as input?

Copy link
Collaborator Author

@vshrivas vshrivas Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this depends on the definition of a MultiColumnTransform. Multiple columns are not getting updated, but we do pass in a list of multiple columns. Specifically the value in key_col is compared to the values in each of the columns passed in, and the value in new_col is assigned accordingly. For example, in the GPQA pipeline this is how we call the transform. The value in "Correct Answer" column is compared to the values in columns "A", "B", "C", and "D". If the value of "Correct Answer" matches with the value of "A", then "A" would be stored in the "ground_truth" column.

ColumnMatchMap(
                                new_col="ground_truth", key_col="Correct Answer", columns=["A", "B", "C", "D"]
                            ),

If this doesn't fit the definition of a MultiColumnTransform, I can change the class to inherit from DFTransformBase instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely not a great fit as apparent by the amount of logic that is reused from MultiColumnTransform vs. overridden. You are reusing the validate method of MultiColumnTransform to validate if columns exist in the df, but you're not validating the same for key_col. Also you are having to fully override the transform method from MultiColumnTransform. So overall I think this inheritance is confusing for the reader.

eureka_ml_insights/data_utils/transform.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants