Add CleanCoNLL object #3557

Merged
merged 12 commits into from
Dec 6, 2024

susannaruecker
Collaborator

Here's the PR for adding a CleanCoNLL object. Simple usage:

from flair.datasets import CLEANCONLL

cleanconll = CLEANCONLL()
print(cleanconll)

When called for the first time, this will download the necessary files, so

It then applies the patch files to the original CoNLL-03 tokens (for our new line breaks etc.) and merges the patched tokens with our new annotations.

Note: As requested, I replaced all previous calls to bash scripts with pure Python. In particular, the patching process, which until now was

subprocess.run(['patch', str(file_path), str(patch_path), '-o', output_path])

is now done with our own methods, which are unfortunately rather lengthy...
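To illustrate what replacing the `patch` call with pure Python can involve, here is a minimal sketch. It assumes the patch files use the classic "normal" diff format (commands such as `2,3c2` or `4a4`, followed by `< old` and `> new` lines); the actual CleanCoNLL patch format and the methods in this PR may differ, and the function names `parse_normal_diff` and `apply_changes` are hypothetical.

```python
import re

# Matches normal-diff commands such as "4a4", "1d0" or "2,3c2".
_COMMAND = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def parse_normal_diff(patch_text):
    """Collect the changes described by a normal-format diff (assumed format)."""
    changes = []
    current_change = None
    for line in patch_text.splitlines():
        match = _COMMAND.match(line)
        if match:
            start = int(match.group(1))
            end = int(match.group(2) or start)
            current_change = {"op": match.group(3), "start": start, "end": end, "new": []}
            changes.append(current_change)
        elif line.startswith("> ") and current_change is not None:
            # Lines prefixed with "> " belong to the new version of the file.
            current_change["new"].append(line[2:])
        # "< old" lines and the "---" separator are not needed to apply the patch.
    return changes


def apply_changes(original_lines, changes):
    """Apply parsed changes to a list of lines and return the patched list."""
    result = list(original_lines)
    # Work from the bottom up so that earlier line numbers stay valid.
    for change in sorted(changes, key=lambda c: c["start"], reverse=True):
        if change["op"] == "a":
            # "a": insert the new lines after original line `start` (1-indexed).
            result[change["start"]:change["start"]] = change["new"]
        else:
            # "c" or "d": replace original lines start..end ("d" has no new lines).
            result[change["start"] - 1:change["end"]] = change["new"]
    return result


patch_text = "2,3c2\n< b\n< c\n---\n> B\n4a4\n> e\n"
patched = apply_changes(["a", "b", "c", "d"], parse_normal_diff(patch_text))
print(patched)
```

Applying the changes from the last one to the first avoids having to track line-number offsets, which keeps the sketch short; a production version would also want to verify that the `< old` lines match the file being patched.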

Please check if it works for you! 🙂 (If you already have the CleanCoNLL files in your .flair/datasets folder, delete them first; otherwise they will simply be read and the new reconstruction will not be tested.)
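Clearing previously downloaded files before testing could look roughly like this. The sketch assumes the default cache location `~/.flair` and a dataset folder named `cleanconll`; both the folder name and the helper `remove_cached_dataset` are hypothetical, so check the actual folder in your `.flair/datasets` directory first.

```python
import shutil
from pathlib import Path


def remove_cached_dataset(dataset_name, cache_root=None):
    """Delete <cache_root>/datasets/<dataset_name> if it exists.

    Returns True if a folder was removed, False if there was nothing to remove.
    """
    # Assumption: the cache lives in ~/.flair unless another root is given.
    root = Path(cache_root) if cache_root is not None else Path.home() / ".flair"
    dataset_dir = root / "datasets" / dataset_name
    if dataset_dir.is_dir():
        shutil.rmtree(dataset_dir)
        return True
    return False


# Hypothetical folder name; uncomment after confirming it on your machine.
# remove_cached_dataset("cleanconll")
```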

Collaborator

@elenamer elenamer left a comment

Thanks a lot for adding this and implementing the patching in python! :)

I'm getting an error TypeError: CLEANCONLL.download_and_prepare_data() takes 1 positional argument but 2 were given, and I saw that the same one is appearing when the tests are run.
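For context, one common cause of this kind of `TypeError` is a method that is defined with only `self` but called with an additional argument. The sketch below reproduces the message with hypothetical classes (`Broken`, `Fixed`) and a hypothetical `data_folder` parameter; it is not the actual Flair code and the real cause in this PR may be different.

```python
class Broken:
    # Defined with `self` only, so it accepts no further arguments.
    def download_and_prepare_data(self):
        return "prepared"


class Fixed:
    # The signature now matches how the method is called.
    def download_and_prepare_data(self, data_folder):
        return f"prepared {data_folder}"


def try_call(obj, argument):
    """Call the method with one argument and report a TypeError as text."""
    try:
        return obj.download_and_prepare_data(argument)
    except TypeError as error:
        return f"TypeError: {error}"


print(try_call(Broken(), "some_folder"))
print(try_call(Fixed(), "some_folder"))
```

The instance itself counts as the first positional argument, which is why the message reports two arguments even though only one was passed explicitly.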

Also, for all checks to pass, you need to run mypy, ruff and black for checks and formatting.

changes = []
current_change = None

with open(patch_file_path, 'r') as patch_file:
Collaborator

It's better to also specify an encoding when reading/writing files, as different operating systems have different default encodings.
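A minimal sketch of what this suggestion means in practice, assuming the files are UTF-8 (the helper names `read_lines` and `write_lines` are hypothetical and not taken from the PR):

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the OS default."""
    with open(path, "r", encoding=encoding) as file:
        return file.read().splitlines()


def write_lines(path, lines, encoding="utf-8"):
    """Write lines with an explicit encoding and fixed "\n" line endings."""
    with open(path, "w", encoding=encoding, newline="\n") as file:
        file.write("\n".join(lines) + "\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "tokens.txt")
    write_lines(example_path, ["Zürich B-LOC", "is O"])
    round_trip = read_lines(example_path)
    print(round_trip)
```

Without the `encoding` argument, `open` falls back to a platform-dependent default, so a file containing non-ASCII tokens can be read differently on different systems.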

Collaborator Author

@susannaruecker susannaruecker Oct 16, 2024

Thanks! I fixed the deprecated argument and added the encoding everywhere.

@susannaruecker
Collaborator Author

Now the checks pass, after some formatting fixes.

@stefan-it
Member

Hi @susannaruecker ,

many thanks for adding this! I have trained FLERT models on the CleanCoNLL dataset with very good results.

I have one question about further experiments with that dataset: would it be possible to officially integrate the SpanTagger training into Flair as well? I opened issue #3457 about this some time ago, and it would be awesome to have support in Flair for this approach :)

@helpmefindaname helpmefindaname merged commit 8ae1ab8 into master Dec 6, 2024
2 checks passed
4 participants