Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent word swap #752

Merged
merged 6 commits into from
Nov 5, 2023
Merged

Consistent word swap #752

merged 6 commits into from
Nov 5, 2023

Conversation

k-ivey
Copy link
Collaborator

@k-ivey k-ivey commented Oct 5, 2023

What does this PR do?

Summary

This PR adds a new consistent parameter to WordSwapChangeLocation and WordSwapChangeName so that all original instances of the same word get updated to the same new word. This allows for consistency when the semantics of the words matter. For example, "Alice likes Bob. Alice does not like Eve." could be transformed into "Kate likes Bob. Kate does not like Eve" if consistent = True. Otherwise, the text could be transformed into "Alice likes Bob. Kate does not like Eve." which alters the semantics of the sentence.

Additions

  • Added consistent parameter to WordSwapChangeLocation and WordSwapChangeName.

Changes

  • WordSwapChangeLocation has been updated to only make one modification to the original text for marginal performance gains.
  • WordSwapChangeLocation has been updated to correctly capitalize locations consisting of more than one word to match the naming convention used in textattack/shared/data.py. This change allows for replacement words for such multi-word locations to be found.

Checklist

  • [ x ] The title of your pull request should be a summary of its contribution.
  • [ x ] Please write detailed description of what parts have been newly added and what parts have been modified. Please also explain why certain changes were made.
  • [ x ] If your pull request addresses an issue, please mention the issue number in the pull request description to make sure they are linked (and people consulting the issue know you are working on it)
  • [ x ] To indicate a work in progress please mark it as a draft on Github.
  • [ x ] Make sure existing tests pass.
  • [ x ] Add relevant tests. No quality testing = no merge.
  • [ x ] All public methods must have informative docstrings that work nicely with sphinx. For new modules/files, please add/modify the appropriate .rst file in TextAttack/docs/apidoc.'

@qiyanjun
Copy link
Member

@k-ivey please add test function showing the revised parts are working as intended.

@k-ivey
Copy link
Collaborator Author

k-ivey commented Oct 26, 2023

@qiyanjun I added two new tests in tests/test_transformations.py:

  • test_word_swap_change_location_consistent()
  • test_word_swap_change_name_consistent()

Both tests modify an initial text with duplicated location or name and verify that the transformed text contains no instances of the initial location or name.

Since the PyTest action appears to be failing due to memory issues, I'll introduces changes to the workflow to increase the swap size of the machine.

@qiyanjun qiyanjun self-assigned this Oct 28, 2023
@qiyanjun
Copy link
Member

@k-ivey please sync your branch with the main

@k-ivey
Copy link
Collaborator Author

k-ivey commented Oct 28, 2023

@qiyanjun I synced my branch and the tests now pass

@qiyanjun qiyanjun requested review from qiyanjun and removed request for Hanyu-Liu-123 October 29, 2023 02:35
Copy link
Member

@qiyanjun qiyanjun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name change seems ok.

But the location change needs some debugging.

for instance:


s = "I am in New York. I love living in San Diego. San Diego is better than New York"
>>> s_augmented = augmenter.augment(s)
>>> print(s_augmented)
['I am in New York. I love living in San Diego. Everett is better than New York']
>>> 

@k-ivey
Copy link
Collaborator Author

k-ivey commented Nov 1, 2023

I've looked a bit into this and it appears to be a consequence of the NER tagger along with multi-word locations. For the sentence

s = "I am in New York. I love living in San Diego. San Diego is better than New York"

The first instance of "New York" is tagged as a location. However, for the second instance of "New York", only "York" is tagged as a location. Similarly, the second instance of "San Diego" is tagged as a location, but for the first instance, only "San" is tagged as a location. I'll work some more on this, as it will be tricky to handle edge cases such as text that has "San Diego" and "San Francisco".

@k-ivey
Copy link
Collaborator Author

k-ivey commented Nov 2, 2023

Looking into why the formatting check fails. When I run it locally, it passes:

$ black textattack/transformations/ --check
All done! ✨ 🍰 ✨
39 files would be left unchanged.

Resolved: Updating my local versions of black and isort fixed this.

@k-ivey
Copy link
Collaborator Author

k-ivey commented Nov 2, 2023

I've made changes to address the issue. In short, the issue was that the NER tagger may not tag all words in a multi-word location (e.g. New York). To fix this, we can find all windows that the NER tagger identifies as a location, sort these by the number of words in the window, and then extend the shorter windows to see if the extended window is a location we've already seen. If so, the update the map of location to its appearances. As a concrete example:

s = "I am in New York. I love living in San Diego. San Diego is better than New York"

The locations identified by NER are [ [[3, 4], "New York"], [[9], "San"], [[11, 12], "San Diego"], [[17], "York"], where the indices are word indices in the sentence. The longest window is 2 and the shortest is 1, so we need to expand the windows of length 1 to length 2:

  • For "San" at word index 9, we expand it to "in San" and "San Diego". The latter is a location that already appears, so the value for "San Diego" in the map is updated to include this expanded window.
  • For "York" at word index 17, we expand it to "New York", so the value for "San Diego" in the map is updated to include this expanded window.

This does not handle all issues with the NER tagger. For example, in the example sentence above, if the NER tagger only tagged the first instance of "New York" as a location and did not tag either "New" or "York" as a location in the second instance, then the second instance would never be updated.

@qiyanjun
Copy link
Member

qiyanjun commented Nov 5, 2023

It still has not totally solved the change location issues.. but better than the current version.. so it merged

>>> s = "I am in New York. San Diego is better than New York"
>>> s_augmented = augmenter.augment(s)
>>> print(s_augmented)
['I am in New York. San Diego is better than New Grand Island']

@qiyanjun qiyanjun merged commit db4ae20 into QData:master Nov 5, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants