Mapping from word sequences to subword sequences #4

Open
KazutoshiShinoda opened this issue Oct 29, 2021 · 0 comments
KazutoshiShinoda commented Oct 29, 2021

def match_tokenized_to_untokenized(self, tokenized_sent, untokenized_sent):

Regarding this function, I found the error case below. It may be minor, but I am reporting it just for your information.
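For context (my own check, not code from this repo): WordPiece only prepends '##' to continuation pieces *within* a basic-tokenizer word. When the basic tokenizer splits on punctuation first, the resulting pieces carry no '##' prefix, which is exactly what the loop below misses. The expected token lists match the result printed further down in this issue.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Inside-word continuations get the '##' prefix ...
print(tokenizer.tokenize('pretrained'))       # ['pre', '##train', '##ed']
# ... but punctuation splits do not, so the '##' heuristic skips them.
print(tokenizer.tokenize('domain-specific'))  # ['domain', '-', 'specific']
```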

from collections import defaultdict
from transformers import AutoTokenizer

# preparing an example
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
untokenized_sent = 'pretrained language models prone to learn domain-specific spurious correlations between input and output .'.split()
tokenized_sent = tokenizer.tokenize(tokenizer.cls_token + ' '.join(untokenized_sent) + tokenizer.sep_token)

# exactly the same as `match_tokenized_to_untokenized` for generating `mapping`
mapping = defaultdict(list)
untokenized_sent_index = 0
tokenized_sent_index = 1
while (untokenized_sent_index < len(untokenized_sent) and tokenized_sent_index < len(tokenized_sent)):
    while (tokenized_sent_index + 1 < len(tokenized_sent) and tokenized_sent[tokenized_sent_index + 1].startswith('##')):
        mapping[untokenized_sent_index].append(tokenized_sent_index)
        tokenized_sent_index += 1
    mapping[untokenized_sent_index].append(tokenized_sent_index)
    untokenized_sent_index += 1
    tokenized_sent_index += 1

# verifying the mapping is correct or not
for i in mapping:
    j = mapping[i]
    print(untokenized_sent[i], tokenized_sent[j[0]:j[-1]+1])

Result:

pretrained ['pre', '##train', '##ed']
language ['language']
models ['models']
prone ['prone']
to ['to']
learn ['learn']
domain-specific ['domain']
spurious ['-']
correlations ['specific']
between ['spur', '##ious']
input ['correlation', '##s']
and ['between']
output ['input']
. ['and']
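A possible fix, sketched below under the assumption that a fast tokenizer is available (this is not this repo's code): instead of relying on the '##' prefix, let the tokenizer report the word-to-subword alignment itself via `BatchEncoding.word_ids()`, which handles punctuation splits like 'domain', '-', 'specific' correctly.

```python
from collections import defaultdict
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
untokenized_sent = ('pretrained language models prone to learn domain-specific '
                    'spurious correlations between input and output .').split()

# Tokenize the pre-split words; the fast tokenizer records, per subword
# token, which input word it came from (None for [CLS]/[SEP]).
encoding = tokenizer(untokenized_sent, is_split_into_words=True)

mapping = defaultdict(list)
for token_index, word_index in enumerate(encoding.word_ids()):
    if word_index is not None:  # skip special tokens
        mapping[word_index].append(token_index)

tokens = encoding.tokens()
for i in mapping:
    print(untokenized_sent[i], [tokens[j] for j in mapping[i]])
```

With this, every word gets its full list of subword pieces, e.g. 'domain-specific' maps to ['domain', '-', 'specific'] instead of shifting the alignment for all following words.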