Mapping from word sequences to subword sequences #4

Open
KazutoshiShinoda opened this issue Oct 29, 2021 · 0 comments
KazutoshiShinoda commented Oct 29, 2021

def match_tokenized_to_untokenized(self, tokenized_sent, untokenized_sent):

Regarding this function, I found the error case below. It may be minor, but I am reporting it just for your information.
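For context (my own check, not code from this repo): WordPiece only prepends '##' to continuation pieces *within* a basic-tokenizer word. When the basic tokenizer splits on punctuation first, the resulting pieces carry no '##' prefix, which is exactly what the loop below misses. The expected token lists match the result printed further down in this issue.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Inside-word continuations get the '##' prefix ...
print(tokenizer.tokenize('pretrained'))       # ['pre', '##train', '##ed']
# ... but punctuation splits do not, so the '##' heuristic skips them.
print(tokenizer.tokenize('domain-specific'))  # ['domain', '-', 'specific']
```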

from collections import defaultdict
from transformers import AutoTokenizer

# preparing an example
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
untokenized_sent = 'pretrained language models prone to learn domain-specific spurious correlations between input and output .'.split()
tokenized_sent = tokenizer.tokenize(tokenizer.cls_token + ' '.join(untokenized_sent) + tokenizer.sep_token)

# exactly the same as `match_tokenized_to_untokenized` for generating `mapping`
mapping = defaultdict(list)
untokenized_sent_index = 0
tokenized_sent_index = 1
while (untokenized_sent_index < len(untokenized_sent) and tokenized_sent_index < len(tokenized_sent)):
    while (tokenized_sent_index + 1 < len(tokenized_sent) and tokenized_sent[tokenized_sent_index + 1].startswith('##')):
        mapping[untokenized_sent_index].append(tokenized_sent_index)
        tokenized_sent_index += 1
    mapping[untokenized_sent_index].append(tokenized_sent_index)
    untokenized_sent_index += 1
    tokenized_sent_index += 1

# verifying the mapping is correct or not
for i in mapping:
    j = mapping[i]
    print(untokenized_sent[i], tokenized_sent[j[0]:j[-1]+1])

Result:

pretrained ['pre', '##train', '##ed']
language ['language']
models ['models']
prone ['prone']
to ['to']
learn ['learn']
domain-specific ['domain']
spurious ['-']
correlations ['specific']
between ['spur', '##ious']
input ['correlation', '##s']
and ['between']
output ['input']
. ['and']
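A possible fix, sketched below under the assumption that a fast tokenizer is available (this is not this repo's code): instead of relying on the '##' prefix, let the tokenizer report the word-to-subword alignment itself via `BatchEncoding.word_ids()`, which handles punctuation splits like 'domain', '-', 'specific' correctly.

```python
from collections import defaultdict
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
untokenized_sent = ('pretrained language models prone to learn domain-specific '
                    'spurious correlations between input and output .').split()

# Tokenize the pre-split words; the fast tokenizer records, per subword
# token, which input word it came from (None for [CLS]/[SEP]).
encoding = tokenizer(untokenized_sent, is_split_into_words=True)

mapping = defaultdict(list)
for token_index, word_index in enumerate(encoding.word_ids()):
    if word_index is not None:  # skip special tokens
        mapping[word_index].append(token_index)

tokens = encoding.tokens()
for i in mapping:
    print(untokenized_sent[i], [tokens[j] for j in mapping[i]])
```

With this, every word gets its full list of subword pieces, e.g. 'domain-specific' maps to ['domain', '-', 'specific'] instead of shifting the alignment for all following words.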