
Speeding up mapping with HuggingFace datasets #82

Open
MaveriQ opened this issue Jun 17, 2021 · 5 comments
MaveriQ commented Jun 17, 2021

Hi. I am trying to convert corpora from HF to their IPA form with the following snippet, but I am getting really slow speeds: only a couple of examples per second. Do you know how it can be sped up? Thanks

import epitran
from datasets import load_dataset

bookscorpus = load_dataset('bookcorpus', split='train')
epi = epitran.Epitran('eng-Latn')

def transliterate(x):
    return {'trans': epi.transliterate(x['text'])}

tokenized = bookscorpus.map(transliterate, num_proc=32)
dmort27 (Owner) commented Nov 10, 2021

Hi @MaveriQ, I seem to have missed this message when you originally sent it. It is true that the Python implementation of Epitran is very slow. If you can take advantage of concurrency, you can do better. I am in the process of rewriting parts of Epitran in Rust, which, based on initial tests, should increase performance by about 100 times.

@dmort27 dmort27 closed this as completed Nov 10, 2021
@dmort27 dmort27 reopened this Nov 10, 2021
dmort27 (Owner) commented Nov 10, 2021

However, for English this will not help as much, since English support is provided by Flite (written in C) and the only Python part is the conversion of Flite's ARPAbet output to IPA.
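As a rough illustration of that last Python step (the mapping below is a tiny subset invented for this example, not Epitran's actual table):

```python
import re

# Tiny illustrative subset of an ARPAbet-to-IPA table; Epitran's real
# table covers the full ARPAbet inventory.
ARPA_TO_IPA = {"hh": "h", "ax": "ə", "l": "l", "ow": "ow"}

def arpabet_to_ipa(tokens):
    """Turn ARPAbet tokens such as ['hh', 'ax', 'l', 'ow1'] into an IPA string."""
    ipa = []
    for tok in tokens:
        tok = re.sub(r"\d", "", tok.lower())  # strip stress digits, e.g. 'ow1' -> 'ow'
        ipa.append(ARPA_TO_IPA[tok])
    return "".join(ipa)

print(arpabet_to_ipa(["hh", "ax", "l", "ow1"]))  # → həlow
```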

MaveriQ (Author) commented Nov 12, 2021

Hi. If by concurrency you meant multiprocessing, I already tried that, but it's still pretty slow. Can you recommend anything else for English? Thanks

dmort27 (Owner) commented Nov 12, 2021

As I look at this issue, the real problem is that Epitran spawns a shell every time it calls lex_lookup to convert a word to IPA. This is expensive (although lex_lookup itself is quite efficient). The solutions would be to:

  • Create Python bindings to the Flite libraries, so the relevant functions could be called directly (rather than via the shell)
  • Add a method to epitran.flite.Flite that passes inputs to lex_lookup in batches.

The second solution would be easier, but ultimately less satisfying.
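A minimal sketch of what that second option could look like (lex_lookup_batch is a hypothetical helper, not part of Epitran; it spawns one shell for the whole batch instead of one per word, and de-duplicates inputs along the way):

```python
import subprocess

def lex_lookup_batch(words, runner=None):
    """Hypothetical batch lookup: one shell spawn for the whole word list.

    `runner` lets the external call be stubbed out for testing; by default
    this pipes one `lex_lookup` command per unique word into a single bash
    process (Flite's lex_lookup must be on PATH for that default to work).
    """
    unique = list(dict.fromkeys(words))  # de-duplicate, preserving order
    if runner is None:
        script = "\n".join(f"lex_lookup {w} | head -n 1" for w in unique)
        proc = subprocess.run(["bash"], input=script, capture_output=True,
                              text=True, check=True)
        lines = proc.stdout.splitlines()
    else:
        lines = runner(unique)
    table = dict(zip(unique, lines))  # word -> raw lex_lookup line
    return [table[w] for w in words]
```

Repeated words cost nothing extra, which matters on natural-language corpora where token frequency is heavily skewed.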

juice500ml (Contributor) commented
FYI for future reference, an extremely dirty quick fix would be:

import re
import subprocess

words = [...]  # words to be transliterated

# One lex_lookup invocation per word; `head -n 1` keeps only the first
# pronunciation when lex_lookup prints several candidates.
with open("eng_words.sh", "w") as f:
    f.write("\n".join(f"lex_lookup {w} | head -n 1" for w in words))

# Run the whole batch in a single shell (the original used IPython's
# `!bash eng_words.sh > eng_lex.txt`).
with open("eng_lex.txt", "w") as out:
    subprocess.run(["bash", "eng_words.sh"], stdout=out, check=True)

lexs = open("eng_lex.txt").readlines()

arpa_to_ipa = {
    'ey': 'ej', 'ae': 'æ', 'iy': 'i', 'eh': 'ɛ', 'ay': 'aj', 'ih': 'ɪ',
    'ow': 'ow', 'aa': 'ɑ', 'ao': 'ɔ', 'aw': 'aw', 'oy': 'oj', 'ah': 'ʌ',
    'ax': 'ə', 'uw': 'u', 'uh': 'ʊ', 'er': 'ɹ̩', 'b': 'b', 'ch': 't͡ʃ',
    'd': 'd', 'dx': 'ɾ', 'f': 'f', 'g': 'ɡ', 'hh': 'h', 'jh': 'd͡ʒ',
    'k': 'k', 'l': 'l', 'em': 'm̩', 'm': 'm', 'en': 'n̩', 'n': 'n',
    'ng': 'ŋ', 'p': 'p', 'q': 'ʔ', 'r': 'ɹ', 's': 's', 'sh': 'ʃ',
    't': 't', 'dh': 'ð', 'th': 'θ', 'v': 'v', 'w': 'w', 'y': 'j',
    'z': 'z', 'zh': 'ʒ',
}

ipas = []
for lex in lexs:
    lex = lex.strip()[1:-1].split()            # drop the surrounding parens
    lex = (re.sub(r'\d', '', d) for d in lex)  # strip stress digits
    ipas.append("".join(arpa_to_ipa[d] for d in lex))

word_to_ipa = dict(zip(words, ipas))  # key: word, value: transliteration result
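Tying this back to the original datasets.map question: once word_to_ipa has been precomputed, the map function becomes a cheap dictionary lookup (the two entries below are sample values for illustration, not actual lex_lookup output):

```python
# Sample precomputed entries standing in for the real word_to_ipa table.
word_to_ipa = {"hello": "həlow", "world": "wɹ̩ld"}

def transliterate(example):
    # Unknown words fall through unchanged; the real table should cover them.
    ipa = " ".join(word_to_ipa.get(w, w) for w in example["text"].split())
    return {"trans": ipa}

print(transliterate({"text": "hello world"})["trans"])  # → həlow wɹ̩ld
```

With the full table, this function can be handed directly to bookscorpus.map(transliterate, num_proc=32); each call is then a dict lookup rather than a shell spawn.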
