Speeding up mapping with HuggingFace datasets #82
Hi @MaveriQ, I seem to have missed this message when you originally sent it. It is true that the Python implementation of Epitran is very slow. If you can take advantage of concurrency, you can do better. I am in the process of rewriting parts of Epitran in Rust, which, based on initial tests, will increase performance by about 100 times.

However, for English this will not help as much, since English support is provided by Flite (written in C) and the only Python part converts the ARPAbet representation from Flite to IPA.
Hi. If by concurrency you meant multiprocessing, I already tried that, but it's still pretty slow. Can you recommend anything else for English? Thanks
As I look at this issue, the real problem is that Epitran spawns a shell every time it calls `lex_lookup`.

The second solution would be easier, but ultimately less satisfying.
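To illustrate why the per-word shell spawn is the bottleneck, here is a minimal sketch that amortizes process startup by feeding an entire batch of words through a single subprocess instead of spawning one shell per word. `tr` is used here only as a portable stand-in for the `lex_lookup` binary, which may not be on your `PATH`:

```python
import subprocess

def batch_through_process(words, cmd):
    """Run one process for the whole batch instead of one shell per word."""
    proc = subprocess.run(
        cmd,
        input="\n".join(words),  # feed all words at once on stdin
        capture_output=True,
        text=True,
    )
    return proc.stdout.splitlines()

# Stand-in example: uppercase via a single `tr` process.
# With lex_lookup installed, cmd would instead invoke it once per batch.
result = batch_through_process(["cat", "dog"], ["tr", "a-z", "A-Z"])
```

The startup cost of the external program is then paid once per batch rather than once per word.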
FYI for future reference, an extremely dirty quickfix would be:

```python
import re

words = [...]  # words to be transliterated

# Write one lex_lookup invocation per word into a shell script.
with open("eng_words.sh", "w") as f:
    f.write("\n".join([f"lex_lookup {w} | head -n 1" for w in words]))

# Notebook-cell syntax (Jupyter/Colab); in plain Python use subprocess instead.
!bash eng_words.sh > eng_lex.txt

lexs = open("eng_lex.txt").readlines()

arpa_to_ipa = {'ey': 'ej', 'ae': 'æ', 'iy': 'i', 'eh': 'ɛ', 'ay': 'aj', 'ih': 'ɪ', 'ow': 'ow', 'aa': 'ɑ', 'ao': 'ɔ', 'aw': 'aw', 'oy': 'oj', 'ah': 'ʌ', 'ax': 'ə', 'uw': 'u', 'uh': 'ʊ', 'er': 'ɹ̩', 'b': 'b', 'ch': 't͡ʃ', 'd': 'd', 'dx': 'ɾ', 'f': 'f', 'g': 'ɡ', 'hh': 'h', 'jh': 'd͡ʒ', 'k': 'k', 'l': 'l', 'em': 'm̩', 'm': 'm', 'en': 'n̩', 'n': 'n', 'ng': 'ŋ', 'p': 'p', 'q': 'ʔ', 'r': 'ɹ', 's': 's', 'sh': 'ʃ', 't': 't', 'dh': 'ð', 'th': 'θ', 'v': 'v', 'w': 'w', 'y': 'j', 'z': 'z', 'zh': 'ʒ'}

ipas = []
for lex in lexs:
    lex = lex.strip()[1:-1].split()                   # drop surrounding parentheses
    lex = map(lambda d: re.sub(r'\d', '', d), lex)    # strip stress digits (ah0 -> ah)
    ipa = map(lambda d: arpa_to_ipa[d], lex)
    ipas.append("".join(list(ipa)))

word_to_ipa = dict(zip(words, ipas))  # key: word, value: transliteration result
```
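The parsing loop above can be factored into a small function, which makes it easier to test and to extend (e.g. to skip unknown symbols). This is a sketch that assumes `lex_lookup` prints one parenthesized, space-separated ARPAbet line per word, as the code above does:

```python
import re

# Subset of the ARPAbet-to-IPA table above, enough for the example.
ARPA_TO_IPA = {'hh': 'h', 'ah': 'ʌ', 'l': 'l', 'ow': 'ow'}

def arpa_line_to_ipa(line, table):
    """Convert one lex_lookup output line, e.g. "(hh ah0 l ow1)", to IPA."""
    phones = line.strip()[1:-1].split()        # drop parentheses, split on spaces
    phones = (re.sub(r'\d', '', p) for p in phones)  # strip stress digits
    return "".join(table[p] for p in phones)
```

With the full `arpa_to_ipa` dictionary, the loop then reduces to `ipas = [arpa_line_to_ipa(lex, arpa_to_ipa) for lex in lexs]`.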
Hi. I am trying to convert corpora from HuggingFace to their IPA form with the following snippet, but I am getting really slow speeds, only a couple of examples per second. Do you know how it can be sped up? Thanks
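Independently of Epitran's own speed, `datasets.map` overhead can be reduced by processing examples in batches and caching repeated words, since natural-language corpora repeat most tokens many times. A sketch, using an uppercasing stub in place of the real `epi.transliterate` call so it runs standalone; the column name `"text"` is also an assumption:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def to_ipa(word):
    # Stub standing in for epi.transliterate(word); the cache makes
    # repeated words free, which matters for large corpora.
    return word.upper()

def transliterate_batch(batch):
    # With batched=True, datasets.map passes a dict of lists, not one example.
    return {"ipa": [to_ipa(w) for w in batch["text"]]}

# ds = ds.map(transliterate_batch, batched=True, num_proc=4)
```

Note that with `num_proc > 1` each worker process keeps its own cache, so the cache helps most within a worker's shard.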