detokenize and correct_spaces problem with hyphens and En dashes #51

atlijas · 2024-09-17T14:42:17Z

Using the newest version of Tokenizer, 3.4.5, detokenize and correct_spaces remove one or both of the spaces surrounding a hyphen or an en dash:

>>> from tokenizer import split_into_sentences, detokenize, tokenize, correct_spaces
# En dash and detokenize
>>> sent = 'Hamarinn dugir – og meira en það.'
>>> detokenize(tokenize(sent))
# Expected output: 'Hamarinn dugir – og meira en það.'
# Output: 'Hamarinn dugir–og meira en það.'

# En dash and correct_spaces
>>> s = list(split_into_sentences(sent))[0]
>>> correct_spaces(s)
# Expected output: 'Hamarinn dugir – og meira en það.'
# Output: 'Hamarinn dugir–og meira en það.'

# Hyphen and detokenize
>>> sent = 'Hamarinn dugir - og meira en það.'
>>> detokenize(tokenize(sent))
# Expected output: 'Hamarinn dugir - og meira en það.'
# Output: 'Hamarinn dugir-og meira en það.'

# Hyphen and correct_spaces
>>> s = list(split_into_sentences(sent))[0]
>>> correct_spaces(s)
# Expected output: 'Hamarinn dugir - og meira en það.'
# Output: 'Hamarinn dugir- og meira en það.'
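
Until this is fixed, one possible stopgap is to split the text at dashes that have whitespace on both sides, normalize each piece separately, and put the spaced dashes back. The sketch below is not part of Tokenizer and is not a proposed fix for it: `normalize_keeping_spaced_dashes` and its `normalize` parameter are hypothetical names, and it assumes the normalizer leaves dash-free text otherwise intact.

```python
import re
from typing import Callable

# A hyphen-minus or an en dash (U+2013) with whitespace on both sides.
_SPACED_DASH = re.compile(r"\s+([-\u2013])\s+")


def normalize_keeping_spaced_dashes(
    text: str, normalize: Callable[[str], str]
) -> str:
    """Apply `normalize` to the text between spaced dashes only.

    `normalize` is any str -> str function (hypothetical parameter);
    spaced dashes are re-inserted with exactly one space on each side.
    """
    # With one capture group, re.split returns text segments at even
    # indices and the captured dash characters at odd indices.
    parts = _SPACED_DASH.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            out.append(normalize(part))
        else:
            out.append(f" {part} ")
    return "".join(out)
```

Assuming the functions from the report, this could be called as `normalize_keeping_spaced_dashes(sent, correct_spaces)` or with `lambda s: detokenize(tokenize(s))`; neither call has been verified against 3.4.5. Because each piece is normalized independently, any spacing rule that depends on context across a dash is not applied.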
