I have encountered an issue with the Stanza pipeline for the Indonesian language, specifically with the tokenizer processor. The pipeline fails to handle sentence segmentation properly when a sentence ends with a capitalized acronym.
Steps to Reproduce:
Set up the Stanza pipeline with the Indonesian tokenizer processor.
Input a text sequence containing a sentence ending with an acronym followed by another sentence.
import stanza

nlp = stanza.Pipeline("id", processors="tokenize")
doc = nlp("Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau SFTP. Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan pengguna atau penonton yang tidak berwenang.")

# The following line should return 2, but it returns 1
print(len(doc.sentences))
for sent in doc.sentences:
    print(sent.text)
Expected Behavior:
The output should contain two sentences:
"Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau SFTP."
"Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan pengguna atau penonton yang tidak berwenang."
Actual Behavior:
The output contains only one sentence:
"Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau SFTP. Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan pengguna atau penonton yang tidak berwenang."
Environment:
Stanza version: 1.9.2
OS: Ubuntu 22.04.3 LTS
Python version: 3.10.12
Additional Context:
This issue seems to occur because the tokenizer does not recognize the capitalized acronym (e.g., SFTP) as a valid sentence-ending token.
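As a possible stopgap while the tokenizer behaves this way, the text can be pre-split at the suspected boundary before it reaches the pipeline. The sketch below is only a heuristic of my own, not part of Stanza: the helper names `split_after_acronym_period` and `segment` are hypothetical, and the rule (an all-caps token of two or more letters, a period, whitespace, then an uppercase letter) is an assumption that will over-split abbreviations such as "PT. Telkom". The only Stanza calls used are the ones already shown in the reproduction (`nlp(text)`, `doc.sentences`, `sent.text`).

```python
import re

# Heuristic (an assumption, not Stanza behavior): an all-caps token of two or
# more letters followed by a period, whitespace, and an uppercase letter is
# treated as a sentence boundary.
_ACRONYM_BOUNDARY = re.compile(r"(\b[A-Z]{2,}\.)\s+(?=[A-Z])")


def split_after_acronym_period(text):
    """Split text into chunks at 'ACRONYM. Capital' boundaries."""
    # Keep the acronym and its period, replace the whitespace with a blank line.
    marked = _ACRONYM_BOUNDARY.sub(lambda m: m.group(1) + "\n\n", text)
    return [chunk.strip() for chunk in marked.split("\n\n") if chunk.strip()]


def segment(text, nlp):
    """Run an existing Stanza pipeline on each chunk and collect sentence texts."""
    sentences = []
    for chunk in split_after_acronym_period(text):
        doc = nlp(chunk)  # the tokenizer still splits sentences inside a chunk
        sentences.extend(sent.text for sent in doc.sentences)
    return sentences
```

With the pipeline from the reproduction, `segment(text, nlp)` passes the two halves of the example to the tokenizer separately, so the missed boundary after "SFTP." no longer depends on the model. I have not checked how this interacts with other Indonesian abbreviations, so treat it as a workaround rather than a fix.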
nxgeo changed the title from "Sentence Segmentation Bug in Indonesian Tokenizer with Acronyms" to "Sentence Segmentation Bug in Indonesian Tokenizer with a Sentence-Ending Acronym" on Sep 21, 2024