Sentence Segmentation Bug in Indonesian Tokenizer with a Sentence-Ending Acronym #1424

nxgeo · 2024-09-21T11:04:28Z

Description:

I have encountered an issue with the tokenizer processor in the Stanza pipeline for Indonesian: it does not insert a sentence boundary when a sentence ends with a capitalized acronym, so that sentence and the one after it are returned as a single sentence.

Steps to Reproduce:

1. Set up the Stanza pipeline with the Indonesian tokenizer processor.
2. Input a text sequence containing a sentence that ends with an acronym, followed by another sentence:

import stanza

nlp = stanza.Pipeline("id", processors="tokenize")

doc = nlp("Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau SFTP. Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan pengguna atau penonton yang tidak berwenang.")

# Expected output is 2 (two sentences), but this prints 1
print(len(doc.sentences))

for sent in doc.sentences:
    print(sent.text)

Expected Behavior:

The output should contain two sentences:

"Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau SFTP."
"Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan pengguna atau penonton yang tidak berwenang."

Actual Behavior:

The output contains only one sentence:

"Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau SFTP. Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan pengguna atau penonton yang tidak berwenang."

Environment:

Stanza version: 1.9.2
OS: Ubuntu 22.04.3 LTS
Python version: 3.10.12

Additional Context:

This issue seems to occur because the tokenizer does not recognize the capitalized acronym (e.g., SFTP) as a valid sentence-ending token.
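
Until the tokenizer handles this case, one possible stopgap is to re-split the returned sentence text wherever an all-caps acronym plus a period is followed by a capitalized word. The sketch below is only an illustration of that idea: `resplit_after_acronym` and `_ACRONYM_BOUNDARY` are hypothetical names of mine, not part of Stanza, and the rule works on plain strings rather than on Stanza's token objects.

```python
import re

# Hypothetical helper, not part of Stanza.
# Matches the whitespace between "<two or more capitals>." and a capital letter,
# e.g. the space in "SFTP. Koneksi". The lookbehind is fixed-width (3 chars),
# which Python's re module requires.
_ACRONYM_BOUNDARY = re.compile(r"(?<=[A-Z]{2}\.)\s+(?=[A-Z])")


def resplit_after_acronym(sentences):
    """Return a new list in which each string is split at acronym boundaries."""
    result = []
    for text in sentences:
        result.extend(_ACRONYM_BOUNDARY.split(text))
    return result


text = (
    "Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau SFTP. "
    "Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan "
    "pengguna atau penonton yang tidak berwenang."
)
parts = resplit_after_acronym([text])
print(len(parts))  # prints 2
for part in parts:
    print(part)
```

With the pipeline from the reproduction script, the input would be `[sent.text for sent in doc.sentences]`. This rule is deliberately crude and will also split after capitalized abbreviations that do not end a sentence (for example "PT." before a company name), so it is a workaround to evaluate on your own data, not a fix for the underlying model behavior.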

nxgeo added the bug label Sep 21, 2024

nxgeo changed the title ~~Sentence Segmentation Bug in Indonesian Tokenizer with Acronyms~~ Sentence Segmentation Bug in Indonesian Tokenizer with a Sentence-Ending Acronym Sep 21, 2024
