Sentence Segmentation Bug in Indonesian Tokenizer with a Sentence-Ending Acronym #1424

Open
nxgeo opened this issue Sep 21, 2024 · 0 comments
nxgeo commented Sep 21, 2024

Description:

I have encountered an issue with the Stanza pipeline for the Indonesian language, specifically with the tokenizer processor. When a sentence ends with a capitalized acronym, the pipeline does not split it from the sentence that follows; both are returned as a single sentence.

Steps to Reproduce:

  1. Set up the Stanza pipeline with the Indonesian tokenizer processor.
  2. Input a text sequence containing a sentence ending with an acronym followed by another sentence:

import stanza

nlp = stanza.Pipeline("id", processors="tokenize")

doc = nlp("Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau SFTP. Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan pengguna atau penonton yang tidak berwenang.")

# Expected: 2 sentences. Actual: this prints 1.
print(len(doc.sentences))

for sent in doc.sentences:
    print(sent.text)

Expected Behavior:

The output should contain two sentences:

  1. "Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau SFTP."
  2. "Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan pengguna atau penonton yang tidak berwenang."

Actual Behavior:

The output contains only one sentence:

  • "Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau SFTP. Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan pengguna atau penonton yang tidak berwenang."
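Until the segmentation itself is fixed, one possible workaround is to pre-split the text and tell Stanza not to do its own sentence splitting. The sketch below is only an illustration under stated assumptions: `naive_split` and `segment_with_stanza` are hypothetical helper names, and the regex is a deliberately naive rule (sentence-final punctuation, whitespace, then an uppercase letter) that will wrongly split after abbreviations such as titles. It relies on Stanza's `tokenize_no_ssplit=True` option, which treats blank lines (`\n\n`) as the sentence boundaries.

```python
import re

TEXT = (
    "Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau SFTP. "
    "Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan "
    "pengguna atau penonton yang tidak berwenang."
)

# Naive boundary (an assumption, not Stanza's rule): '.', '!' or '?',
# then whitespace, then an ASCII uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split(text):
    """Split text into candidate sentences using the naive boundary rule."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


def segment_with_stanza(text):
    """Tokenize pre-split sentences with Stanza's own splitting disabled."""
    # Imported lazily so naive_split can be used without Stanza installed.
    import stanza

    nlp = stanza.Pipeline("id", processors="tokenize", tokenize_no_ssplit=True)
    # With tokenize_no_ssplit=True, sentences are separated by blank lines.
    doc = nlp("\n\n".join(naive_split(text)))
    return [sent.text for sent in doc.sentences]


if __name__ == "__main__":
    for sentence in naive_split(TEXT):
        print(sentence)
```

This trades Stanza's learned segmentation for a fixed rule, so it is only suitable for text where that rule is known to hold.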

Environment:

  • Stanza version: 1.9.2
  • OS: Ubuntu 22.04.3 LTS
  • Python version: 3.10.12

Additional Context:

This issue seems to occur because the tokenizer does not treat the period after a capitalized acronym (e.g., SFTP) as a sentence boundary.
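To narrow this hypothesis down, the same text can be run with different final words before the period. The following is a minimal sketch, not a confirmed diagnosis: `build_variants` and `run_probe` are hypothetical helper names, and the replacement words are arbitrary choices for illustration. No results are claimed here; the sentence counts have to be observed by running it.

```python
FIRST = "Koneksi harus menggunakan cara yang aman: VPN, HTTPS, atau {last}."
SECOND = (
    "Koneksi yang aman harus mencegah informasi pasien jatuh ke tangan "
    "pengguna atau penonton yang tidak berwenang."
)

# Hypothetical probe words: the original acronym, a lowercased and a
# capitalized form of it, and a shorter all-caps acronym.
LAST_WORDS = ["SFTP", "sftp", "Sftp", "FTP"]


def build_variants(last_words=LAST_WORDS):
    """Return {last_word: text}, varying only the word before the period."""
    return {word: FIRST.format(last=word) + " " + SECOND for word in last_words}


def run_probe():
    """Print how many sentences Stanza finds for each variant."""
    # Imported lazily so build_variants can be used without Stanza installed.
    import stanza

    nlp = stanza.Pipeline("id", processors="tokenize")
    for word, text in build_variants().items():
        doc = nlp(text)
        print(f"{word!r}: {len(doc.sentences)} sentence(s)")


if __name__ == "__main__":
    run_probe()
```

If only the all-caps variants are merged into one sentence, that would support the explanation above; if every variant is merged, the cause is more likely something else in the sentence.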

nxgeo added the bug label Sep 21, 2024
nxgeo changed the title from "Sentence Segmentation Bug in Indonesian Tokenizer with Acronyms" to "Sentence Segmentation Bug in Indonesian Tokenizer with a Sentence-Ending Acronym" Sep 21, 2024