This repository has been archived by the owner on May 28, 2024. It is now read-only.

GATE PR doesn't separate tokens in the expected way #1

Open
leondz opened this issue Jun 9, 2011 · 1 comment

Comments


leondz commented Jun 9, 2011

GATE's ANNIE tokeniser splits text on different boundaries from the NLTK tokeniser that TERNIP uses. As a result, many TERNIP rules can fail to match. For example,

>>> nltk.word_tokenize('Example 31/12/2010 text.')
['Example', '31/12/2010', 'text', '.']

NLTK places the dd/mm/yyyy date in a single token, whereas ANNIE gives us a SpaceToken, followed by the separate tokens '31', '/', '12', '/', '2010', and then another SpaceToken.

This should be fixed in the GATE plugin (in the preprocessing/postprocessing JAPE), so that the ANNIE tokeniser's output maps more closely onto the output of the NLTK tokeniser. It may also help to document (if this has not already been done) the tokenisation scheme that NLTK uses, which would help in other situations where the upstream tokeniser is swapped for something other than the default.
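To illustrate the kind of mapping meant here, the following is a minimal Python sketch, not the plugin's actual JAPE. It assumes the ANNIE output can be viewed as a list of (kind, text) pairs, and that tokens should be glued together only around a '/' with no SpaceToken in between. The names `ANNIE_LIKE`, `JOINERS` and `merge_tokens` are hypothetical, and a real fix would probably need more joiner characters than the slash.

```python
# Hypothetical (kind, text) pairs standing in for ANNIE annotations.
# The real plugin would operate on GATE annotations, not Python tuples.
ANNIE_LIKE = [
    ("Token", "Example"), ("SpaceToken", " "),
    ("Token", "31"), ("Token", "/"), ("Token", "12"),
    ("Token", "/"), ("Token", "2010"), ("SpaceToken", " "),
    ("Token", "text"), ("Token", "."),
]

# Assumption: only a slash glues its neighbours into one token.
JOINERS = {"/"}


def merge_tokens(annotations, joiners=JOINERS):
    """Merge tokens that touch a joiner character with no space in between."""
    merged = []
    adjacent = False  # True when the previous annotation was a non-space token
    for kind, text in annotations:
        if kind == "SpaceToken":
            # Whitespace always ends the current merged token.
            adjacent = False
            continue
        glue = bool(
            adjacent
            and merged
            and (text in joiners or merged[-1][-1] in joiners)
        )
        if glue:
            merged[-1] += text
        else:
            merged.append(text)
        adjacent = True
    return merged


print(merge_tokens(ANNIE_LIKE))
```

With the sample input above this yields the same four tokens as the NLTK call, while leaving the final '.' as its own token because a full stop is not treated as a joiner.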

@cnorthwood
Owner

NLTK's tokeniser is apparently based on the Penn Treebank rules: http://www.cis.upenn.edu/~treebank/tokenization.html
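As a rough illustration of what those rules imply, here is a simplified sketch of a few Penn Treebank-style conventions: most punctuation is split from adjoining words, and contractions such as "don't" and "they'll" are split into two tokens. This is not NLTK's implementation, and the function name `treebank_like_tokenize` is made up for the example; it covers only a small subset of the rules and ignores quotes, abbreviations and other special cases.

```python
import re


def treebank_like_tokenize(sentence):
    """Apply a small, simplified subset of Penn Treebank-style rules."""
    text = sentence
    # Split off commas, except those directly followed by a digit (e.g. 1,000).
    text = re.sub(r",(?!\d)", " , ", text)
    # Split off semicolons, question/exclamation marks and brackets.
    text = re.sub(r"([;?!()\[\]])", r" \1 ", text)
    # Split only a sentence-final full stop from the preceding word.
    text = re.sub(r"\.\s*$", " .", text)
    # Split contractions: "don't" -> "do n't", "they'll" -> "they 'll".
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    text = re.sub(r"(\w)('(?:s|m|d|ll|re|ve))\b", r"\1 \2", text,
                  flags=re.IGNORECASE)
    # Slashes are never split, so a dd/mm/yyyy date stays in one token.
    return text.split()


print(treebank_like_tokenize("Example 31/12/2010 text."))
print(treebank_like_tokenize("They don't know, they'll ask."))
```

Because none of these rules touch '/', the date in the earlier example remains a single token, which is the behaviour the TERNIP rules rely on.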
