GATE PR doesn't separate tokens in the expected way #1

leondz · 2011-06-09T23:11:16Z

GATE's ANNIE tokeniser splits text on different boundaries from the NLTK tokeniser that TERNIP uses, which can stop many TERNIP rules from matching. For example,

nltk.word_tokenize('Example 31/12/2010 text.')
['Example', '31/12/2010', 'text', '.']

This puts the dd/mm/yyyy date into a single token, whereas ANNIE gives us a SpaceToken, followed by separate tokens for '31', '/', '12', '/' and '2010', and then another SpaceToken.

This should be fixed in the GATE plugin (in the preprocessing/postprocessing JAPE), so that the ANNIE Tokeniser's output maps more closely onto the NLTK tokeniser's output. It may also help to document (if not already done) the NLTK tokenisation scheme that TERNIP expects, for other situations where the upstream tokeniser is switched out from the default.
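
As a rough illustration of the mapping (written in Python rather than JAPE, and not the plugin's actual code), something like the following would re-join the date. It assumes the ANNIE output can be modelled as (annotation type, text) pairs; `merge_annie_tokens` and `DATE_RE` are hypothetical names, and only the dd/mm/yyyy case from the example above is handled.

```python
import re

# Assumption: ANNIE output is modelled as (annotation_type, text) pairs.
# The real plugin works on GATE annotations, not Python tuples.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")


def merge_annie_tokens(annotations):
    """Drop SpaceTokens and re-join date-like runs into a single token."""
    merged = []  # tokens in the NLTK-like shape
    run = []     # Token texts seen since the last SpaceToken

    def flush():
        if not run:
            return
        joined = "".join(run)
        if DATE_RE.fullmatch(joined):
            # The whole run looks like dd/mm/yyyy: emit it as one token.
            merged.append(joined)
        else:
            # Otherwise keep ANNIE's split (e.g. 'text' and '.').
            merged.extend(run)
        run.clear()

    for kind, text in annotations:
        if kind == "SpaceToken":
            flush()
        else:
            run.append(text)
    flush()
    return merged


annie = [
    ("Token", "Example"), ("SpaceToken", " "),
    ("Token", "31"), ("Token", "/"), ("Token", "12"),
    ("Token", "/"), ("Token", "2010"), ("SpaceToken", " "),
    ("Token", "text"), ("Token", "."),
]
print(merge_annie_tokens(annie))
# ['Example', '31/12/2010', 'text', '.']
```

Only runs that match the date pattern in full are merged, so trailing punctuation such as the final '.' stays a separate token, as in the NLTK output above. Other cases where the two tokenisers disagree would need their own patterns.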

cnorthwood · 2011-06-10T16:03:47Z

NLTK's tokeniser is apparently based on the Penn Treebank rules: http://www.cis.upenn.edu/~treebank/tokenization.html
