Fused words in Universal Dependencies #17

dan-zeman · 2015-08-31T13:19:23Z

From ufal/lindat-corpora-conversions#3 (comment):

I think we need a better representation of fused tokens in Treex. At present it is only sketched using wild attributes, but a proper representation will probably be needed in the future, since fused tokens are part of the UD guidelines. So we need a less wild solution. Once we have it, we could implement directly in Treex the heuristics that collapse fused words whenever desirable. And once that is in place, we should probably apply it before exporting data for KonText, because the surface form matters there.

martinpopel · 2015-08-31T15:31:42Z

I agree we need a better (less wild) API for fused (aka multi-word) tokens in Treex.
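
To make the idea concrete, here is a minimal sketch of what a non-wild representation could look like. It is written in Python purely for illustration and is not the Treex API: the names `Word`, `FusedToken`, `Sentence` and `surface_forms` are all hypothetical, and the Spanish example (`del` = `de` + `el`) is just a familiar case of a fused token.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    """A syntactic word, i.e. a node in the dependency tree (hypothetical class)."""
    form: str


@dataclass
class FusedToken:
    """A surface token spanning several consecutive words (hypothetical class)."""
    form: str   # surface form, e.g. "del"
    first: int  # 0-based index of the first covered word
    last: int   # 0-based index of the last covered word (inclusive)


@dataclass
class Sentence:
    words: List[Word]
    fused: List[FusedToken] = field(default_factory=list)

    def surface_forms(self) -> List[str]:
        """Return surface tokens, replacing each fused span by its surface form."""
        starts = {f.first: f for f in self.fused}
        forms: List[str] = []
        i = 0
        while i < len(self.words):
            if i in starts:
                # Emit the fused form once and skip the words it covers.
                forms.append(starts[i].form)
                i = starts[i].last + 1
            else:
                forms.append(self.words[i].form)
                i += 1
        return forms


# Assumed example sentence: "vengo del mercado" with "del" = "de" + "el".
sentence = Sentence(
    words=[Word("vengo"), Word("de"), Word("el"), Word("mercado")],
    fused=[FusedToken("del", first=1, last=2)],
)
print(sentence.surface_forms())
```

Keeping the fused spans in a separate list leaves the word sequence (and the tree built on it) untouched, while the surface view can still be derived on demand.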

I am not sure how this would solve the problem in KonText, which can probably display either only tokens or only words. There are scripts distributed with UD (e.g. conllu-w2t.py) for converting the word-indexed CoNLL-U format to other formats.
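
For reference, a rough sketch of the words-versus-tokens distinction in CoNLL-U: a multi-word token is a line whose ID is a range such as `2-3`, followed by the lines for the words it covers. The Python function below is only an illustration (it is not `conllu-w2t.py`, and `words_and_tokens` and the sample sentence are made up); it derives both views from the same input.

```python
from typing import List, Tuple

# Assumed sample: "vengo del mercado", where "del" is the fused token 2-3.
CONLLU = (
    "1\tvengo\tvenir\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2-3\tdel\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tde\tde\tADP\t_\t_\t4\tcase\t_\t_\n"
    "3\tel\tel\tDET\t_\t_\t4\tdet\t_\t_\n"
    "4\tmercado\tmercado\tNOUN\t_\t_\t1\tobl\t_\t_\n"
)


def words_and_tokens(conllu: str) -> Tuple[List[str], List[str]]:
    """Return (syntactic words, surface tokens) for one CoNLL-U sentence."""
    words: List[str] = []
    tokens: List[str] = []
    covered_until = 0  # highest word ID covered by the last range line
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "." in tok_id:
            continue  # IDs like "8.1" are not surface material; ignored here
        if "-" in tok_id:
            # Range line: contributes a surface token, not a word.
            _, end = tok_id.split("-")
            tokens.append(form)
            covered_until = int(end)
            continue
        words.append(form)
        if int(tok_id) > covered_until:
            # Word not covered by a range, so it is also a surface token.
            tokens.append(form)
    return words, tokens


words, tokens = words_and_tokens(CONLLU)
print(words)
print(tokens)
```

A tool that can show only one of the two views would have to pick either `words` or `tokens` at export time, which is the limitation in question.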

See also
http://universaldependencies.github.io/docs/cs/overview/tokenization.html
http://universaldependencies.github.io/docs/u/overview/tokenization.html
http://universaldependencies.github.io/docs/format.html#words-and-tokens

dan-zeman added the enhancement label Aug 31, 2015

dan-zeman self-assigned this Aug 31, 2015
