module__org.bibliome.alvisnlp.modules.segmig.WoSMig

Robert Bossy edited this page Jul 27, 2017 · 1 revision

org.bibliome.alvisnlp.modules.segmig.WoSMig

Synopsis

Performs word segmentation on section contents.

Description

org.bibliome.alvisnlp.modules.segmig.WoSMig searches for word boundaries in the section contents, creates an annotation for each word and adds it to the layer targetLayerName. The following are considered as word boundaries:

  • consecutive whitespace characters, including ' ', newline, carriage return and horizontal tabulation;
  • the positions before and after each punctuation character defined in punctuation and balancedPunctuations. A punctuation character therefore always forms a single-character word; a balanced punctuation character breaks a word if and only if its corresponding punctuation character is found.
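
The two boundary rules above can be sketched as follows. This is a minimal illustration, not the module's implementation: the function name `segment` is hypothetical, the punctuation string is the default listed under Parameters, and balanced punctuation is deliberately left out.

```python
# Hypothetical sketch: whitespace and punctuation boundaries only.
# Balanced punctuation handling is intentionally omitted.
WHITESPACE = " \n\r\t"
PUNCTUATION = "?.!;,:-"  # default value listed under Parameters


def segment(text, punctuation=PUNCTUATION):
    """Return (start, end, type) triples for each word found in text."""
    words = []
    start = None  # start offset of the word being built, if any
    for i, ch in enumerate(text):
        if ch in WHITESPACE or ch in punctuation:
            # Both whitespace and punctuation close the current word.
            if start is not None:
                words.append((start, i, "word"))
                start = None
            # A punctuation character is a single-character word.
            if ch in punctuation:
                words.append((i, i + 1, "punctuation"))
        elif start is None:
            start = i
    if start is not None:
        words.append((start, len(text), "word"))
    return words
```

For instance, `segment("Hello, world!")` yields a word, a punctuation, a word and a punctuation, in that order.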

If fixedFormLayerName is defined, then non-overlapping annotations in this layer are added as is to targetLayerName. The start and end positions of these annotations are treated as word boundaries, and no word boundary is searched for inside them.
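
The effect of fixed forms can be sketched as below, assuming the fixed-form annotations are given as non-overlapping (start, end) offsets. The function name and the regular expression are hypothetical stand-ins for the module's own logic; the type values are the defaults listed under Parameters.

```python
import re

# Hypothetical tokenizer: one default punctuation character, or a run of
# characters that are neither whitespace nor punctuation.
TOKEN = re.compile(r"(?P<punctuation>[?.!;,:\-])|(?P<word>[^\s?.!;,:\-]+)")


def segment_with_fixed_forms(text, fixed_forms):
    """Segment text, copying each fixed form as a single unsplit word.

    fixed_forms is assumed to hold non-overlapping (start, end) offsets.
    """
    result = []
    pos = 0
    for start, end in sorted(fixed_forms):
        # Search for boundaries only in the gap before the fixed form.
        for m in TOKEN.finditer(text, pos, start):
            result.append((m.start(), m.end(), m.lastgroup))
        # The fixed form is copied as is; nothing is searched inside it.
        result.append((start, end, "fixed"))
        pos = end
    # Segment whatever follows the last fixed form.
    for m in TOKEN.finditer(text, pos):
        result.append((m.start(), m.end(), m.lastgroup))
    return result
```

With the text `See E. coli K-12 now.` and fixed forms covering `E. coli` and `K-12`, neither the period nor the hyphen inside those spans produces a separate punctuation word.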

The created annotations have the feature annotationTypeFeature with a value corresponding to the word type:

  • punctuation: if the word is a single-character punctuation;

  • word: if the word is a plain non-punctuation word.

The eosStatusFeature feature contains the end-of-sentence status of the word:

  • not-eos: if the word cannot be an end of sentence;

  • maybe-eos: if the word may be an end of sentence;

  • eos: if the word is definitely an end of sentence.

Parameters

Optional

Type: Mapping

Constant features to add to each annotation created by this module

Optional

Type: String

Name of the layer containing annotations that should not be split into several words.

Default value: length

Type: AnnotationComparator

Comparator to use when removing overlapping fixed form annotations.

Default value: wordType

Type: String

Name of the feature in which to store the word type (word, punctuation, etc.).

Default value: ()[]{}""

Type: String

Balanced punctuation characters. In this value, each opening character must be immediately followed by its corresponding closing character. If the value has an odd length, a warning is issued and the last character is ignored.
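
The pairing convention described above can be sketched as follows. The function name `parse_balanced` is hypothetical, and the warning mechanism is Python's own, used here only to mirror the documented behavior.

```python
import warnings


def parse_balanced(spec='()[]{}""'):
    """Map each opening character to its closing character.

    Characters are read two at a time: opening, then closing.
    """
    if len(spec) % 2 != 0:
        # Odd length: warn and ignore the last, unpaired character.
        warnings.warn("odd-length value, ignoring last character: %r" % spec[-1])
        spec = spec[:-1]
    return {spec[i]: spec[i + 1] for i in range(0, len(spec), 2)}
```

With the default value, the double quote is paired with itself, since it serves as both the opening and the closing character.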

Default value: true

Type: Expression

Only process documents that satisfy this filter.

Default value: fixed

Type: String

Value of the type feature for annotations copied from fixed forms.

Default value: punctuation

Type: String

Value of the type feature for punctuation annotations.

Default value: ?.!;,:-

Type: String

List of punctuation characters, whether weak or strong.

Default value: true

Type: Expression

Process only sections that satisfy this filter.

Default value: words

Type: String

Layer where to store word annotations.

Default value: word

Type: String

Value of the type feature for regular word annotations.