EdgyDebug: MDEX Engine 6.4 language-specific, dictionary-based linguistic analysis features

Saturday, 14 February 2015

Here is the list of new language-specific features in MDEX Engine 6.4.

Segmentation: The process of breaking text in a non-whitespace language (one that does not separate words with spaces, such as Chinese or Japanese) into meaningful units.

Tokenization: The process of breaking a stream of text into words, phrases, symbols, or other meaningful elements.

Orthographic normalization: The creation of a standard indexed form for characters that carry diacritic marks.

Decompounding: The decomposition of compound word forms into their base terms.

Dynamic stemming: The process of determining the base form of a word, based on dictionary entries and language-specific rules.

Stop words: A list of words to be ignored by the Endeca MDEX Engine. Sample stop word lists are now provided for each supported language.
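
To make a few of these terms concrete, here is a rough Python sketch of tokenization, orthographic normalization, stop word filtering and dictionary-based decompounding. This is not how the MDEX Engine implements them, and it does not use any Endeca API. The function names, the stop word list and the tiny dictionary are all made up for illustration, and the greedy longest-match decompounder is just one simple way to do it.

```python
import re
import unicodedata

# Hypothetical sample data, not taken from any Endeca stop word list or dictionary.
STOP_WORDS = {"the", "a", "of", "der", "die", "das"}
DICTIONARY = {"daten", "bank", "system"}


def tokenize(text):
    """Split lower-cased text into word tokens (whitespace languages only)."""
    return re.findall(r"\w+", text.lower())


def normalize(token):
    """Strip diacritics: decompose to NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def decompound(token, dictionary=DICTIONARY):
    """Greedy longest-match split into dictionary terms.

    Returns [token] unchanged when the whole token cannot be covered.
    """
    parts, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in dictionary:
                parts.append(token[i:j])
                i = j
                break
        else:
            # No dictionary term starts at position i, so give up on splitting.
            return [token]
    return parts


def analyze(text):
    """Tokenize, normalize, drop stop words, then decompound what is left."""
    terms = []
    for tok in tokenize(text):
        tok = normalize(tok)
        if tok in STOP_WORDS:
            continue
        terms.extend(decompound(tok))
    return terms
```

With this toy data, `analyze("Das Datenbanksystem of the caf\u00e9")` gives `['daten', 'bank', 'system', 'cafe']`. Segmentation and dynamic stemming are left out because they need real language dictionaries and rules.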
