Saturday, 14 February 2015

MDEX Engine 6.4 language-specific, dictionary-based linguistic analysis features

Here is the list of new language-specific features in MDEX 6.4.

Segmentation: The process of breaking up text in languages that do not separate words with whitespace into meaningful units.
Tokenization: The process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements.
Orthographic normalization: The creation of a standard indexed form for characters that carry diacritic marks.
Decompounding: The decomposition of compound word forms into their base terms.
Dynamic stemming: The process of determining the base form of a word, based on dictionary entries and language-specific rules.
Stop words: A list of words that the Endeca MDEX Engine ignores. Sample stop word lists are now provided for each supported language.
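
To make the terms above more concrete, here is a minimal Python sketch of how such steps could fit together in a generic text analysis pipeline. It is not MDEX Engine code and does not use any Endeca API: the word lists (STOP_WORDS, BASE_TERMS, STEMS) and all function names are made up for illustration, and the real engine relies on much larger per-language dictionaries and rules. Segmentation of non-whitespace languages is left out, since it needs a proper language dictionary.

```python
import re
import unicodedata

# Hypothetical toy data for illustration; not taken from any Endeca dictionary.
STOP_WORDS = {"the", "a", "of"}
BASE_TERMS = {"foot", "ball", "book", "shelf"}
STEMS = {"shelves": "shelf", "cafes": "cafe"}


def tokenize(text):
    """Tokenization: split whitespace-delimited text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def normalize(token):
    """Orthographic normalization: decompose (NFD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(token):
    """Stemming by dictionary lookup; unknown words are returned unchanged."""
    return STEMS.get(token, token)


def decompound(token):
    """Decompounding: split into two dictionary terms if such a split exists."""
    for i in range(1, len(token)):
        head, tail = token[:i], token[i:]
        if head in BASE_TERMS and tail in BASE_TERMS:
            return [head, tail]
    return [token]


def analyze(text):
    """Run the steps in order and return the list of terms to index."""
    terms = []
    for token in tokenize(text):
        token = normalize(token)
        if token in STOP_WORDS:
            continue  # stop words are ignored
        terms.extend(decompound(stem(token)))
    return terms
```

With this toy data, analyze("The cafés of football") returns ['cafe', 'foot', 'ball']: the stop words are dropped, the accent is removed, the plural is reduced to its base form, and the compound is split into its two base terms.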
