Here is the list of new language-specific features in MDEX 6.4:
Segmentation: The process of breaking text into meaningful units in languages that do not separate words with whitespace.
Tokenization: The process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements.
Orthographic normalization: The creation of a standard indexed form for diacritic marks.
Decompounding: The decomposition of compound word forms into their base terms.
Dynamic stemming: The process of determining the base form of a word, based on dictionary entries and language-specific rules.
Stop words: A list of words to be ignored by the Endeca MDEX Engine. Sample stop word lists are now provided for each supported language.
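
To make a few of these terms concrete, here is a minimal Python sketch of tokenization, orthographic normalization, and stop word removal chained together. It is not how the MDEX Engine implements these features, and it does not use any Endeca API. The function names (`tokenize`, `normalize`, `analyze`) and the tiny `STOP_WORDS` set are hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Hypothetical sample stop word list; NOT the list shipped with the MDEX Engine.
STOP_WORDS = {"the", "a", "of", "in"}


def tokenize(text):
    """Tokenization: split text into word tokens on non-word characters."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Orthographic normalization: lowercase and strip diacritic marks."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", token.lower())
    # Drop the combining marks and keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Tokenize, normalize each token, then drop stop words."""
    tokens = (normalize(t) for t in tokenize(text))
    return [t for t in tokens if t not in STOP_WORDS]
```

With this sketch, a query for "cafe" and a record containing "café" would reduce to the same indexed form, which is the point of normalizing diacritics. Segmentation, decompounding, and dynamic stemming are left out because they depend on language-specific dictionaries and rules.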