A fast corpus-based stemmer



Date: 04.01.2022
Summary

A fast corpus-based stemmer

To assess how stemming affects the performance of an IR model, Paik and Parui (2011) proposed an unsupervised corpus-based stemmer for four languages, including Marathi. The stemmer first collects all unique words after removing numbers and stopwords. Suffixes are then collected according to their frequency in the lexicon, and a class is generated for each suffix. The stemmer is compared against no stemming, YASS (Yet Another Suffix Stripper), Oard, and n-gram (n = 4) baselines. For the retrieval experiments, the authors used articles from the Maharashtra Times and Sakal newspapers, comprising 99,362 documents with 854,324 distinct words for Marathi. In the IR evaluation, the 4-gram approach achieved a better mean average precision (MAP) than all other methods, including the proposed stemmer.
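The suffix-collection step can be illustrated with a minimal sketch. It only counts how many distinct words in a lexicon end with each candidate suffix. The function name, the length limits `max_suffix_len` and `min_stem_len`, and the toy English lexicon are assumptions made for illustration and are not taken from the paper, which also goes on to build suffix classes.

```python
from collections import Counter


def suffix_frequencies(lexicon, max_suffix_len=4, min_stem_len=3):
    """Count how many distinct words end with each candidate suffix.

    max_suffix_len and min_stem_len are illustrative thresholds,
    not values reported by the authors.
    """
    counts = Counter()
    for word in set(lexicon):
        # Only consider suffixes that leave at least min_stem_len characters.
        longest = min(max_suffix_len, len(word) - min_stem_len)
        for k in range(1, longest + 1):
            counts[word[-k:]] += 1
    return counts


# Toy lexicon (hypothetical); numbers and stopwords are assumed to be removed already.
lexicon = ["walking", "talking", "walked", "talked", "walks", "talks"]
freq = suffix_frequencies(lexicon)
for suffix, count in freq.most_common(5):
    print(suffix, count)
```

Frequent suffixes such as "ing", "ed", and "s" surface at the top of the count, which is the kind of evidence a frequency-based stemmer can use when deciding what to strip.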

A Novel Corpus-Based Stemming Algorithm using Co-occurrence Statistics

Paik et al. (2011) proposed the first co-occurrence-based stemmer, which identifies pairs of word variants based on the strength of the edges in a graph. A partitioning algorithm is then applied to the created graph to group together the words that have the strongest neighbors. The stemmer employs a new co-occurrence strength statistic computed from the corpus for the grouped words. It is compared against no stemming, a rule-based stemmer, XU, and YASS. For the retrieval experiments, the authors used articles from the FIRE 2008 and 2010 test collections, comprising 99,362 documents for Marathi. In the IR evaluation, the proposed stemmer achieved better MAP.
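The general idea of grouping variants through a co-occurrence graph can be sketched as follows. This is a simplified stand-in and not the authors' algorithm: edge strength here is just the number of documents in which two words appear together, candidate variants are assumed to share a prefix of `min_prefix` characters, and each word is merged with its single strongest neighbor using union-find. All names and thresholds are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_strength(docs):
    """For each unordered word pair, count the documents containing both words."""
    strength = defaultdict(int)
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            strength[(a, b)] += 1
    return strength


def group_by_strongest_neighbour(docs, min_prefix=3):
    """Group words by linking each one to its strongest co-occurring variant."""
    neighbours = defaultdict(dict)
    for (a, b), s in cooccurrence_strength(docs).items():
        # Assumption: only words sharing a prefix are candidate variants.
        if a[:min_prefix] == b[:min_prefix]:
            neighbours[a][b] = s
            neighbours[b][a] = s

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for word, nbrs in neighbours.items():
        # Ties are broken alphabetically so the result is deterministic.
        best = max(sorted(nbrs), key=nbrs.get)
        parent[find(word)] = find(best)

    groups = defaultdict(set)
    for word in neighbours:
        groups[find(word)].add(word)
    return sorted(sorted(g) for g in groups.values())


# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["walk", "walking", "talk"],
    ["walked", "walking", "walk"],
    ["talk", "talked", "talking"],
    ["talks", "talk", "walk"],
]
groups = group_by_strongest_neighbour(docs)
print(groups)
```

Each resulting group plays the role of one equivalence class of word variants, and a stemmer would map every word in a group to a common representative.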

GRAS: An effective and efficient stemming algorithm for information retrieval

Paik et al. (2011) proposed GRAS, a graph-based stemmer, for seven languages, including Marathi. The stemmer first collects all unique words after removing numbers and stopwords. Pairs of co-occurring suffixes are then identified together with their relative frequency in the lexicon, and GRAS generates a class for each suffix pair. The stemmer is compared against no stemming, a rule-based stemmer, YASS, Oard, and Linguistica. For the retrieval experiments, the authors used articles from the Maharashtra Times and Sakal newspapers, comprising 99,362 documents with 854,324 distinct words for Marathi. In the IR evaluation, the proposed GRAS stemmer achieved better MAP.
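Identifying co-occurring suffix pairs can be sketched as below: for every two words that share a sufficiently long common prefix, the two remainders form a suffix pair, and the pairs are counted across the lexicon. The prefix threshold `min_prefix`, the function names, and the toy lexicon are assumptions for illustration. The quadratic loop over word pairs is only practical for a small example, and the later class-generation step of GRAS is not covered.

```python
from collections import Counter
from itertools import combinations


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def suffix_pair_frequencies(lexicon, min_prefix=3):
    """Count suffix pairs induced by word pairs sharing a long enough prefix."""
    counts = Counter()
    for a, b in combinations(sorted(set(lexicon)), 2):
        p = common_prefix_len(a, b)
        if p >= min_prefix:
            # Sort the two suffixes so (s1, s2) and (s2, s1) count as one pair.
            counts[tuple(sorted((a[p:], b[p:])))] += 1
    return counts


# Toy lexicon (hypothetical); the empty string stands for "no suffix".
lexicon = ["walk", "walked", "walking", "talk", "talked", "talking", "jump", "jumped"]
pairs = suffix_pair_frequencies(lexicon)
for pair, count in pairs.most_common():
    print(pair, count)
```

Suffix pairs that recur across many word pairs are more likely to reflect genuine morphological alternations than pairs seen only once.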



Information retrieval with Hindi, Bengali, and Marathi languages: Evaluation and analysis

Savoy et al. (2013) assessed the retrieval effectiveness of various indexing and searching strategies for three Indian languages, including Marathi. They proposed a new aggressive stemmer and compared it with an n-gram approach (n = 4) and a light stemmer. For the retrieval experiments, the authors used articles from the FIRE 2008 test collection, comprising 99,357 documents with 511,550 distinct terms for Marathi. In the IR evaluation, the proposed stemmer achieved better MAP than all the other stemmers.
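Since every comparison above is reported in terms of MAP, a short sketch of the standard definition may help. Average precision for one query is the mean of the precision values at the ranks where relevant documents appear, divided by the total number of relevant documents, and MAP is the mean over queries. The function names and the toy rankings are illustrative assumptions, each ranking is assumed to contain no duplicate document IDs, and this is not the evaluation tooling used in the papers.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against a set of relevant IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy results for two queries (hypothetical document IDs).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```

A stemmer that merges more true variants tends to pull relevant documents higher in the ranking, which raises the per-query precision values and therefore MAP.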