Please use this identifier to cite or link to this item: http://ir.lib.seu.ac.lk/handle/123456789/5994
Title: Towards stop words identification in Tamil text clustering
Authors: Faathima Fayaza, M. S.; Fathima Farhath, F.
Keywords: Stopwords; Tamil; Pre-processing; TF-IDF; Clustering
Issue Date: 2021
Publisher: The Science and Information Organization
Citation: International Journal of Advanced Computer Science and Applications, Vol. 12, No. 12, 2021; p. 524-529.
Abstract: Nowadays, digital documents have become the primary source of information. Natural language processing is therefore widely used in information retrieval, topic modeling, document classification, and document clustering. Preprocessing plays a significant role in all of these applications, and one of its critical steps is removing stopwords. Many languages have defined stopword lists. However, no publicly available stopword list exists for Tamil, since it is an under-resourced language. This study identified 93 general stopwords and some domain-specific stopwords for sports, entertainment, local news, and foreign news by analyzing more than 1.7 million Tamil documents containing more than 21 million words. The study also shows that removing stopwords improves the accuracy of a Tamil document clustering system: the F-score improved by 2.4% for TF-IDF with the one-pass algorithm and by 0.95% for FastText with the one-pass algorithm.
URI: http://ir.lib.seu.ac.lk/handle/123456789/5994
ISSN: 2156-5570
Appears in Collections: Research Articles

Files in This Item:
File: Paper_67-Towards_Stopwords_Identification.pdf
Size: 626.62 kB
Format: Adobe PDF

