Abstract:
Nowadays, digital documents have become the
primary source of information. Therefore, natural language
processing is widely utilized in information retrieval, topic
modeling, document classification, and document clustering.
Preprocessing plays a significant role in all of these applications.
One of the critical steps in preprocessing is removing stopwords.
Many languages have well-established stopword lists. However, no
stopword list is publicly available for Tamil, because it is an
under-resourced language. This study identified 93
general and some domain-specific stopwords for sports,
entertainment, local and foreign news by analyzing more than 1.7
million Tamil documents with more than 21 million words. The
study also shows that removing stopwords improves the accuracy
of a Tamil document clustering system: the F-score improved by
2.4% and 0.95% for TF-IDF with the one-pass algorithm and
FastText with the one-pass algorithm, respectively.
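
As a minimal sketch of the stopword-removal step described above, the following Python snippet filters tokens against a stopword set. The four words used here are common Tamil function words chosen only for illustration; they are an assumption and are not taken from the 93-word list identified in this study. The function name and the whitespace tokenization are likewise hypothetical simplifications.

```python
# Illustrative placeholder set; NOT the stopword list identified in the study.
ILLUSTRATIVE_STOPWORDS = frozenset({"ஒரு", "என்று", "மற்றும்", "இந்த"})


def remove_stopwords(text, stopwords=ILLUSTRATIVE_STOPWORDS):
    """Split text on whitespace and drop tokens found in the stopword set."""
    return [token for token in text.split() if token not in stopwords]


# Usage: the surviving tokens would then be passed on to a vectorizer
# such as TF-IDF or FastText before clustering.
tokens = remove_stopwords("இந்த அணி ஒரு போட்டி")
print(tokens)
```

A real pipeline would likely need a Tamil-aware tokenizer and punctuation handling, which this sketch omits.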