Abstract:
Information retrieval is one of the major tasks in natural language processing (NLP). In a
digitalized world, retrieving information from online platforms is increasingly common, and an
abundance of information on any given subject is available online. Pressed for time, readers
need to judge quickly whether a piece of information is relevant to their needs.
Automated text summarization therefore plays a key role in NLP
applications. Many studies have explored summarization for different languages
such as English, Bengali, Hausa, Chinese, and Hindi. However, summarization research for local
languages such as Sinhala is still at an early stage. Moreover, Sri Lanka is a diverse country
with many communities and languages, so some people are less fluent in Sinhala
because their mother tongue is another local language such as Tamil. Social media like Facebook
provides a platform for translating content into a different language. However, other online
platforms do not offer such translation. In such a scenario, a
short summary of an article would help readers grasp its main idea
easily. Therefore, this work aims to develop an online
platform that can provide a good summary of Sinhala-language online articles. This research
investigates extractive text summarization for Sinhala online articles using state-of-the-art
NLP algorithms in order to select the most suitable method. This work compares
the performance of the TF-IDF (Term Frequency-Inverse Document Frequency) and
TextRank algorithms for the Sinhala language. The algorithms are evaluated against
human-generated summaries from online sources using ROUGE (Recall-Oriented Understudy
for Gisting Evaluation), which measures the n-gram overlap between an automated summary
and its human-generated reference; higher ROUGE scores indicate a more accurate automated
summary of the article. The results show that the TF-IDF algorithm performs comparatively
better for summarizing Sinhala online articles of medium content size.
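
As an illustration of the two ingredients named above, the following is a minimal Python sketch of TF-IDF sentence scoring for extractive summarization and of ROUGE-N recall. It is not the implementation used in this work. The function names (`split_sentences`, `tokenize`, `tfidf_summary`, `rouge_n_recall`), the whitespace tokenizer, the punctuation-based sentence splitter, and the smoothed IDF formula are all assumptions made for the example; a real system would need Sinhala-aware preprocessing.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive split on '.', '?', '!' (assumption; not Sinhala-aware).
    return [s.strip() for s in re.split(r"[.?!]+", text) if s.strip()]


def tokenize(sentence):
    # Whitespace tokenization with lowercasing (assumption).
    return sentence.lower().split()


def tfidf_summary(text, k=2):
    """Return the k highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for toks in tokens for term in set(toks))
    scores = []
    for toks in tokens:
        tf = Counter(toks)
        # Sum of (relative term frequency) x (smoothed IDF) over the sentence.
        score = sum(
            (count / len(toks)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        )
        scores.append(score)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate."""
    def ngrams(text):
        toks = tokenize(text)
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: each reference n-gram is matched at most as often
    # as it appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

In this sketch each sentence is treated as a separate "document" when computing IDF, which is one common choice for single-article summarization; the generated summary would then be scored by passing it and a human-written summary to `rouge_n_recall`.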