SEUIR Repository

Sentimental analysis of comments in social media in Sinhala - English code-mixed language using supervised learning techniques

dc.contributor.author Aththanayaka, P.M.I.U.
dc.contributor.author Naleer, H.M.M.
dc.date.accessioned 2021-01-06T04:42:58Z
dc.date.available 2021-01-06T04:42:58Z
dc.date.issued 2020-10-25
dc.identifier.citation 9th Annual Science Research Sessions - 2020, p. 24. en_US
dc.identifier.isbn 928-955-627-250-5
dc.identifier.uri http://ir.lib.seu.ac.lk/handle/123456789/5194
dc.description.abstract Sinhala is a morphologically rich but low-resourced language for computer-based natural language processing. Since the introduction of the Unicode character set for Sinhala, Sinhala textual web content has grown considerably. With the rapid rise in popularity of social media in Sri Lanka, the same growth can be seen in social media content. Social media comments are frequently code-mixed in Sinhala and English and contain Singlish terms (Sinhala words written in Roman script). Therefore, performing sentiment analysis on such documents while considering only Sinhala would be inaccurate, since English or Singlish content may contribute to the sentiment of the text. To overcome this challenge, this study builds a model from social media comments that identifies the correct language of each term and performs sentiment analysis on the whole content. In this study, 500 code-mixed comments were extracted from the YouTube platform and manually labeled as Positive, Negative or Neutral. After preprocessing steps such as removal of emojis, stop words, URLs and special symbols, the comments were tokenized separately based on their character set; Roman-script tokens were placed in a separate list to identify the Singlish words. The Roman-script token list was stemmed and lemmatized using the Natural Language Toolkit (NLTK) library and compared with an English word list to recognize English words. The remaining words were considered Sinhala and transliterated to Sinhala script, and a Singlish-to-Sinhala dictionary was created. Singlish words in the comments were replaced using this dictionary. Sinhala words were stemmed based on a shallow learning method, and English words using the NLTK library.
After the preprocessing and transliteration stage, feature extraction was performed using various techniques, and supervised machine learning methods such as Random Forest, Support Vector Machine and Multinomial Naïve Bayes were used for sentiment classification. In the proposed methodology, transliteration was accurate up to 72%, and the Random Forest classifier gave the highest accuracy, 75%. en_US
dc.language.iso en_US en_US
dc.publisher Faculty of Applied Science, South Eastern University of Sri Lanka. en_US
dc.subject Sinhala Sentiment Analysis en_US
dc.subject Code-mixed comments en_US
dc.subject Social media en_US
dc.subject Transliteration en_US
dc.title Sentimental analysis of comments in social media in Sinhala - English code-mixed language using supervised learning techniques en_US
dc.type Article en_US
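
The abstract describes splitting each comment into tokens by character set and replacing Singlish tokens through a Singlish-to-Sinhala dictionary. The following is a minimal sketch of how that step might look, not the authors' implementation. The function names, the tiny English word list and the two dictionary entries are hypothetical placeholders chosen for illustration. The sketch relies on the Sinhala Unicode block (U+0D80 to U+0DFF), and it skips the NLTK stemming and lemmatization, the stop word removal and the classification stage mentioned in the abstract.

```python
import re

# URLs are stripped before tokenizing (one of the preprocessing steps named in the abstract).
URL = re.compile(r"https?://\S+|www\.\S+")

# A token is either a run of Sinhala-block characters (U+0D80..U+0DFF, optionally
# containing the zero-width joiner used in conjuncts) or a run of Roman letters.
# Emojis, digits and special symbols match neither branch, so they are dropped.
TOKEN = re.compile(
    r"(?P<si>[\u0D80-\u0DFF][\u0D80-\u0DFF\u200D]*)|(?P<ro>[A-Za-z]+)"
)


def tokenize_by_script(comment):
    """Return (script, token) pairs in their original order."""
    text = URL.sub(" ", comment)
    return [
        ("sinhala" if m.group("si") else "roman", m.group(0))
        for m in TOKEN.finditer(text)
    ]


def normalize(comment, english_words, singlish_to_sinhala):
    """Replace Singlish tokens with Sinhala script; keep Sinhala and English tokens."""
    out = []
    for script, tok in tokenize_by_script(comment):
        if script == "sinhala":
            out.append(tok)
            continue
        word = tok.lower()
        if word in english_words:
            out.append(word)  # recognized as English, kept as-is
        else:
            # Assumed to be Singlish; unknown words fall through unchanged.
            out.append(singlish_to_sinhala.get(word, word))
    return out


# Hypothetical resources, for illustration only.
ENGLISH_WORDS = {"good", "song"}
SINGLISH_TO_SINHALA = {"meka": "මේක", "lassanai": "ලස්සනයි"}

tokens = normalize(
    "Meka lassanai 😍 good song https://example.com/v?x=1",
    ENGLISH_WORDS,
    SINGLISH_TO_SINHALA,
)
```

Treating every Roman-script token that is missing from the English word list as Singlish mirrors the rule stated in the abstract. In practice, misspelled English words would be misclassified by this rule, which may be one reason transliteration accuracy is bounded.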