Sentimental analysis of comments in social media in Sinhala - English code-mixed language using supervised learning techniques

Aththanayaka, P.M.I.U.; Naleer, H M.M.

dc.contributor.author	Aththanayaka, P.M.I.U.
dc.contributor.author	Naleer, H M.M.
dc.date.accessioned	2021-01-06T04:42:58Z
dc.date.available	2021-01-06T04:42:58Z
dc.date.issued	2020-10-25
dc.identifier.citation	9th Annual Science Research Sessions - 2020, p. 24.	en_US
dc.identifier.isbn	928-955-627-250-5
dc.identifier.uri	http://ir.lib.seu.ac.lk/handle/123456789/5194
dc.description.abstract	Sinhala is a morphologically rich but low-resourced language for computer-based natural language processing. Since the introduction of the Unicode character set for Sinhala, Sinhala textual content on the web has grown considerably. With the rapid rise in popularity of social media in Sri Lanka, the same growth can be seen in social media content. Social media comments are frequently code-mixed in Sinhala and English and contain Singlish terms (Sinhala words written in Roman script). Therefore, performing sentiment analysis on such a document while considering only Sinhala would be inaccurate, since English or Singlish content may also contribute to its sentiment. To overcome this challenge, this study builds a model from social media comments that identifies the correct language of each term and performs sentiment analysis on the whole content. In this study, 500 code-mixed comments were extracted from the YouTube platform and manually labeled as Positive, Negative or Neutral. After preprocessing steps such as removal of emojis, stop words, URLs and special symbols, comments were tokenized separately based on their character set, with Roman-script tokens placed in a separate list to identify the Singlish words. The Roman-script token list was stemmed and lemmatized using the Natural Language Toolkit (NLTK) library and compared with an English word list to recognize English words; the remaining words were treated as Sinhala written in Roman script and transliterated into Sinhala script, and a Singlish-to-Sinhala dictionary was created. Singlish words in the comments were then replaced using this dictionary. Sinhala words were stemmed using a shallow learning method, and English words using the NLTK library.
After the preprocessing and transliteration stage, feature extraction was performed using various techniques, and supervised machine learning methods such as Random Forest, Support Vector Machine and Multinomial Naïve Bayes were used to classify sentiment. In the proposed methodology, transliteration was accurate up to 72%, and the Random Forest classifier gave the highest accuracy, 75%.	en_US
dc.language.iso	en_US	en_US
dc.publisher	Faculty of Applied Science, South Eastern University of Sri Lanka.	en_US
dc.subject	Sinhala Sentiment Analysis	en_US
dc.subject	Code-mixed comments	en_US
dc.subject	Social media	en_US
dc.subject	Transliteration	en_US
dc.title	Sentimental analysis of comments in social media in Sinhala - English code-mixed language using supervised learning techniques	en_US
dc.type	Article	en_US
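
The script-based tokenization and Singlish replacement described in the abstract might look roughly like the sketch below. It is an illustration only, not the authors' implementation: the names `tag_tokens` and `replace_singlish`, the tiny English word list and the two-entry Singlish-to-Sinhala dictionary are all hypothetical, and the NLTK stemming and lemmatization step is omitted so that the example needs only the Python standard library. It assumes Sinhala tokens can be recognized by the Sinhala Unicode block (U+0D80 to U+0DFF).

```python
import re

# Strip URLs first; emojis, digits and symbols are dropped implicitly
# because the token pattern below only matches letters.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

# A token is either a run of Roman letters or a run of Sinhala characters
# (Sinhala Unicode block, optionally with zero-width joiners inside).
TOKEN_RE = re.compile(r"[A-Za-z]+|[\u0D80-\u0DFF][\u0D80-\u0DFF\u200D]*")

# Hypothetical toy resources, for illustration only.
ENGLISH_WORDS = {"this", "song", "is", "nice", "video"}
SINGLISH_TO_SINHALA = {"hari": "හරි", "lassanai": "ලස්සනයි"}


def tag_tokens(comment, english_words=ENGLISH_WORDS):
    """Tokenize a comment and tag each token as 'si', 'en' or 'singlish'."""
    text = URL_RE.sub(" ", comment)
    tagged = []
    for tok in TOKEN_RE.findall(text):
        if "\u0D80" <= tok[0] <= "\u0DFF":
            tagged.append((tok, "si"))
        elif tok.lower() in english_words:
            tagged.append((tok.lower(), "en"))
        else:
            # Roman-script word not found in the English list:
            # assume it is Sinhala written in Roman script.
            tagged.append((tok.lower(), "singlish"))
    return tagged


def replace_singlish(tagged, mapping=SINGLISH_TO_SINHALA):
    """Replace Singlish tokens with Sinhala script where the dictionary has an entry."""
    return [
        mapping.get(tok, tok) if lang == "singlish" else tok
        for tok, lang in tagged
    ]


if __name__ == "__main__":
    sample = "This song hari lassanai 😍 https://youtu.be/abc ගීතය"
    print(replace_singlish(tag_tokens(sample)))
```

In practice the English word list would be much larger, and Roman-script tokens would be stemmed or lemmatized before the lookup, as the abstract describes. Singlish words missing from the dictionary are left unchanged here, which is one possible fallback among several.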