dc.contributor.author |
Aththanayaka, P.M.I.U. |
|
dc.contributor.author |
Naleer, H.M.M. |
|
dc.date.accessioned |
2021-01-06T04:42:58Z |
|
dc.date.available |
2021-01-06T04:42:58Z |
|
dc.date.issued |
2020-10-25 |
|
dc.identifier.citation |
9th Annual Science Research Sessions - 2020, p. 24. |
en_US |
dc.identifier.isbn |
928-955-627-250-5 |
|
dc.identifier.uri |
http://ir.lib.seu.ac.lk/handle/123456789/5194 |
|
dc.description.abstract |
Sinhala is a morphologically rich but low-resource language for computer-based natural language processing. Since the introduction of the Unicode character set for Sinhala, Sinhala textual web content has grown considerably. With the rapid rise in popularity of social media in Sri Lanka, the same growth can be seen in social media content. Social media comments are frequently code-mixed in Sinhala and English and contain Singlish terms (Sinhala words written in Roman script). Therefore, performing sentiment analysis on such documents considering only Sinhala would be inaccurate, since English or Singlish content may contribute to the sentiment of the text. To overcome this challenge, this study builds a model from social media comments that identifies the language of each term and performs sentiment analysis on the whole content. In this study, 500 code-mixed comments were extracted from YouTube and manually labeled as Positive, Negative, or Neutral. After preprocessing steps such as removal of emojis, stop words, URLs, and special symbols, comments were tokenized separately based on their character set; Roman-script tokens were collected into a separate list to identify the Singlish words. The Roman-script token list was stemmed and lemmatized using the Natural Language Toolkit (NLTK) library and compared with an English word list to recognize English words. The remaining words were considered Sinhala and transliterated into Sinhala script, and a Singlish-to-Sinhala dictionary was created. Singlish words in the comments were replaced using this dictionary. Sinhala words were stemmed using a shallow learning method and English words using the NLTK library. After the preprocessing and transliteration stage, feature extraction was performed using various techniques, and supervised machine learning methods such as Random Forest, Support Vector Machine, and Multinomial Naïve Bayes were used for sentiment classification. In the proposed methodology, transliteration was accurate up to 72%, and the Random Forest classifier gave the highest accuracy, 75%. |
en_US |
dc.language.iso |
en_US |
en_US |
dc.publisher |
Faculty of Applied Science, South Eastern University of Sri Lanka. |
en_US |
dc.subject |
Sinhala Sentiment Analysis |
en_US |
dc.subject |
Code-mixed comments |
en_US |
dc.subject |
Social media |
en_US |
dc.subject |
Transliteration |
en_US |
dc.title |
Sentimental analysis of comments in social media in Sinhala - English code-mixed language using supervised learning techniques |
en_US |
dc.type |
Article |
en_US |
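
The script-based tokenization and dictionary replacement described in the abstract could be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the word list, the Singlish-to-Sinhala dictionary entries, and all function names are hypothetical, and stemming, lemmatization, and classification are omitted. It assumes Sinhala tokens can be recognized by the Unicode Sinhala block (U+0D80 to U+0DFF).

```python
import re

# Hypothetical toy resources; the study's actual English word list and
# Singlish-to-Sinhala dictionary are not given in the abstract.
ENGLISH_WORDS = {"good", "song", "very", "nice"}
SINGLISH_TO_SINHALA = {"hari": "හරි", "lassanai": "ලස්සනයි"}

URL = re.compile(r"https?://\S+")
# Sinhala Unicode block, plus the zero-width joiner used in some conjuncts.
SINHALA_TOKEN = re.compile(r"[\u0D80-\u0DFF\u200D]+")
ROMAN_TOKEN = re.compile(r"[A-Za-z]+")


def split_by_script(comment):
    """Drop URLs, then return (Sinhala tokens, lowercased Roman tokens).

    Emojis and special symbols match neither pattern, so they are discarded.
    """
    text = URL.sub(" ", comment)
    sinhala = SINHALA_TOKEN.findall(text)
    roman = [t.lower() for t in ROMAN_TOKEN.findall(text)]
    return sinhala, roman


def label_roman_tokens(tokens):
    """Split Roman-script tokens into English words and assumed Singlish words."""
    english = [t for t in tokens if t in ENGLISH_WORDS]
    singlish = [t for t in tokens if t not in ENGLISH_WORDS]
    return english, singlish


def transliterate(tokens):
    """Replace Singlish tokens via the dictionary; unknown tokens are kept as-is."""
    return [SINGLISH_TO_SINHALA.get(t, t) for t in tokens]
```

For example, `split_by_script("hari lassanai song ගීතය")` separates the Sinhala token from the three Roman-script tokens, after which `label_roman_tokens` treats `song` as English and the remaining two as Singlish candidates for `transliterate`.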