Chulalongkorn University Theses and Dissertations (Chula ETD)

Thai language sentiment analysis with a hybrid method on WangchanBERTa-CNN-BiLSTM

Other Title (Parallel Title in Other Language of ETD)

การจำแนกอารมณ์จากข้อความภาษาไทยด้วยวิธีการแบบลูกผสมบนแบบจำลอง WangchanBERTa-CNN-BiLSTM (Emotion classification from Thai text with a hybrid method on the WangchanBERTa-CNN-BiLSTM model)

Kasidhit Suraratchai, Faculty of Commerce and Accountancy

Year (A.D.)

2023

Document Type

Thesis

First Advisor

Suronapee Phoomvuthisarn

Faculty/College

Faculty of Commerce and Accountancy (คณะพาณิชยศาสตร์และการบัญชี)

Department (if any)

Department of Statistics (ภาควิชาสถิติ)

Degree Name

Master of Science

Degree Level

Master's Degree

Degree Discipline

Statistics and Data Science

DOI

10.58837/CHULA.THE.2023.1363

Abstract

Sentiment analysis, the task of understanding the emotions conveyed in text, is particularly important for languages that are not used globally, such as Thai. However, it is challenged by variation in text length, which significantly affects the results. Previous research [2] and [7] applied neural network and machine learning models, yet each model specializes in different aspects, so no single model covers sentiment analysis comprehensively. Recent research [13] has examined hybrid models such as CNN-BiLSTM and BiLSTM-CNN. Although they are effective, their performance still varies across datasets. For instance, CNN-BiLSTM excels on short sentences because it considers the context of surrounding words, while BiLSTM-CNN is more effective on long sentences because of its bidirectional learning capability. Although these models are promising, datasets that mix text lengths often lead them to misinterpret sentiment. To address these challenges, and inspired by research [15], we propose the Parallel Hybrid model. This approach integrates WangchanBERTa into both the CNN-BiLSTM and BiLSTM-CNN architectures and uses ensemble techniques to improve overall performance and adaptability. Our experiments were conducted on Wisesight, a highly imbalanced dataset with mostly longer texts, and Thai Children's Tales, a less imbalanced dataset with mostly shorter texts. They confirm the effectiveness of the Parallel Hybrid model, which outperforms the other model configurations with Macro F1 scores of 0.6270 and 0.7859, respectively. This research marks a significant advance in sentiment analysis for the Thai language.
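The abstract reports Macro F1 on datasets described as imbalanced. As a minimal sketch of why that metric suits imbalanced data, the Python below computes Macro F1 as the unweighted mean of per-class F1 scores. The function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the thesis, and this is not the author's evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up toy data (an assumption for illustration): 8 "neu" and 2 "neg"
# examples, with a classifier that always predicts the majority class.
y_true = ["neu"] * 8 + ["neg"] * 2
y_pred = ["neu"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.4f} macro_f1={score:.4f}")
```

On this toy data the majority-class predictor reaches an accuracy of 0.8 but a Macro F1 of about 0.44, because the minority class contributes an F1 of zero with equal weight.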

Other Abstract (Other language abstract of ETD)

Understanding the emotions conveyed in text matters, especially in languages that are not used globally, such as Thai, so sentiment analysis is highly important in Thailand. However, sentiment analysis is challenged by differences in text length, which strongly affect its results. Previous research [2] and [7] applied neural network and machine learning models, but each model specializes in different aspects, so none covers sentiment analysis fully. Recent research [13] examined hybrid models such as CNN-BiLSTM and BiLSTM-CNN. Although they are effective, their performance still varies across datasets. For example, CNN-BiLSTM performs better on short sentences because it considers the context of surrounding words, whereas BiLSTM-CNN is more effective on long sentences because of its bidirectional learning capability. Even though these models work effectively, differing text lengths within a dataset often lead to sentiment being misinterpreted. To address these challenges, and inspired by research [15], we propose the Parallel Hybrid model. This approach integrates WangchanBERTa into the CNN-BiLSTM and BiLSTM-CNN architectures and uses ensemble techniques to improve overall performance and adaptability. Our experiments were conducted on the Wisesight dataset, a highly imbalanced dataset consisting mostly of long texts, and 40 Thai Children's Tales, a less imbalanced dataset consisting mostly of short texts. The results confirm the effectiveness of the Parallel Hybrid model, which outperforms other model configurations with Macro F1 scores of 0.6270 and 0.7859, respectively. This research is an important advance in sentiment analysis for the Thai language.

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Recommended Citation

Suraratchai, Kasidhit, "Thai language sentiment analysis with a hybrid method on WangchanBERTa-CNN-BiLSTM" (2023). Chulalongkorn University Theses and Dissertations (Chula ETD). 73869.
https://digital.car.chula.ac.th/chulaetd/73869

Included in

Statistics and Probability Commons
