Chulalongkorn University Theses and Dissertations (Chula ETD)

การสังเคราะห์ข้อความเพื่อเพิ่มตัวอย่างการตรวจจับข้อความประทุษวาจาในข้อความภาษาไทย

Other Title (Parallel Title in Other Language of ETD)

Text synthesis to add examples for detecting hate speech in Thai messages

ธโนภาส วรรณวโรทร, Faculty of Engineering

Year (A.D.)

2021

Document Type

Thesis

First Advisor

สุกรี สินธุภิญโญ

Faculty/College

Faculty of Engineering (คณะวิศวกรรมศาสตร์)

Department (if any)

Department of Computer Engineering (ภาควิชาวิศวกรรมคอมพิวเตอร์)

Degree Name

Master of Science (วิทยาศาสตรมหาบัณฑิต)

Degree Level

Master's Degree (ปริญญาโท)

Degree Discipline

Computer Science (วิทยาศาสตร์คอมพิวเตอร์)

DOI

10.58837/CHULA.THE.2021.852

Abstract

This research studies an approach to the problem of classifying hate speech messages: synthesizing text to correct the class imbalance found in data collected from Twitter. After collecting, cleaning, and labeling the data, the researcher created additional samples using three methods: 1. Synthetic Minority Over-sampling Technique (SMOTE), 2. text generation, and 3. word embedding. These methods synthesize extra samples so that the dataset becomes balanced. The newly created dataset was then used with three models for classifying hate speech: 1. the Naive Bayes algorithm, 2. Long Short-Term Memory (LSTM), and 3. Long Short-Term Memory combined with a convolutional neural network (LSTM + CNN), in order to separate hate speech from ordinary messages. In the first experiment, the imbalanced data was used, and all three classification models gave accuracy that was lower than it should be. After the imbalance in the dataset was corrected, accuracy increased for every dataset and every model.

Other Abstract (Other language abstract of ETD)

In this work, we address the problem of classifying text messages that contain hate speech by synthesizing messages to correct the class imbalance in a text corpus collected from Twitter. After collecting, cleaning, and labeling the data, we augmented the samples using three methods: 1) Synthetic Minority Over-sampling Technique (SMOTE), 2) text generation, and 3) word embedding. We used three text classification techniques: Naive Bayes, Long Short-Term Memory (LSTM), and a combination of LSTM and a Convolutional Neural Network (CNN). Classification accuracy on the imbalanced text data was not high. However, after we added synthesized minority-class text to the training set, accuracy became higher for all classification models.
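As a rough illustration of the first augmentation method named in the abstract, the sketch below shows the core idea of SMOTE: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbours. It is not the thesis's implementation. The function name `smote`, the parameter values, and the toy two-dimensional vectors are all assumptions made for this example, and it assumes the messages have already been converted to numeric feature vectors, since SMOTE operates on numbers rather than raw text.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy feature vectors (for example, 2-d message embeddings).
# These are illustrative values, not data from the thesis.
majority = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0], [0.2, 0.3]]
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]

# Generate just enough synthetic minority points to match the majority class.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a line segment between two real minority points, the new samples stay inside the region the minority class already occupies. In practice a maintained library implementation would normally be preferred over a hand-written one.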

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Recommended Citation

วรรณวโรทร, ธโนภาส, "การสังเคราะห์ข้อความเพื่อเพิ่มตัวอย่างการตรวจจับข้อความประทุษวาจาในข้อความภาษาไทย" (2021). Chulalongkorn University Theses and Dissertations (Chula ETD). 5394.
https://digital.car.chula.ac.th/chulaetd/5394

Included in

Computer Sciences Commons
