Chulalongkorn University Theses and Dissertations (Chula ETD)

Improving Chinese hate speech detection with BERT-FastText fusion and BERT-BiLSTM fusion

Other Title (Parallel Title in Other Language of ETD)

การปรับปรุงการตรวจหาประทุษวาจาภาษาจีนด้วยการรวมตัวเบิร์ท-ฟาสเท็กซ์และการรวมตัวเบิร์ท-ไบแอลเอสทีเอ็ม

Methini Ma, Faculty of Science

Year (A.D.)

2025

Document Type

Thesis

First Advisor

Arthorn Luangsodsai

Second Advisor

Pakawan Pugsee

Faculty/College

Faculty of Science (คณะวิทยาศาสตร์)

Department (if any)

Department of Mathematics and Computer Science (ภาควิชาคณิตศาสตร์และวิทยาการคอมพิวเตอร์)

Degree Name

Master of Science

Degree Level

Master's Degree

Degree Discipline

Computer Science and Information Technology

DOI

10.58837/CHULA.THE.2025.213

Abstract

Hate speech detection is essential in online environments, especially on social media platforms, where it helps create safer spaces and reduce the risk of real-world harm. In Chinese, the task is particularly challenging because of the language's distinctive linguistic structures and the frequent use of indirect expressions, sarcasm, homophones, character variants, and abbreviations. This study investigates how to improve Chinese hate speech detection by combining BERT with FastText and BERT with BiLSTM. Six model variants are configured: frozen BERT and fine-tuned BERT, each used alone and extended with either FastText sentence embeddings or a clause-level BiLSTM. Experiments are conducted on a self-annotated Chinese social media dataset and the public COLDataset corpus, including a cross-dataset setting in which models are trained on the self-annotated data and evaluated on COLDataset. The results show that fine-tuning BERT is the main source of performance gains, and that adding FastText or BiLSTM improves over the corresponding BERT baselines. Among all models, fine-tuned BERT combined with FastText achieves the best in-domain performance, reaching 92.58% accuracy on the self-annotated dataset, while also achieving a strong ROC-AUC in the cross-dataset evaluation. Overall, these findings indicate that simple feature-level fusion of BERT with lexical or clause-level information is an effective and computationally practical way to improve Chinese hate speech detection.
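As a rough illustration of what feature-level fusion can look like, the sketch below concatenates a BERT sentence vector with a FastText sentence vector and passes the result to a logistic classification head. It is not the thesis implementation: the function names `fuse` and `predict_proba`, the dimensions (768 and 300), the random stand-in features, and the zero-initialised weights are all assumptions made for illustration, and concatenation is only one possible fusion operator.

```python
import math
import random


def fuse(bert_vec, aux_vec):
    """Feature-level fusion by concatenation (an assumed fusion operator)."""
    return list(bert_vec) + list(aux_vec)


def predict_proba(fused, weights, bias):
    """Logistic head: probability that the input is hate speech."""
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Stand-in features. In a real pipeline these would come from a BERT
# encoder and a FastText model; both sizes below are assumptions.
rng = random.Random(0)
bert_vec = [rng.uniform(-1.0, 1.0) for _ in range(768)]
fasttext_vec = [rng.uniform(-1.0, 1.0) for _ in range(300)]

fused = fuse(bert_vec, fasttext_vec)

# Untrained head: all-zero weights, so every input maps to probability 0.5.
weights = [0.0] * len(fused)
prob = predict_proba(fused, weights, bias=0.0)

print(len(fused), prob)
```

In this framing, a frozen-BERT variant would train only the classification head on fixed BERT features, while a fine-tuned variant would also update the encoder that produces `bert_vec`.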

Other Abstract (Other language abstract of ETD)

การตรวจหาประทุษวาจาเป็นเทคนิคที่มีความจำเป็นอย่างยิ่งในสภาพแวดล้อมออนไลน์ โดยเฉพาะบนแพลตฟอร์มโซเชียลมีเดีย เทคนิคนี้ช่วยสร้างพื้นที่ที่ปลอดภัยยิ่งขึ้นและลดความเสี่ยงต่ออันตรายในโลกแห่งความเป็นจริง สำหรับภาษาจีนนั้นงานนี้มีความท้าทายเป็นพิเศษ เนื่องจากโครงสร้างภาษาเฉพาะตัว รวมถึงการใช้ถ้อยคำอ้อมค้อม การประชดประชัน คำพ้องเสียง ตัวอักษรที่หลากหลาย และคำย่ออยู่บ่อยครั้ง งานวิจัยนี้ศึกษาวิธีการพัฒนาการตรวจหาประทุษวาจา โดยการรวม BERT เข้ากับ FastText และรวม BERT เข้ากับ BiLSTM มีการกำหนดรูปแบบโมเดลทั้งหมดหกรูปแบบ ได้แก่ frozen BERT และ fine-tuned BERT โดยแต่ละแบบขยายเพิ่มด้วยเวกเตอร์ของประโยคจาก FastText หรืออนุประโยคจาก BiLSTM การทดลองดำเนินการบนชุดข้อมูลโซเชียลมีเดียภาษาจีนที่จัดทำและใส่ป้ายกำกับเอง รวมถึงชุดข้อมูลสาธารณะ COLDataset นอกจากนี้ยังมีการทดลองแบบข้ามชุดข้อมูลโดยฝึกโมเดลด้วยชุดข้อมูลที่ใส่ป้ายกำกับเองและทดสอบด้วยชุดข้อมูล COLDataset ผลลัพธ์แสดงให้เห็นว่า fine-tuned BERT เป็นปัจจัยสำคัญที่ทำให้ประสิทธิภาพดีขึ้นและการรวมกับ FastText หรือ BiLSTM ยังช่วยเพิ่มคุณภาพเมื่อเทียบกับโมเดล BERT พื้นฐาน จากโมเดลทั้งหมด fine-tuned BERT ที่รวมกับ FastText ให้ประสิทธิภาพดีที่สุดภายในชุดข้อมูลเดียวกัน โดยได้ความแม่นยำ 92.58% บนชุดข้อมูลที่ใส่ป้ายกำกับเอง และยังได้ค่า ROC–AUC ที่ดีในการประเมินแบบข้ามชุดข้อมูล โดยรวมแล้วผลการศึกษาชี้ให้เห็นว่าการรวมคุณลักษณะ (feature-level fusion) ระหว่าง BERT กับข้อมูลเชิงคำศัพท์หรืออนุประโยค เป็นวิธีที่มีประสิทธิภาพและใช้งานได้จริงในการตรวจหาประทุษวาจาในภาษาจีน.

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Recommended Citation

Ma, Methini, "Improving Chinese hate speech detection with BERT-FastText fusion and BERT-BiLSTM fusion" (2025). Chulalongkorn University Theses and Dissertations (Chula ETD). 75209.
https://digital.car.chula.ac.th/chulaetd/75209

Included in

Computer Sciences Commons
