Chulalongkorn University Theses and Dissertations (Chula ETD)

A robust system for core thai natural language processing technologies

Other Title (Parallel Title in Other Language of ETD)

ระบบแบบทนทานสำหรับเทคโนโลยีหลักในการประมวลผลภาษาธรรมชาติภาษาไทย

Can Udomcharoenchaikit, Faculty of Engineering

Year (A.D.)

2020

Document Type

Thesis

First Advisor

Peerapon Vateekul

Second Advisor

Prachya Boonkwan

Faculty/College

Faculty of Engineering (คณะวิศวกรรมศาสตร์)

Department (if any)

Department of Computer Engineering (ภาควิชาวิศวกรรมคอมพิวเตอร์)

Degree Name

Doctor of Philosophy

Degree Level

Doctoral Degree

Degree Discipline

Computer Engineering

DOI

10.58837/CHULA.THE.2020.127

Abstract

As the amount of unstructured textual data grows, it becomes increasingly important to build an intelligent system that can process it. Natural Language Processing (NLP) is a technology that allows a computer to exploit human languages to perform tasks. Deep learning models have shown excellent results across fundamental tasks in NLP, such as word segmentation, part-of-speech tagging, and named-entity recognition. However, in many situations, these proposed methods fail to perform well. For an NLP system to be robust, it must address issues such as out-of-vocabulary and spelling-mistakes. This thesis's research goal is to develop NLP models that can handle malformed texts to improve their real-world setting usability. In this thesis, I propose novel models and evaluations that focus on robustness against malformed texts. This dissertation proposes multiple novel training strategies and architectures to improve the robustness against malformed texts. This thesis explores input data manipulation strategies that diversify training data, such as UNK masking and adversarial training. It explores how sub-lexical information can improve the robustness of word embeddings. Furthermore, it examines similarity constraint techniques, such as triplet loss, which constraint the similarity between the original texts and the parallel perturbed texts. I also propose alternative evaluation schemes that reveal the weaknesses of NLP systems by introducing typographical adversarial examples to the test sets. Our adversarial evaluation schemes show that current deep learning models are not robust against misspelled inputs, and they also show that our proposed training strategies and architectures can improve the performance over malformed texts.

Other Abstract (Other language abstract of ETD)

เมื่อข้อมูลที่เป็นข้อความภาษามีจำนวนมากขึ้นการสร้างระบบอัจฉริยะที่สามารถประมวลผลภาษามนุษย์ได้จึงมีความสำคัญมากขึ้น ระบบประมวลผลภาษาธรรมชาติเป็นเทคโนโลยีที่ช่วยให้คอมพิวเตอร์ใช้ประโยชน์จากภาษาของมนุษย์เพื่อทำงานต่าง ๆ จึงมีความจำเป็นมากขึ้น โมเดลการเรียนรู้เชิงลึกได้แสดงผลลัพธ์ที่ยอดเยี่ยมในงานพื้นฐานในการประมวลผลภาษาธรรมชาติ เช่น การตัดคำ การจำแนกชนิดของคำ และการรู้จำชื่อเฉพาะ อย่างไรก็ตามในบาง สถานการณ์วิธีการที่เสนอเหล่านี้ไม่สามารถทำงานได้ดีเท่าที่ควร เพื่อให้ระบบประมวลผลภาษาธรรมชาติมีเสถียรภาพมากขึ้น เราควรแก้ไขปัญหาที่ปรากฏขึ้นบ่อยครั้ง และมัอิทธิพลต่อประสิทธิภาพของระบบ ได้แก่ ปัญหาการรับมือกับคำศัพท์ที่ไม่เคยพบและคำสะกดผิด เป้าหมายการวิจัยของวิทยานิพนธ์นี้คือการพัฒนาแบบระบบประมวลผลภาษาธรรมชาติที่สามารถจัดการกับข้อความที่สะกดผิดเพื่อปรับปรุงโมเดลให้ใช้งานได้ดีขึ้นเมื่อนำไปใช้จริง วิทยานิพนธ์นี้เสนอโมเดลการเรียนรู้ของเครื่องและการประเมินผลแบบใหม่ที่มุ่งเน้นไปที่การเพิ่มความทนทานต่อข้อความที่มีการสะกดผิดรูปแบบ วิทยานิพนธ์ฉบับนี้เสนอกลยุทธ์และระบบประมวลผลภาษาธรรมชาติใหม่ เพื่อปรับปรุงความทนทานต่อคำสะกดผิด วิทยานิพนธ์นี้สำรวจกลยุทธ์การจัดการข้อมูลอินพุตที่ทำให้ข้อมูลอินพุตมีความหลากหลายมากขึ้น เช่นการใส่หน้ากากคำที่ไม่เคยพบ (UNK Masking) และการฝึกปรปักษ์ (Adversarial Training) วิทยานิพนธ์ฉบับนี้สำรวจว่าหน่วยของภาษาที่เล็กกว่าคำสามารถปรับปรุงความแข็งแกร่งของการฝังคำได้อย่างไร นอกจากนี้ยังตรวจสอบเทคนิคการ จำกัดความคล้ายคลึงกันระหว่างข้อความเช่นการใช้ฟังก์ชันการสูญเสียแบบชุดสาม (Triplet Loss) เพื่อจำกัดความคล้ายคลึงกันระหว่างข้อความต้นฉบับกับข้อความที่สะกดผิด นอกจากนี้ยังเสนอรูปแบบการประเมินแบบใหม่ที่เปิดเผยจุดอ่อนของระบบประมวลผลภาษาธรรมชาติ โดยการใส่ตัวอย่างปรปักษ์ (Adversarial Examples) จากการพิมพ์ผิดลงไปในชุดข้อมูลสำหรับทดสอบ แผนการประเมินแบบปรปักษ์ (Adversarial Evaluation) ที่ได้เสนอในวิทยานิพนธ์ฉบับนี้แสดงให้เห็นว่าแบบจำลองการเรียนรู้เชิงลึกในปัจจุบันไม่ทนทานเมื่อเจอข้อมูลที่สะกดผิดและยังแสดงให้เห็นว่ากลยุทธ์และสถาปัตยกรรมระบบประมวลผลภาษาธรรมชาติของเราสามารถปรับปรุงประสิทธิภาพได้เมื่อเจอข้อความที่มีการสะกดผิด

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Recommended Citation

Udomcharoenchaikit, Can, "A robust system for core thai natural language processing technologies" (2020). Chulalongkorn University Theses and Dissertations (Chula ETD). 125.
https://digital.car.chula.ac.th/chulaetd/125

Download

Included in

Artificial Intelligence and Robotics Commons, Computer Engineering Commons

COinS

Chulalongkorn University Theses and Dissertations (Chula ETD)

A robust system for core thai natural language processing technologies

Other Title (Parallel Title in Other Language of ETD)

Year (A.D.)

Document Type

First Advisor

Second Advisor

Faculty/College

Department (if any)

Degree Name

Degree Level

Degree Discipline

DOI

Abstract

Other Abstract (Other language abstract of ETD)

Creative Commons License

Recommended Citation

Included in

Search

Browse

Author Corner

Chulalongkorn University Theses and Dissertations (Chula ETD)

A robust system for core thai natural language processing technologies

Other Title (Parallel Title in Other Language of ETD)

Author

Year (A.D.)

Document Type

First Advisor

Second Advisor

Faculty/College

Department (if any)

Degree Name

Degree Level

Degree Discipline

DOI

Abstract

Other Abstract (Other language abstract of ETD)

Creative Commons License

Recommended Citation

Included in

Share

Search

Browse

Author Corner