Chulalongkorn University Theses and Dissertations (Chula ETD)

Thai tokenizer invariant classification based on Bi-LSTM and DistilBERT encoders

Other Title (Parallel Title in Other Language of ETD)

การจำแนกที่ไม่แปรเปลี่ยนตามโทเคนไนเซอร์ภาษาไทยบนฐานของตัวเข้ารหัสไบแอลเอสทีเอ็มและดิสทิลเบิร์ต

Noppadol Kongsumran, Faculty of Science

Year (A.D.)

2021

Document Type

Thesis

First Advisor

Suphakant Phimoltares

Faculty/College

Faculty of Science (คณะวิทยาศาสตร์)

Department (if any)

Department of Mathematics and Computer Science (ภาควิชาคณิตศาสตร์และวิทยาการคอมพิวเตอร์)

Degree Name

Master of Science

Degree Level

Master's Degree

Degree Discipline

Computer Science and Information Technology

DOI

10.58837/CHULA.THE.2021.113

Abstract

Natural language processing (NLP) is a field of artificial intelligence concerned with teaching computers to understand human language. Researchers can feed text in a particular language, of any length and granularity (characters, words, or sentences), into an algorithm to extract a summarized numerical representation of its context. To obtain an array of words from Thai text, a tokenization step is needed, because Thai sentences are written continuously without spaces between words. In general, different tokenizers can produce different sets of words from the same sentence, which leads to uncontrolled variation in accuracy on NLP and related tasks. This research introduces a method that addresses the discrepancies between Thai tokenizers by aligning their tokenization results in a similar direction using neural network encoders. Bi-LSTM and DistilBERT encoders are trained with triplet hard loss to transform sets of words into a new domain in which the vectors of similar sentences are significantly closer together. Finally, twenty-eight classifiers are built from two types of encoders and seven different tokenizers, each with and without the proposed method (2 × 7 × 2 = 28), for comparison and analysis. To demonstrate that the proposed approach can serve as a pre-training method for other tasks, sentiment datasets are used to measure classification accuracy and to examine the similarity of results across all classifiers.
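
The abstract names triplet hard loss but does not give its formulation. As a rough illustration only, the sketch below shows one common "batch-hard" variant in plain Python. Everything here is an assumption rather than the thesis's implementation: the function name `batch_hard_triplet_loss`, the Euclidean distance, the margin of 1.0, and the choice to treat embeddings of the same sentence under different tokenizers as positives.

```python
import math

def batch_hard_triplet_loss(embeddings, sentence_ids, margin=1.0):
    """Hypothetical batch-hard triplet loss over one batch.

    embeddings   -- list of equal-length vectors, one per tokenized sentence
    sentence_ids -- list of labels; entries sharing an id are assumed to be
                    the same sentence processed by different tokenizers
    margin       -- assumed margin value, not taken from the thesis
    """
    n = len(embeddings)
    losses = []
    for a in range(n):
        # Distances from the anchor to other embeddings of the same sentence.
        pos = [math.dist(embeddings[a], embeddings[j])
               for j in range(n)
               if j != a and sentence_ids[j] == sentence_ids[a]]
        # Distances from the anchor to embeddings of different sentences.
        neg = [math.dist(embeddings[a], embeddings[j])
               for j in range(n)
               if sentence_ids[j] != sentence_ids[a]]
        if not pos or not neg:
            continue  # anchor has no usable triplet in this batch
        # Hardest positive is the farthest; hardest negative is the closest.
        losses.append(max(max(pos) - min(neg) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0

# Two sentences, each embedded from two tokenizations.
aligned = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
misaligned = [[0.0, 0.0], [4.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
ids = [0, 0, 1, 1]
print(batch_hard_triplet_loss(aligned, ids))     # 0.0
print(batch_hard_triplet_loss(misaligned, ids))  # 4.0
```

In this toy batch the loss is zero once the two tokenizations of each sentence sit closer to each other than to any other sentence by at least the margin, which is the kind of alignment the abstract describes.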

Other Abstract (Other language abstract of ETD)

การประมวลผลภาษาธรรมชาติเป็นหัวข้อในปัญญาประดิษฐ์เพื่อสอนคอมพิวเตอร์ให้เข้าใจภาษามนุษย์ นักวิจัยสามารถป้อนข้อความของภาษาที่เจาะจงในความยาวและประเภทใดๆ เช่น อักขระ คำ และประโยคไปยังขั้นตอนวิธี เพื่อแยกบริบทที่สรุปในรูปของตัวเลข และเพื่อให้ยอมรับอาร์เรย์ของคำในภาษาไทย กระบวนการตัดคำจึงจำเป็นใช้แยกข้อความเป็นคำเนื่องจากประโยคแต่ละประโยคเขียนต่อเนื่องกันโดยไม่มีช่องว่างระหว่างคำ โดยทั่วไปแล้วตัวตัดคำที่ต่างกันสามารถสร้างชุดคำที่ต่างกันจากประโยคเดียวได้ ส่งผลให้ไม่สามารถควบคุมความแม่นยำในการประมวลผลภาษาธรรมชาติและปัญหาที่สัมพันธ์ได้ ในงานวิจัยนี้วิธีที่ใช้แก้ปัญหาผลลัพธ์ที่แตกต่างกันจากตัวตัดคำภาษาไทยที่แตกต่างกันได้ถูกนำเสนอโดยวางแนวผลการตัดคำให้อยู่ในทิศทางเดียวกันโดยใช้ตัวเข้ารหัสโครงข่ายประสาท ไบแอลเอสทีเอ็มและดิสทิลเบิร์ต ร่วมกับค่าสูญเสียถาวรทริปเลตใช้เพื่อฝึกและแปลงชุดคำให้เป็นข้อมูลในโดเมนใหม่ที่เวกเตอร์ของแต่ละประโยคที่คล้ายกันอยู่ใกล้กันมากขึ้น ในท้ายที่สุดตัวจำแนกจำนวนยี่สิบแปดตัวถูกสร้างขึ้นมาโดยใช้ตัวเข้ารหัสสองประเภท ตัวตัดคำเจ็ดตัว โดยใช้หรือไม่ใช้วิธีที่เสนอเพื่อการเปรียบเทียบและการวิเคราะห์ และเพื่อแสดงว่าวิธีที่เสนอสามารถใช้เป็นวิธีเริ่มฝึกฝนสำหรับงานอื่นๆ ได้ ชุดข้อมูลความรู้สึกถูกใช้เพื่อวัดความแม่นการจำแนกและตรวจสอบความเหมือนของผลที่ได้จากตัวจำแนกทั้งหมด

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Recommended Citation

Kongsumran, Noppadol, "Thai tokenizer invariant classification based on bi-lstm and distilbert encoders" (2021). Chulalongkorn University Theses and Dissertations (Chula ETD). 4655.
https://digital.car.chula.ac.th/chulaetd/4655

Included in

Computer Sciences Commons
