Chulalongkorn University Theses and Dissertations (Chula ETD)

Semi-supervised Thai sentence segmentation using local and distant word representations

Other Title (Parallel Title in Other Language of ETD)

การตัดประโยคภาษาไทยแบบกึ่งมีผู้สอนโดยใช้ตัวแทนของคำประเภทเฉพาะที่และไกล

Chanatip Saetia, Faculty of Engineering

Year (A.D.)

2020

Document Type

Thesis

First Advisor

Peerapon Vateekul

Faculty/College

Faculty of Engineering (คณะวิศวกรรมศาสตร์)

Department (if any)

Department of Computer Engineering (ภาควิชาวิศวกรรมคอมพิวเตอร์)

Degree Name

Master of Engineering

Degree Level

Master's Degree

Degree Discipline

Computer Engineering

DOI

10.58837/CHULA.THE.2020.136

Abstract

A sentence is typically treated as the minimal syntactic unit for extracting valuable information from a longer piece of text. However, written Thai has no explicit sentence markers. We propose a deep learning model for sentence segmentation that makes three main contributions. First, we integrate n-gram embeddings as a local representation to capture word groups near sentence boundaries. Second, to focus on the keywords of dependent clauses, we combine the model with a distant representation obtained from self-attention modules. Finally, because labeled data are scarce and annotation is difficult and time-consuming, we investigate and adapt two techniques that allow us to use unlabeled data: Cross-View Training (CVT), a semi-supervised learning technique, and a pre-trained language model (ELMo) to improve the representation. In the Thai sentence segmentation experiments, our model reduces the relative error by 7.4% and 18.5% compared with the baseline models on the Orchid and UGWC datasets, respectively. We also apply our model to punctuation restoration on the IWSLT English dataset, where it outperforms prior sequence tagging models with a relative error reduction of 7.6%. Ablation studies show that the n-gram representation is the main contributing factor for Thai, while semi-supervised training helps the most for English.
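
The relative error reductions quoted in the abstract can be read with a small worked sketch. It assumes that error is measured as 100 minus a score in percent (for example, an F1 score), which the abstract itself does not state. The function name and the example scores are hypothetical and are not results from the thesis.

```python
def relative_error_reduction(baseline_score: float, model_score: float) -> float:
    """Return the relative error reduction in percent.

    Assumption: scores are percentages in [0, 100] and error = 100 - score.
    """
    baseline_error = 100.0 - baseline_score
    model_error = 100.0 - model_score
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical scores for illustration only (not taken from the thesis):
# the baseline error is 10 points and the model error is 9 points.
print(relative_error_reduction(90.0, 91.0))
```

Under this reading, a gain of one point counts for more when the baseline is already strong, because the remaining error that can be removed is smaller.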

Other Abstract (Other language abstract of ETD)

A sentence is the smallest grammatical unit that conveys a complete idea, and it helps divide long text into smaller units. However, Thai has no explicit sentence boundary markers. We therefore developed a deep learning model for segmenting text into sentences, which consists of three components. First, we use a representation of neighboring words, or local representation, to capture groups of words near sentence boundaries. Second, we attend to the words of dependent clauses that cannot stand alone, using a distant representation obtained from an attention mechanism. Finally, we use two techniques to exploit unlabeled data, because labeled data are scarce and annotation is difficult and time-consuming. The first technique is Cross-View Training, a semi-supervised learning method, and the second is a pre-trained language model used to improve the data representation. In the Thai sentence segmentation experiments, our model reduced the relative error by 7.4% and 18.5% compared with previous models on the Orchid and UGWC datasets, respectively. We also tested the model on a closely related English task, the prediction of missing punctuation, where it reduced the relative error by 7.6% compared with previous models. The study found that the representation of neighboring words was the main factor in the improvement for Thai, whereas semi-supervised learning was the main factor for English.

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Recommended Citation

Saetia, Chanatip, "Semi-supervised Thai sentence segmentation using local and distant word representations" (2020). Chulalongkorn University Theses and Dissertations (Chula ETD). 126.
https://digital.car.chula.ac.th/chulaetd/126

Included in

Artificial Intelligence and Robotics Commons, Computer Engineering Commons
