Chulalongkorn University Theses and Dissertations (Chula ETD)

การจับคู่ประโยคที่ตรงกันในคลังข้อความขนานด้วยอนุกรมเวลา

Other Title (Parallel Title in Other Language of ETD)

Sentence alignment in parallel text corpora using time series

ศิรินันท์ สินธุวาทิน, Faculty of Engineering

Year (A.D.)

2007

Document Type

Thesis

First Advisor

โชติรัตน์ รัตนามหัทธนะ

Faculty/College

Faculty of Engineering (คณะวิศวกรรมศาสตร์)

Degree Name

Master of Science

Degree Level

Master's Degree

Degree Discipline

Computer Science

DOI

10.58837/CHULA.THE.2007.1180

Abstract

Applications built on parallel text corpora are steadily increasing, especially in cross-language retrieval, machine and human translation, and natural language processing, so the processing of parallel corpora has attracted growing interest from researchers. This research proposes a technique for aligning corresponding sentences in a parallel corpus using time series. The time series store the frequency and position of the words that appear in a parallel corpus of any two languages, and words are aligned by measuring the similarity of their time series. The advantage of this method is that it requires no linguistic knowledge such as grammar, syntax, sentence structure, or dictionary translation. Words that correspond to each other in a multilingual parallel corpus tend to have similar frequencies and positions of occurrence, so sentences can be aligned by using these words as indicators. However, many words still cannot be aligned in this way. The experiments show that the method is useful and performs better on short parallel texts of about one page than on long texts. On short texts, using the Manhattan distance function, the average accuracy is 58 percent.

Other Abstract (Other language abstract of ETD)

As applications based on parallel corpora (parallel texts) have expanded, especially in cross-language information retrieval, machine and human translation, natural language processing, and multilingual lexicography, parallel-text processing has become central to their development. In this research, we propose a novel sentence alignment technique. We represent each word as a time series that records the positions and frequency of its occurrences, which requires no linguistic knowledge such as grammar, syntax, sentence structure, or dictionary lookup. We align words by measuring the similarity of their time series, and the resulting word alignments are then used for sentence alignment. Our intuition is that corresponding words in a multilingual parallel text should have similar frequencies and positions of occurrence. However, the experimental results reveal several limitations: the method works better on short parallel texts of about one page than on longer ones. On short parallel texts, using the Manhattan distance, the method achieves an average accuracy of 58 percent.
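
The word-alignment idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the binning of positions into a fixed number of segments, the function names `word_series`, `manhattan`, and `align_words`, and the nearest-neighbour matching rule are all assumptions made here for illustration only.

```python
from collections import defaultdict


def word_series(tokens, bins=4):
    """Map each word to its occurrence counts per relative-position bin.

    Assumption: the text is split into `bins` equal segments, and the
    "time series" of a word is how often it occurs in each segment.
    """
    series = defaultdict(lambda: [0] * bins)
    n = len(tokens)
    for i, tok in enumerate(tokens):
        # i < n, so the bin index always falls in range(bins)
        series[tok][i * bins // n] += 1
    return dict(series)


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length series."""
    return sum(abs(x - y) for x, y in zip(a, b))


def align_words(src_tokens, tgt_tokens, bins=4):
    """Pair each source word with the target word whose series is closest.

    Hypothetical matching rule: plain nearest neighbour, ties broken by
    first occurrence in the target text.
    """
    src = word_series(src_tokens, bins)
    tgt = word_series(tgt_tokens, bins)
    return {
        word: min(tgt, key=lambda t: manhattan(s, tgt[t]))
        for word, s in src.items()
    }
```

For example, with the toy token lists `["a", "b", "a", "c"]` and `["x", "y", "x", "z"]`, `align_words` pairs `a` with `x`, `b` with `y`, and `c` with `z`, because each pair has an identical occurrence pattern. Words with flat or very frequent patterns would be ambiguous under this rule, which is in line with the limitation the abstract reports for longer texts.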

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Recommended Citation

สินธุวาทิน, ศิรินันท์, "การจับคู่ประโยคที่ตรงกันในคลังข้อความขนานด้วยอนุกรมเวลา" (2007). Chulalongkorn University Theses and Dissertations (Chula ETD). 66679.
https://digital.car.chula.ac.th/chulaetd/66679
