Chulalongkorn University Theses and Dissertations (Chula ETD)

Enhancing large language models for legal question answering : a case study on the land and building tax act in Thailand

Other Title (Parallel Title in Other Language of ETD)

Improving large language models for legal question answering : a case study of the Land and Building Tax Act in Thailand

Nattapat Tantapong, Faculty of Engineering

Year (A.D.)

2025

Document Type

Thesis

First Advisor

Pittipol Kantavat

Second Advisor

Boonserm Kijsirikul

Faculty/College

Faculty of Engineering (คณะวิศวกรรมศาสตร์)

Department (if any)

Department of Computer Engineering (ภาควิชาวิศวกรรมคอมพิวเตอร์)

Degree Name

Master of Science

Degree Level

Master's Degree

Degree Discipline

Computer Science

DOI

10.58837/CHULA.THE.2025.189

Abstract

Thailand's Land and Buildings Tax Act requires interpreting multiple legal instruments. However, general-purpose large language models often produce answers that lack both accuracy and verifiable legal citations, which are essential for legal reasoning. This study investigates whether curriculum-structured fine-tuning enables small Thai-aligned models (8B parameters) to perform comparably to large commercial models. The research constructs a domain-specific corpus of 8,410 Q&A pairs with 30,612 hard-negative triplets, then evaluates four curriculum designs and four adapter ranks using retrieval completeness and answer quality metrics. The experimental results reveal two key findings. First, hard-negative retrieval training improves Multi-HitRate@5 by 2.6% and Multi-MRR@5 by 3.7%. Second, the Explicit→Implicit curriculum at rank-128 achieves a score of 0.862, matching GPT-4o with oracle retrieval (the upper bound of this research) and exceeding GPT-4o with standard retrieval (baseline) at 0.857. Importantly, the research finds that validation loss is unreliable for model selection: the model with the lowest validation loss underperforms the baseline by 7.4%, while the best model has higher validation loss but matches the upper bound. The results confirm that grounding-first curricula with domain-specific hard-negative data enable compact models to match frontier-model performance.
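
The abstract reports Multi-HitRate@5 and Multi-MRR@5 but does not define them. The sketch below shows one plausible reading, offered only as an illustration: it assumes a question can have several gold legal provisions, that Multi-HitRate@k counts a hit only when all of them appear in the top k results, and that Multi-MRR@k averages the reciprocal rank of each gold provision (0 when it is missing from the top k). The function names and section identifiers are hypothetical, and the thesis may define the metrics differently.

```python
from typing import Dict, Sequence, Set


def multi_hit_at_k(ranked: Sequence[str], gold: Set[str], k: int = 5) -> float:
    """Assumed definition: 1.0 only if every gold id is in the top-k results."""
    if not gold:
        return 0.0
    return 1.0 if gold <= set(ranked[:k]) else 0.0


def multi_mrr_at_k(ranked: Sequence[str], gold: Set[str], k: int = 5) -> float:
    """Assumed definition: mean reciprocal rank over all gold ids.

    A gold id that does not appear in the top-k results contributes 0.
    """
    if not gold:
        return 0.0
    first_rank: Dict[str, int] = {}
    for position, doc_id in enumerate(ranked[:k], start=1):
        # Keep the first (best) rank if an id is returned more than once.
        first_rank.setdefault(doc_id, position)
    total = sum(1.0 / first_rank[g] for g in gold if g in first_rank)
    return total / len(gold)


# Hypothetical retrieval result for a single question.
ranked_ids = ["sec37", "sec8", "sec40", "sec12", "sec3"]

complete = {"sec37", "sec40"}    # both gold provisions retrieved
incomplete = {"sec37", "sec99"}  # one gold provision missing

hit_complete = multi_hit_at_k(ranked_ids, complete)
hit_incomplete = multi_hit_at_k(ranked_ids, incomplete)
mrr_complete = multi_mrr_at_k(ranked_ids, complete)
mrr_incomplete = multi_mrr_at_k(ranked_ids, incomplete)
```

Under these assumed definitions, a question that cites two provisions counts as a hit only when both are retrieved, which is one way to capture the "retrieval completeness" the abstract refers to.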

Other Abstract (Other language abstract of ETD)

Thailand's Land and Buildings Tax Act requires the joint interpretation of multiple legal instruments. However, general-purpose large language models often give answers that lack both accuracy and verifiable citations to legal provisions, which are necessary for legal reasoning. This research examines whether using a curriculum structure in model training enables small Thai language models (8B parameters) to perform on par with large commercial models. The research builds a domain-specific corpus on land tax law comprising 8,410 question-answer pairs together with 30,612 hard-negative sets, then tests the models across 4 curriculum designs and 4 adapter ranks, measuring retrieval completeness and answer quality. The experimental results show two key points. First, training the retrieval model with hard-negative data improves retrieval completeness, with Multi-HitRate@5 increasing by 2.6% and Multi-MRR@5 by 3.7%. Second, the model using the Explicit→Implicit curriculum at rank-128 scores 0.862, on par with GPT-4o using oracle retrieval, which is the upper bound of this research, and higher than GPT-4o using standard retrieval (baseline) at 0.857. Importantly, the research finds that validation loss cannot be used reliably as a model selection criterion: the model with the lowest validation loss performs 7.4% worse than the baseline, while the best model has a higher validation loss but matches the upper bound. The findings confirm that using "grounding-first" curricula with hard-negative data enables small language models to perform on par with frontier large models.

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Recommended Citation

Tantapong, Nattapat, "Enhancing large language models for legal question answering : a case study on the land and building tax act in Thailand" (2025). Chulalongkorn University Theses and Dissertations (Chula ETD). 75084.
https://digital.car.chula.ac.th/chulaetd/75084

Included in

Computer Sciences Commons
