Chulalongkorn University Theses and Dissertations (Chula ETD)

A comparison of imbalanced data handling methods for pre-trained model in multi-label classification of stack overflow

Other Title (Parallel Title in Other Language of ETD)

การเปรียบเทียบวิธีการจัดการข้อมูลที่ไม่สมดุลสำหรับแบบจำลองที่ได้รับการฝึกฝนแล้วสำหรับวิธีการจำแนกประเภทแบบหลายลาเบลในสแต็กโอเวอร์โฟลว์

Arisa Umparat, Faculty of Commerce and Accountancy

Year (A.D.)

2022

Document Type

Thesis

First Advisor

Suronapee Phoomvuthisarn

Faculty/College

Faculty of Commerce and Accountancy (คณะพาณิชยศาสตร์และการบัญชี)

Department (if any)

Department of Statistics (ภาควิชาสถิติ)

Degree Name

Master of Science

Degree Level

Master's Degree

Degree Discipline

Statistics

DOI

10.58837/CHULA.THE.2022.338

Abstract

Tag classification is essential in Stack Overflow. Instead of combing through pages or replies of irrelevant information, users can quickly pinpoint relevant posts and answers using tags. Because user-submitted posts can have multiple tags, classifying tags in Stack Overflow is challenging, and the result is an imbalance among the labels across the whole label set. Pre-trained deep learning models can improve tag classification accuracy even with small datasets, and common multi-label resampling techniques combined with machine learning classifiers can also mitigate the imbalance. Still, few studies have explored which resampling technique improves the performance of pre-trained deep models for predicting tags. To address this gap, we conducted an experiment to evaluate the effectiveness of ELECTRA, a powerful pre-trained deep learning model, combined with various multi-label resampling techniques in reducing the imbalance that induces mislabeling of tagged Stack Overflow posts. We compared six resampling techniques, namely ML-ROS, MLSMOTE, MLeNN, MLTL, ML-SOL, and REMEDIAL, to find the best method to mitigate the imbalance and improve tag prediction accuracy. Our results show that MLTL is the most effective choice for tackling imbalance in multi-label classification on our Stack Overflow data in the deep learning setting. MLTL achieved 0.517, 0.804, 0.467, and 0.98 on Precision@1, Recall@5, F1-score@1, and AUC, respectively. In contrast, MLeNN achieved only 0.323, 0.648, 0.277, and 0.95 on the same metrics.
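
The abstract reports ranking metrics at a cutoff k (Precision@1, Recall@5, F1-score@1) without defining them. The sketch below shows one common way such metrics are computed for a single post, given the model's tags ranked by score. It is an illustration only: the function names, the example tags, and the choice to divide precision by k (rather than by min(k, number of true tags), as some tag-recommendation papers do) are assumptions, not definitions taken from the thesis.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], true_tags: Set[str], k: int) -> float:
    """Fraction of the top-k predicted tags that are true tags (assumed definition)."""
    hits = sum(1 for tag in ranked[:k] if tag in true_tags)
    return hits / k


def recall_at_k(ranked: Sequence[str], true_tags: Set[str], k: int) -> float:
    """Fraction of the true tags that appear among the top-k predictions."""
    if not true_tags:
        return 0.0
    hits = sum(1 for tag in ranked[:k] if tag in true_tags)
    return hits / len(true_tags)


def f1_at_k(ranked: Sequence[str], true_tags: Set[str], k: int) -> float:
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(ranked, true_tags, k)
    r = recall_at_k(ranked, true_tags, k)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Hypothetical post: tags ranked by predicted score, and its true tag set.
ranked = ["pandas", "python", "dataframe", "numpy", "csv"]
true_tags = {"pandas", "dataframe", "excel"}

print(precision_at_k(ranked, true_tags, 1))
print(recall_at_k(ranked, true_tags, 5))
print(f1_at_k(ranked, true_tags, 1))
```

Scores like those quoted in the abstract would typically be averages of these per-post values over the test set; whether the thesis averages per post or per label is not stated here.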

Other Abstract (Other language abstract of ETD)

Tag classification is important in Stack Overflow. Besides helping users search for information, it also helps surface relevant solutions more efficiently. Because the question in a post can carry multiple tags, tag classification in Stack Overflow is challenging, and this results in an imbalance between individual tags and the full set of tags. We therefore experimented with a pre-trained deep learning model and a small dataset to improve the accuracy of tag classification, or tag prediction, using resampling techniques designed specifically for multi-label classification. In general, machine learning techniques alone can also address this problem, but only a few studies have tested which resampling technique improves the performance of deep models built on pre-trained models for tag prediction. To address this limitation, we ran experiments to evaluate the performance of ELECTRA, a powerful pre-trained deep learning model, supplemented with multi-label resampling techniques to reduce the data imbalance that causes mislabeling in Stack Overflow posts. We compared six resampling techniques, namely ML-ROS, MLSMOTE, MLeNN, MLTL, ML-SOL, and REMEDIAL, to find the best way to reduce the data imbalance and improve tag prediction accuracy. Our results show that MLTL is the most effective choice for handling imbalance in multi-label classification of Stack Overflow data with deep learning. MLTL achieved 0.517, 0.804, 0.467, and 0.98 on Precision@1, Recall@5, F1-score@1, and AUC, respectively, whereas MLeNN achieved only 0.323, 0.648, 0.277, and 0.95 on the same metrics.

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Recommended Citation

Umparat, Arisa, "A comparison of imbalanced data handling methods for pre-trained model in multi-label classification of stack overflow" (2022). Chulalongkorn University Theses and Dissertations (Chula ETD). 6049.
https://digital.car.chula.ac.th/chulaetd/6049

Included in

Statistics and Probability Commons
