Natural language processing application in Thai criminal law analysis: offence against property
Source: DataCite Commons. Updated: 2025-09-07. Indexed: 2026-05-04.
Download link:
http://doi.nrct.go.th/?page=resolve_doi&resolve_doi=10.14457/TU.the.2024.581
Abstract:
Legal analysis using natural language processing (NLP) and machine learning (ML) is a difficult undertaking that has recently attracted interest in both academia and industry. In this thesis, we first conduct a preliminary experiment on the model architecture and pipeline for legal text processing by performing binary classification on textual legal data, specifically case fact descriptions. Previous attempts at this classification task mostly employ traditional methods, such as static embeddings combined with machine learning or deep learning models. Although such approaches achieve notable performance, they have yet to reach the state of the art. With the significant leap in natural language processing brought by transformer-based models, and their consistent improvement since, we see an opportunity to apply this development to the legal classification task. Our proposed modeling pipeline, WangchanBERTa-SFT-sc, outperforms the baseline that uses fastText embeddings with conventional machine learning models on the binary classification of whether or not a case is related to a property offence, reaching a test accuracy and F1-score of 94.5%. This finding highlights the importance of contextual comprehension for text classification in a specialized and highly sophisticated setting such as the legal domain. Furthermore, we explore the potential of natural language processing for accurate legal case analysis under Thai law, focusing on property-related offences, and identify a suitable algorithm for the task. Using a human-annotated dataset of Supreme Court decisions summarized in colloquial Thai, this work investigates different combinations of NLP, ML, and rule-based techniques for accurate legal case analysis.
Intending to design a computational pipeline that mimics a lawyer’s cognitive process, we construct the pipeline around two major tasks, binary and multi-label classification, each evaluated with five-fold cross-validation. Binary classification separates property-related cases from other categories, while the multi-label task associates each case with specific offences: theft (Section 334), snatching (Section 336), robbery (Section 339), and gang-robbery (Section 340). For the binary task, the average accuracy reaches 94.2% for both vanilla fastText and the fine-tuned WangchanBERTa with the MLP classifier, with a slightly higher average F1-score of 96.7% for vanilla fastText. For multi-label classification, the fine-tuned joint embedding classification pipeline with rule-based post-processing obtains 82% average zero-one accuracy and 92% average Hamming accuracy, an improvement over the same pipeline without the rule-based step. This highlights the possibility of integrating symbolic information from a rule-based algorithm with statistical computation from machine learning techniques when performing a complex legal analysis task.
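The two multi-label metrics quoted above can be made concrete with a small sketch. It assumes the usual definitions: zero-one accuracy counts a case as correct only when all four section labels match, while Hamming accuracy (one minus the Hamming loss) scores each label decision separately. The label order, function names, and toy predictions are invented for illustration and are not taken from the thesis.

```python
from typing import Sequence

# Assumed label order for this sketch: Sections 334, 336, 339, 340.
LABELS = ["334", "336", "339", "340"]


def zero_one_accuracy(y_true: Sequence[Sequence[int]],
                      y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of cases whose whole label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


def hamming_accuracy(y_true: Sequence[Sequence[int]],
                     y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual label decisions that are correct."""
    correct = sum(int(a == b)
                  for t, p in zip(y_true, y_pred)
                  for a, b in zip(t, p))
    total = sum(len(t) for t in y_true)
    return correct / total


# Toy data (hypothetical): four cases, one row per case, one column per label.
y_true = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]]

print(zero_one_accuracy(y_true, y_pred))  # 2 of 4 cases match exactly
print(hamming_accuracy(y_true, y_pred))   # 14 of 16 label decisions match
```

Because a single wrong label fails the whole case under the zero-one metric, it is never higher than Hamming accuracy, which is consistent with the 82% versus 92% gap reported above.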
Provider:
Thammasat University
Created:
2025-09-07



