
Automatic Corpus Query Generation Method Based on Large Language Model

Source: China Scientific Data (中国科学数据). Updated: 2026-02-09. Indexed: 2026-04-25.
Download link:
https://www.sciengine.com/AA/doi/10.19678/j.issn.1000-3428.0070118
Resource description:
Corpus Query Language (CQL) is a specialized language for searching and analyzing linguistic corpora. Automating the conversion of natural language queries into CQL statements significantly lowers the entry barrier for corpus users. Although Large Language Models (LLMs) excel at many natural language generation tasks, their performance in generating CQL statements remains suboptimal. To address this, the paper proposes T2CQL, a method for automatic corpus query generation based on in-context learning in LLMs. First, the method distills CQL writing rules into a comprehensive yet concise set of Text-to-CQL grammar knowledge standards, which gives the LLM a basis for automatic Text-to-CQL conversion and compensates for its lack of domain-specific knowledge. Next, an embedding model selects the top k Text-CQL sample pairs most relevant to the current natural language query; these samples serve as reference points and help the LLM understand the grammar rules. Finally, a calibration strategy mitigates biases in the LLM's CQL generation, further improving its performance. The method is evaluated with multiple LLMs on a test set of 1,177 samples. The results show that T2CQL significantly improves LLM performance on Text-to-CQL conversion, reaching a best Execution Accuracy (EX) of 85.13%.
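The first two steps described above (grammar knowledge plus top-k sample retrieval) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the T2CQL implementation: the sample pairs, the CQP-style query syntax, the function names (`embed`, `cosine`, `top_k`, `build_prompt`), and the prompt layout are all hypothetical, and a toy bag-of-words vector stands in for the embedding model. The calibration step is omitted because the description does not specify how it works.

```python
import math
import re
from collections import Counter

# Hypothetical Text-CQL sample pairs, written in a CQP-style syntax.
# The actual sample pool and CQL dialect are not given in the description.
SAMPLES = [
    ("find the lemma run used as a verb", '[lemma="run" & tag="V.*"]'),
    ("find an adjective followed by the noun dog", '[tag="JJ.*"] [lemma="dog"]'),
    ("find the word the followed by any noun", '[word="the"] [tag="N.*"]'),
]


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, samples, k):
    """Return the k sample pairs whose text is most similar to the query."""
    q = embed(query)
    ranked = sorted(samples, key=lambda s: cosine(q, embed(s[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, grammar_rules, samples, k=2):
    """Assemble an in-context prompt: grammar rules, k examples, then the query."""
    lines = ["Grammar rules:", grammar_rules, "", "Examples:"]
    for text, cql in top_k(query, samples, k):
        lines.append(f"Text: {text}")
        lines.append(f"CQL: {cql}")
    lines += ["", f"Text: {query}", "CQL:"]
    return "\n".join(lines)
```

Calling `build_prompt("find the lemma walk used as a verb", "<grammar rules here>", SAMPLES)` returns a prompt string that could be sent to an LLM. In practice the toy `embed` would be replaced by a real embedding model, which is the main design choice this sketch leaves open.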
Created:
2026-02-09