Automatic Corpus Query Generation Method Based on Large Language Model

Source: China Scientific Data. Updated: 2026-02-09. Indexed: 2026-04-25.

Download link:

https://www.sciengine.com/AA/doi/10.19678/j.issn.1000-3428.0070118


Resource description:

Corpus Query Language (CQL) is a specialized query language for searching and analyzing linguistic corpora. Automatically converting natural language queries into CQL statements significantly lowers the entry barrier for corpus users. Although Large Language Models (LLMs) excel at many natural language generation tasks, their performance in generating CQL statements has been suboptimal. To address this issue, a method for automatic corpus query generation based on in-context learning in LLMs, called T2CQL, is proposed. First, the method distills CQL writing rules into a comprehensive yet concise set of Text-to-CQL grammar knowledge standards. These standards serve as the basis on which the LLMs perform automatic Text-to-CQL conversion, compensating for their lack of domain-specific knowledge. Next, an embedding model selects the top k Text-CQL sample pairs most relevant to the current natural language query. These samples serve as reference points and help the LLMs understand the grammar rules. Finally, a calibration strategy is applied to mitigate biases in the LLMs' CQL generation, further improving performance. The proposed method is evaluated with multiple LLMs on a test set of 1,177 samples. The results show that T2CQL significantly improves the performance of LLMs on Text-to-CQL conversion, achieving a best Execution Accuracy (EX) of 85.13%.
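The sample-selection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the names `top_k_examples` and `build_prompt` are hypothetical, and the CQL strings in `EXAMPLE_POOL` are illustrative CQP-style queries rather than samples from the paper. The calibration step is omitted because the abstract does not describe it.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in (assumption) for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_examples(query, pool, k=2):
    """Return the k (text, cql) pairs whose text is most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda pair: cosine(q, embed(pair[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, grammar_notes, examples):
    """Assemble a few-shot prompt: grammar notes, retrieved pairs, then the query."""
    lines = ["Grammar notes:", grammar_notes, "", "Examples:"]
    for text, cql in examples:
        lines.append(f"Text: {text}")
        lines.append(f"CQL: {cql}")
    lines += ["", f"Text: {query}", "CQL:"]
    return "\n".join(lines)


# Hypothetical sample pool (illustrative, not taken from the paper).
EXAMPLE_POOL = [
    ("find the word dog", '[word="dog"]'),
    ("find all nouns", '[pos="N.*"]'),
    ("find an adjective followed by a noun", '[pos="JJ.*"] [pos="N.*"]'),
]

if __name__ == "__main__":
    user_query = "find the word cat"
    picked = top_k_examples(user_query, EXAMPLE_POOL, k=2)
    print(build_prompt(user_query, "A token is written as [attribute=\"regex\"].", picked))
```

In practice the toy `embed` would be replaced by calls to an actual embedding model, and the assembled prompt would be sent to the LLM; the ranking and prompt-assembly logic would stay the same.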

Created:

2026-02-09
