DeepSeek-QueryBench: A Dataset for Evaluating the Performance and Stability of LLM-Generated Boolean Queries
Source: Mendeley Data. Indexed: 2026-04-18
Download link:
https://data.mendeley.com/datasets/7rrvctn3pj
Description:
The "DeepSeek-QueryBench" dataset provides the first comprehensive empirical data for evaluating the performance and stability of open-source Large Language Models (LLMs) in Boolean query generation for scholarly search. This dataset captures the complete workflow from query generation to retrieval evaluation, specifically designed to assess LLM capabilities under novice user conditions.
Core Components:
1. Original Model Outputs: Complete interaction records from DeepSeek-V3.1-Terminus across four operational modes (Default, Deep Thinking, Web Search, and their combination), with three independent generations per mode using a fixed simple Chinese prompt.
2. Generated Boolean Queries: Both original and syntactically corrected versions of 12 distinct Boolean queries targeting the interdisciplinary topic "3D printing in STEM education," formatted for Web of Science execution.
3. Retrieval Results: Complete bibliographic records (title, abstract, keywords, publication details) for all documents retrieved by each query execution in Web of Science Core Collection (2022-2024, article type), totaling 1,615 documents before deduplication.
4. Gold Standard Collection: A rigorously constructed benchmark of 172 relevant publications on "3D printing in STEM education," developed through baseline keyword retrieval and exhaustive forward/backward snowballing until saturation.
5. Performance Metrics: Comprehensive evaluation data including standard information retrieval metrics (Precision, Recall, F1-score, F3-score) and novel stability measures (Coefficient of Variation, Jaccard Similarity, Integration Change Rate) for each query and operational mode.
6. Analysis Materials: Supporting data for in-depth analysis including keyword frequency distributions, query structure categorization, semantic error patterns, and complementarity analysis between different query generations.
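The retrieval and stability metrics named in item 5 can be sketched as follows. This is a minimal illustration using the standard textbook definitions, not the dataset authors' code: the description does not give their exact formulas, so the choice of population standard deviation for the Coefficient of Variation is an assumption, and the Integration Change Rate is omitted because it is not defined here. The document IDs and function names are hypothetical.

```python
from statistics import mean, pstdev


def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision and recall of one query's result set against a gold standard."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives F1, beta=3 gives the recall-weighted F3."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between the result sets of two generations."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def coefficient_of_variation(values: list[float]) -> float:
    """Standard deviation divided by mean (population std is an assumption)."""
    m = mean(values)
    return pstdev(values) / m if m else float("nan")


# Hypothetical toy data: a 4-document gold standard and two query runs.
gold = {"d1", "d2", "d3", "d4"}
run_a = {"d1", "d2", "d5"}
run_b = {"d1", "d2", "d3", "d6"}

p_a, r_a = precision_recall(run_a, gold)
p_b, r_b = precision_recall(run_b, gold)
f1_a = f_beta(p_a, r_a, beta=1.0)
f3_a = f_beta(p_a, r_a, beta=3.0)
overlap = jaccard(run_a, run_b)
recall_cv = coefficient_of_variation([r_a, r_b])
```

F3 weights recall nine times as heavily as precision (beta squared), which suits evidence-synthesis searches where missing a relevant paper costs more than screening an irrelevant one. Jaccard similarity and the Coefficient of Variation are computed across repeated generations of the same mode, so they describe stability rather than accuracy.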
Unique Value Proposition:
This dataset addresses critical gaps in current LLM evaluation by focusing on:
Stability and reproducibility rather than just peak performance
Novice user scenarios with simple prompts and default configurations
Open-source model capabilities beyond the dominant GPT ecosystem
Real-world applicability through rigorous gold standard validation
The dataset supports research in AI-assisted information retrieval, evidence synthesis automation, LLM reliability assessment, and human-AI collaboration in scholarly search.
Created:
2025-11-24



