
DeepSeek-QueryBench: A Dataset for Evaluating the Performance and Stability of LLM-Generated Boolean Queries

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/7rrvctn3pj
Description:
The "DeepSeek-QueryBench" dataset provides the first comprehensive empirical data for evaluating the performance and stability of open-source Large Language Models (LLMs) in Boolean query generation for scholarly search. This dataset captures the complete workflow from query generation to retrieval evaluation, specifically designed to assess LLM capabilities under novice user conditions.

Core Components:
1. Original Model Outputs: Complete interaction records from DeepSeek-V3.1-Terminus across four operational modes (Default, Deep Thinking, Web Search, and their combination), with three independent generations per mode using a fixed simple Chinese prompt.
2. Generated Boolean Queries: Both original and syntactically corrected versions of 12 distinct Boolean queries targeting the interdisciplinary topic "3D printing in STEM education," formatted for Web of Science execution.
3. Retrieval Results: Complete bibliographic records (title, abstract, keywords, publication details) for all documents retrieved by each query execution in Web of Science Core Collection (2022-2024, article type), totaling 1,615 documents before deduplication.
4. Gold Standard Collection: A rigorously constructed benchmark of 172 relevant publications on "3D printing in STEM education," developed through baseline keyword retrieval and exhaustive forward/backward snowballing until saturation.
5. Performance Metrics: Comprehensive evaluation data including standard information retrieval metrics (Precision, Recall, F1-score, F3-score) and novel stability measures (Coefficient of Variation, Jaccard Similarity, Integration Change Rate) for each query and operational mode.
6. Analysis Materials: Supporting data for in-depth analysis including keyword frequency distributions, query structure categorization, semantic error patterns, and complementarity analysis between different query generations.
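The metrics named under Performance Metrics can be illustrated with a minimal Python sketch. It uses the textbook definitions of Precision, Recall, F-beta (F1 with beta = 1, F3 with beta = 3), Jaccard Similarity, and Coefficient of Variation; the dataset's own computation may differ in detail. The function names, the toy document IDs, and the choice of population standard deviation are assumptions made for illustration, and Integration Change Rate is left out because its definition is not given here.

```python
from statistics import mean, pstdev


def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision and recall of a retrieved set against a gold-standard set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily (beta=3 gives F3)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two result sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def coefficient_of_variation(values: list) -> float:
    """Standard deviation divided by the mean.

    Assumption: population standard deviation; the dataset may use the
    sample standard deviation instead.
    """
    m = mean(values)
    return pstdev(values) / m if m else 0.0


# Toy example with hypothetical document IDs (not taken from the dataset).
retrieved = {"a", "b", "c", "d"}
relevant = {"a", "b", "e"}
p, r = precision_recall(retrieved, relevant)
f1 = f_beta(p, r, beta=1.0)
f3 = f_beta(p, r, beta=3.0)
overlap = jaccard({"a", "b"}, {"b", "c"})
cv = coefficient_of_variation([2.0, 4.0, 6.0])
```

In this reading, Jaccard Similarity compares the result sets of two generations from the same mode, and the Coefficient of Variation summarizes how much a metric such as recall varies across the three generations per mode.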
Unique Value Proposition: This dataset addresses critical gaps in current LLM evaluation by focusing on:
- Stability and reproducibility rather than just peak performance
- Novice user scenarios with simple prompts and default configurations
- Open-source model capabilities beyond the dominant GPT ecosystem
- Real-world applicability through rigorous gold standard validation

The dataset supports research in AI-assisted information retrieval, evidence synthesis automation, LLM reliability assessment, and human-AI collaboration in scholarly search.
Created:
2025-11-24