
jablonkagroup/corral-QAs-topic_reports

Source: Hugging Face · Updated 2026-04-22 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/jablonkagroup/corral-QAs-topic_reports
Description:
---
pretty_name: Corral – QA Topic Reports
language:
- en
license:
- mit
multilinguality:
- monolingual
source_datasets:
- original
task_categories:
- question-answering
annotations_creators:
- machine-generated
language_creators:
- expert-generated
- machine-generated
tags:
- corral
- benchmark
- llm-agents
- scientific-agents
- qa
- evaluation
- chemistry
- materials-science
- knowledge
- reasoning
- irt
- item-response-theory
dataset_version: 0.0.1
dataset_release_date: '2026-04-22'
---

# *Corral* – QA Topic Reports

<div align="center">

![Corral Logo](corral_logo_final.png)

[![Website](https://img.shields.io/badge/🌐-Website-green)](https://lamalab-org.github.io/corral/)
[![Docs](https://img.shields.io/badge/📚-Docs-blue)](https://lamalab-org.github.io/corral/docs/)
[![GitHub](https://img.shields.io/badge/💻-Code-black?logo=github)](https://github.com/lamalab-org/corral)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2604.18805)
[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/jablonkagroup/corral-QAs-topic_reports)

Averaged QA results for factual-knowledge and reasoning evaluations across all 8 Corral environments

</div>

---

## 📋 Dataset Summary

This dataset is part of the *Corral* collection accompanying the paper [*AI scientists produce results without reasoning scientifically*](https://arxiv.org/abs/2604.18805). It contains the **averaged results** of the **question-answer evaluations** used to test the **factual knowledge** and **reasoning ability** of models across all **8 Corral environments**.

The dataset is organized into **1 configuration**, `default`. Within this config, each row corresponds to a unique combination of **environment**, **model**, and **evaluation dimension**, where the dimension is either **knowledge** or **reasoning**.
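The row layout described above can be explored with a short script. The following is a minimal sketch, not the authors' tooling: it assumes the Hugging Face `datasets` package is installed, and the column names `environment`, `model`, and `dimension` are hypothetical placeholders, since the card does not list the actual column names. Inspect `column_names` on the loaded dataset and adjust `KEY_FIELDS` before relying on it.

```python
from typing import Any, Dict, Iterable, Mapping, Tuple

# Repository id, config, and split are taken from the dataset card.
REPO_ID = "jablonkagroup/corral-QAs-topic_reports"
CONFIG = "default"
SPLIT = "train"

# ASSUMPTION: hypothetical column names. The card only says that each row is
# a unique (environment, model, evaluation dimension) combination.
KEY_FIELDS: Tuple[str, ...] = ("environment", "model", "dimension")


def load_topic_reports() -> Any:
    """Load the averaged QA topic reports from the Hugging Face Hub.

    Needs network access and the `datasets` package, which is imported
    lazily so that importing this module has no side effects.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG, split=SPLIT)


def index_reports(
    rows: Iterable[Mapping[str, Any]],
    key_fields: Tuple[str, ...] = KEY_FIELDS,
) -> Dict[Tuple[Any, ...], Mapping[str, Any]]:
    """Index rows by their key fields and check that each key is unique."""
    index: Dict[Tuple[Any, ...], Mapping[str, Any]] = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in index:
            raise ValueError(f"duplicate report for key {key!r}")
        index[key] = row
    return index
```

If the column names match, `index_reports(load_topic_reports())` returns one entry per environment, model, and dimension combination, and raises `ValueError` if the uniqueness stated in the card does not hold for the chosen key fields.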
These aggregated QA reports summarize the outcomes of the items used in the **Item Response Theory (IRT)** analyses reported in the *Corral* study, where they support the latent **knowledge** and **reasoning** factors. This resource is intended for evaluation, psychometric modeling, and comparative analysis of scientific-agent capabilities rather than for general-purpose model pre-training.

### 🎯 Supported Uses

- 🧠 Evaluating averaged factual-knowledge and reasoning performance across Corral environments
- 📊 Reproducing and extending the topic-level analyses reported in the paper
- 📐 Studying latent knowledge and reasoning factors in scientific-agent benchmarks
- 🔁 Comparing aggregated QA outcomes across models and environments

---

## 🧪 About *Corral*

[*Corral*](https://lamalab-org.github.io/corral/) is a framework for the *science of agents and agents for science*. It provides a microservice architecture that **decouples agents from environments** via a client–server design (REST API), ensuring flexibility, reproducibility, and robust isolation.

- 🌍 **Environments** define the task space, available tools, and observable feedback, from chemistry labs to HPC clusters.
- 🤖 **Agents** are modular LLM-based entities supporting scaffolds such as ReAct, ToolCalling, LLMPlanner, and Reflection.
- 📝 **Tasks** define problems to solve, complete with scoring functions. Tasks can be chained into TaskGroups for complex multi-stage challenges.

*Corral* currently ships **8 environments**, **97 tools**, **115 tasks**, and **786 subtasks** spanning chemistry, physics, and materials science.

### 🌍 Environments

| Environment | Description | 🔧 Tools | 📝 Tasks/scope | 🔭 Scopes | ⏱️ Avg. trace length |
|---|---|:---:|:---:|:---:|:---:|
| 🧫 **Inorganic Qualitative Analysis** | Identify unknown cations in solution through systematic wet-lab procedures (reagent addition, flame tests, pH measurement, centrifugation, etc.). Observations are computed from thermodynamic data. Three scopes progressively increase the number of candidate ions. | 14 | 10 | 3 | 39.4 |
| ⚡ **Circuit Inference** | Recover the topology and component values of a hidden resistor network from pairwise resistance measurements. Tools provide series/parallel calculations, delta-wye transforms, and circuit validation. | 9 | 6 | 1 | 15.0 |
| 🔭 **Spectroscopic Structure Elucidation** | Determine the molecular structure of an unknown compound by requesting and interpreting spectroscopic data (MS, NMR, HSQC, IR) alongside reference databases for chemical shifts and isotope distributions. | 16 | 20 | 2 | 15.1 |
| 🧬 **Retrosynthetic Planning** | Design multi-step synthetic routes to target molecules under cost, step-count, and commercial-availability constraints, using a template catalogue and functional-group detection tools. | 15 | 8 | 3 | 25.5 |
| 🤖 **ML-based Property Prediction** | Assemble a complete ML pipeline to predict formation energies of material polymorphs using data from the Materials Project, covering feature engineering, XGBoost training, and cross-validation. | 14 | 3 | 1 | 16.6 |
| 🔬 **AFM Experiment Execution** | Analyze and interpret atomic force microscopy data for nanoscale surface characterization, including topographical and mechanical property measurements. | 6 | 1 | 4 | 26.3 |
| ⚛️ **Molecular Simulation** | Design and execute molecular dynamics simulations with LAMMPS to predict materials properties, covering the full workflow from crystal structure retrieval to force-field queries and log analysis. | 8 | 2–3 | 2 | 30.4 |
| 🏗️ **Adsorption Surface Construction** | Build adsorbate–slab configurations from bulk crystal structures for heterogeneous catalysis studies, integrating Materials Project retrieval, slab generation, and adsorption-site enumeration. | 15 | 3 | 1 | 19.6 |

---

## 🗂️ Dataset Structure

### Configs

Only the `default` config is available. It includes the averaged QA topic reports across all environments, evaluated models, and both evaluation dimensions.

### Data Splits

The `default` config exposes a single `train` split.

### Data Instances

Each row corresponds to one **averaged QA topic report** for a specific combination of Corral environment, model, and evaluation dimension (`knowledge` or `reasoning`). Rows summarize aggregated outcomes over the corresponding QA set rather than individual question-answer items.

---

## 🏗️ Dataset Creation

### Curation Rationale

This dataset was created as part of *Corral* to summarize scientific-agent capabilities beyond end-task success, separating **factual knowledge** from **reasoning ability** through targeted QA evaluations whose results can be analyzed with IRT.

### Source Data

The underlying QAs were derived from the task content, domain knowledge, and reasoning demands of the *Corral* benchmark environments. They were constructed to probe environment-specific factual understanding and multi-step reasoning, and were then used in IRT modeling as indicators of the latent knowledge and reasoning factors. This dataset contains the averaged results of those QA evaluations grouped by environment, model, and evaluation dimension.

---

## 🔗 Relation to Other Corral Artifacts

This dataset is one component of the broader *Corral* release and is best interpreted together with the matching task definitions, execution traces, reports, aggregate results, and reasoning annotations available in the [*Corral* collection](https://huggingface.co/collections/jablonkagroup/corral).

---

## 📄 Citation

```bibtex
@article{ríos-garcía2026ai,
  title   = {AI scientists produce results without reasoning scientifically},
  author  = {Martiño Ríos-García and Nawaf Alampara and Chandan Gupta and Indrajeet Mandal and Sajid Mannan and Ali Asghar Aghajani and N. M. Anoop Krishnan and Kevin Maik Jablonka},
  year    = {2026},
  journal = {arXiv preprint arXiv:2604.18805}
}
```

## 📜 License

This dataset is released under the [MIT License](https://opensource.org/licenses/MIT).

## Changelog

### 2026-04-22

- Initial release of the dataset card.
Provider:
jablonkagroup