jablonkagroup/corral-QAs-topic_reports
Source: Hugging Face. Updated: 2026-04-22. Indexed: 2026-05-10.
Download link: https://hf-mirror.com/datasets/jablonkagroup/corral-QAs-topic_reports
---
pretty_name: Corral – QA Topic Reports
language:
- en
license:
- mit
multilinguality:
- monolingual
source_datasets:
- original
task_categories:
- question-answering
annotations_creators:
- machine-generated
language_creators:
- expert-generated
- machine-generated
tags:
- corral
- benchmark
- llm-agents
- scientific-agents
- qa
- evaluation
- chemistry
- materials-science
- knowledge
- reasoning
- irt
- item-response-theory
dataset_version: 0.0.1
dataset_release_date: '2026-04-22'
---
# *Corral* – QA Topic Reports
<div align="center">

[Website](https://lamalab-org.github.io/corral/) |
[Documentation](https://lamalab-org.github.io/corral/docs/) |
[GitHub](https://github.com/lamalab-org/corral) |
[License: MIT](https://opensource.org/licenses/MIT) |
[arXiv:2604.18805](https://arxiv.org/abs/2604.18805) |
[Dataset](https://huggingface.co/datasets/jablonkagroup/corral-QAs-topic_reports)

Averaged QA results for factual-knowledge and reasoning evaluations across all 8 Corral environments
</div>
---
## 📋 Dataset Summary
This dataset is part of the *Corral* collection accompanying the paper [*AI scientists produce results without reasoning scientifically*](https://arxiv.org/abs/2604.18805). It contains the **averaged results** of the **question-answer evaluations** used to test the **factual knowledge** and **reasoning ability** of models across all **8 Corral environments**.
The dataset is organized into **1 configuration**, `default`. Within this config, each row corresponds to a unique combination of **environment**, **model**, and **evaluation dimension**, where the dimension is either **knowledge** or **reasoning**.
These aggregated QA reports summarize the outcomes of the items used in the **Item Response Theory (IRT)** analyses reported in the *Corral* study, where they support the latent **knowledge** and **reasoning** factors. This resource is intended for evaluation, psychometric modeling, and comparative analysis of scientific-agent capabilities rather than for general-purpose model pre-training.
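
For readers unfamiliar with IRT, the sketch below shows the response function of a two-parameter logistic (2PL) model, one common IRT formulation. This card does not state which IRT variant the paper uses, so the model choice and the names `theta`, `discrimination`, and `difficulty` are assumptions made purely for illustration.

```python
import math


def p_correct(theta: float, discrimination: float, difficulty: float) -> float:
    """Probability of a correct answer under a 2PL IRT model (assumed variant).

    theta          -- latent ability of the model being evaluated
    discrimination -- how sharply the item separates low from high ability
    difficulty     -- ability level at which the probability is 0.5
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# When ability equals item difficulty, the probability is exactly one half.
print(p_correct(theta=0.0, discrimination=1.5, difficulty=0.0))
```

In a formulation like this, per-item outcomes act as indicators of a latent factor such as knowledge or reasoning, which is the role the QA items play in the *Corral* study.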
### 🎯 Supported Uses
- 🧠 Evaluating averaged factual-knowledge and reasoning performance across Corral environments
- 📊 Reproducing and extending the topic-level analyses reported in the paper
- 📐 Studying latent knowledge and reasoning factors in scientific-agent benchmarks
- 🔁 Comparing aggregated QA outcomes across models and environments
---
## 🧪 About *Corral*
[*Corral*](https://lamalab-org.github.io/corral/) is a framework for the *science of agents and agents for science*. It provides a microservice architecture that **decouples agents from environments** via a client–server design (REST API), ensuring flexibility, reproducibility, and robust isolation.
- 🌍 **Environments** define the task space, available tools, and observable feedback — from chemistry labs to HPC clusters.
- 🤖 **Agents** are modular LLM-based entities supporting scaffolds such as ReAct, ToolCalling, LLMPlanner, and Reflection.
- 📝 **Tasks** define problems to solve, complete with scoring functions. Tasks can be chained into TaskGroups for complex multi-stage challenges.
*Corral* currently ships **8 environments**, **97 tools**, **115 tasks**, and **786 subtasks** spanning chemistry, physics, and materials science.
### 🌍 Environments
| Environment | Description | 🔧 Tools | 📝 Tasks/scope | 🔭 Scopes | ⏱️ Avg. trace length |
|---|---|:---:|:---:|:---:|:---:|
| 🧫 **Inorganic Qualitative Analysis** | Identify unknown cations in solution through systematic wet-lab procedures (reagent addition, flame tests, pH measurement, centrifugation, etc.). Observations are computed from thermodynamic data. Three scopes progressively increase the number of candidate ions. | 14 | 10 | 3 | 39.4 |
| ⚡ **Circuit Inference** | Recover the topology and component values of a hidden resistor network from pairwise resistance measurements. Tools provide series/parallel calculations, delta-wye transforms, and circuit validation. | 9 | 6 | 1 | 15.0 |
| 🔭 **Spectroscopic Structure Elucidation** | Determine the molecular structure of an unknown compound by requesting and interpreting spectroscopic data (MS, NMR, HSQC, IR) alongside reference databases for chemical shifts and isotope distributions. | 16 | 20 | 2 | 15.1 |
| 🧬 **Retrosynthetic Planning** | Design multi-step synthetic routes to target molecules under cost, step-count, and commercial-availability constraints, using a template catalogue and functional-group detection tools. | 15 | 8 | 3 | 25.5 |
| 🤖 **ML-based Property Prediction** | Assemble a complete ML pipeline to predict formation energies of material polymorphs using data from the Materials Project, covering feature engineering, XGBoost training, and cross-validation. | 14 | 3 | 1 | 16.6 |
| 🔬 **AFM Experiment Execution** | Analyze and interpret atomic force microscopy data for nanoscale surface characterization, including topographical and mechanical property measurements. | 6 | 1 | 4 | 26.3 |
| ⚛️ **Molecular Simulation** | Design and execute molecular dynamics simulations with LAMMPS to predict materials properties, covering the full workflow from crystal structure retrieval to force-field queries and log analysis. | 8 | 2–3 | 2 | 30.4 |
| 🏗️ **Adsorption Surface Construction** | Build adsorbate–slab configurations from bulk crystal structures for heterogeneous catalysis studies, integrating Materials Project retrieval, slab generation, and adsorption-site enumeration. | 15 | 3 | 1 | 19.6 |
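
The headline totals can be cross-checked against the table with a few lines of arithmetic. The sketch below copies the tool, tasks-per-scope, and scope counts from the table. It assumes that tools are not shared between environments and that an environment's task count equals tasks per scope times scopes; neither assumption is stated in this card.

```python
# Values copied from the environment table:
# (tools, (min tasks per scope, max tasks per scope), scopes)
ENVIRONMENTS = {
    "Inorganic Qualitative Analysis": (14, (10, 10), 3),
    "Circuit Inference": (9, (6, 6), 1),
    "Spectroscopic Structure Elucidation": (16, (20, 20), 2),
    "Retrosynthetic Planning": (15, (8, 8), 3),
    "ML-based Property Prediction": (14, (3, 3), 1),
    "AFM Experiment Execution": (6, (1, 1), 4),
    "Molecular Simulation": (8, (2, 3), 2),  # the table lists 2–3 tasks per scope
    "Adsorption Surface Construction": (15, (3, 3), 1),
}

# Assumption: tools are counted once per environment, with no sharing.
total_tools = sum(tools for tools, _, _ in ENVIRONMENTS.values())

# Assumption: tasks in an environment = tasks per scope * number of scopes.
min_tasks = sum(lo * scopes for _, (lo, _), scopes in ENVIRONMENTS.values())
max_tasks = sum(hi * scopes for _, (_, hi), scopes in ENVIRONMENTS.values())

print(len(ENVIRONMENTS), total_tools, min_tasks, max_tasks)
```

Under these assumptions the tool counts sum to 97, matching the stated total, and the task count falls between 114 and 116, which brackets the stated 115.
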
---
## 🗂️ Dataset Structure
### Configs
Only the `default` config is available. It contains the averaged QA topic reports across all environments, evaluated models, and both evaluation dimensions.
### Data Splits
The `default` config exposes a single `train` split.
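
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable. The repository ID, config name, and split name come from this card; the helper name `load_topic_reports` is hypothetical.

```python
REPO_ID = "jablonkagroup/corral-QAs-topic_reports"
CONFIG = "default"
SPLIT = "train"


def load_topic_reports():
    """Download the averaged QA topic reports from the Hugging Face Hub."""
    # Imported lazily so that defining this helper needs neither the
    # `datasets` package nor network access.
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG, split=SPLIT)
```

Calling `load_topic_reports()` should return a `datasets.Dataset` whose `column_names` attribute lists the available fields, which is the safest way to discover the schema because this card does not enumerate the columns.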
### Data Instances
Each row corresponds to one **averaged QA topic report** for a specific combination of Corral environment, model, and evaluation dimension (`knowledge` or `reasoning`). Rows summarize aggregated outcomes over the corresponding QA set rather than individual question-answer items.
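
Because each row is keyed by environment, model, and evaluation dimension, a common first step is to place the knowledge and reasoning values side by side for every environment and model pair. The sketch below does this on synthetic placeholder rows. The column names `environment`, `model`, `dimension`, and `score`, as well as all values, are assumptions for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Synthetic placeholder rows: column names and values are assumptions,
# not real results from the dataset.
rows = [
    {"environment": "env-a", "model": "model-x", "dimension": "knowledge", "score": 0.8},
    {"environment": "env-a", "model": "model-x", "dimension": "reasoning", "score": 0.5},
    {"environment": "env-b", "model": "model-x", "dimension": "knowledge", "score": 0.6},
]


def pivot_by_dimension(rows):
    """Map (environment, model) -> {dimension: score}.

    Raises ValueError if an (environment, model, dimension) key repeats,
    since each combination is expected to appear only once.
    """
    table = defaultdict(dict)
    for row in rows:
        key = (row["environment"], row["model"])
        dimension = row["dimension"]
        if dimension in table[key]:
            raise ValueError(f"duplicate row for {key + (dimension,)}")
        table[key][dimension] = row["score"]
    return dict(table)


pivot = pivot_by_dimension(rows)
print(pivot[("env-a", "model-x")])
```

The duplicate check doubles as a sanity test of the one-row-per-combination structure described above. Once the real column names are known, the same function can be applied to the loaded rows.
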

---
## 🏗️ Dataset Creation
### Curation Rationale
This dataset was created as part of *Corral* to summarize scientific-agent capabilities beyond end-task success, separating **factual knowledge** from **reasoning ability** through targeted QA evaluations whose results can be analyzed with IRT.
### Source Data
The underlying QAs were derived from the task content, domain knowledge, and reasoning demands of the *Corral* benchmark environments. They were constructed to probe environment-specific factual understanding and multi-step reasoning, and were then used in IRT modeling as indicators of the latent knowledge and reasoning factors. This dataset contains the averaged results of those QA evaluations grouped by environment, model, and evaluation dimension.

---
## 🔗 Relation to Other Corral Artifacts
This dataset is one component of the broader *Corral* release and is best interpreted together with the matching task definitions, execution traces, reports, aggregate results, and reasoning annotations available in the [*Corral* collection](https://huggingface.co/collections/jablonkagroup/corral).

---
## 📄 Citation
```bibtex
@article{ríos-garcía2026ai,
title = {AI scientists produce results without reasoning scientifically},
author = {Martiño Ríos-García and Nawaf Alampara and Chandan Gupta and Indrajeet Mandal and Sajid Mannan and Ali Asghar Aghajani and N. M. Anoop Krishnan and Kevin Maik Jablonka},
year = {2026},
journal = {arXiv preprint arXiv:2604.18805}
}
```
## 📜 License
This dataset is released under the [MIT License](https://opensource.org/licenses/MIT).
## Changelog
### 2026-04-22
- Initial release of the dataset card.
Provided by: jablonkagroup



