jablonkagroup/corral-reasoning-annotations
---
pretty_name: Corral – Reasoning Annotations
language:
- en
license:
- mit
multilinguality:
- monolingual
source_datasets:
- original
task_categories:
- text-classification
annotations_creators:
- machine-generated
language_creators:
- expert-generated
- machine-generated
tags:
- corral
- benchmark
- llm-agents
- scientific-agents
- traces
- annotations
- llm-annotation
- evaluation
- chemistry
- materials-science
- knowledge
- reasoning
- reasoning-annotations
- epistemic-patterns
- process-evaluation
dataset_version: 0.0.1
dataset_release_date: '2026-04-22'
---
# *Corral* – Reasoning Annotations
<div align="center">

[Website](https://lamalab-org.github.io/corral/)
[Documentation](https://lamalab-org.github.io/corral/docs/)
[GitHub](https://github.com/lamalab-org/corral)
[License: MIT](https://opensource.org/licenses/MIT)
[arXiv:2604.18805](https://arxiv.org/abs/2604.18805)
[Hugging Face dataset](https://huggingface.co/datasets/jablonkagroup/corral-reasoning-annotations)
LLM epistemic annotations over Corral traces where the annotator judged that the agents do not reason scientifically
</div>
---
## 📋 Dataset Summary
This dataset is part of the *Corral* collection accompanying the paper [*AI scientists produce results without reasoning scientifically*](https://arxiv.org/abs/2604.18805). It contains **annotated evaluation traces** with **LLM-generated epistemic annotations** across the *Corral* benchmark.
The dataset is organized into **3 configurations**, **one per model** (the model that produced the traces). Within each configuration, **each row corresponds to one annotated trace file** from that model.
The included annotations were produced by an **LLM annotator** that identified these cases as ones where the agents **do not reason scientifically**. This resource is intended for auditing, qualitative analysis, and process-level study of scientific-agent behaviour rather than for general-purpose model pre-training.
### 🎯 Supported Uses
- 🧠 Studying LLM-generated epistemic annotations over scientific-agent traces
- 📊 Auditing cases where an automatic annotator flagged non-scientific reasoning
- 📐 Comparing annotated reasoning failures across models and trace files
- 🔁 Building qualitative analysis sets for reasoning-process studies in scientific agents
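
For the cross-model comparison use case, a minimal sketch might tally rows per config and find the columns that every config shares. It assumes each config has already been loaded into a list of row dictionaries (for example with the Hugging Face `datasets` library). The helper name `summarize_configs`, the config names, and the toy column names are hypothetical, since this card does not document the schema.

```python
from typing import Any, Dict, List, Mapping


def summarize_configs(tables: Mapping[str, List[Dict[str, Any]]]) -> Dict[str, Any]:
    """Return per-config row counts and the columns present in every config."""
    row_counts = {name: len(rows) for name, rows in tables.items()}
    # Collect the union of keys within each non-empty config.
    column_sets = [
        set().union(*(row.keys() for row in rows))
        for rows in tables.values()
        if rows
    ]
    shared = sorted(set.intersection(*column_sets)) if column_sets else []
    return {"row_counts": row_counts, "shared_columns": shared}


# Toy stand-in data; real rows would come from the dataset's three configs.
toy = {
    "model_a": [{"trace": "placeholder", "annotation": "placeholder"}],
    "model_b": [
        {"trace": "placeholder", "annotation": "placeholder"},
        {"trace": "placeholder", "annotation": "placeholder"},
    ],
}
summary = summarize_configs(toy)
print(summary)
```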
---
## 🧪 About *Corral*
[*Corral*](https://lamalab-org.github.io/corral/) is a framework for the *science of agents and agents for science*. It provides a microservice architecture that **decouples agents from environments** via a client–server design (REST API), ensuring flexibility, reproducibility, and robust isolation.
- 🌍 **Environments** define the task space, available tools, and observable feedback — from chemistry labs to HPC clusters.
- 🤖 **Agents** are modular LLM-based entities supporting scaffolds such as ReAct, ToolCalling, LLMPlanner, and Reflection.
- 📝 **Tasks** define problems to solve, complete with scoring functions. Tasks can be chained into TaskGroups for complex multi-stage challenges.
*Corral* currently ships **8 environments**, **97 tools**, **115 tasks**, and **786 subtasks** spanning chemistry, physics, and materials science.
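
The decoupling idea can be illustrated with a toy sketch. This is not *Corral*'s actual API: every class, method, and tool name below is hypothetical, and an in-process object stands in for what would be a REST server in the real framework. The point is only that the agent depends on an interface (tool listing and tool calls), never on the environment's internals, and that a task carries its own scoring function.

```python
from typing import Callable, Dict, List, Protocol


class Environment(Protocol):
    """What an agent is allowed to see: tool names and tool calls."""

    def list_tools(self) -> List[str]: ...

    def call_tool(self, name: str, argument: float) -> float: ...


class ToyLabEnvironment:
    """In-process stand-in; a real deployment would sit behind a REST API."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[float], float]] = {
            "double": lambda x: 2 * x,
            "negate": lambda x: -x,
        }

    def list_tools(self) -> List[str]:
        return sorted(self._tools)

    def call_tool(self, name: str, argument: float) -> float:
        return self._tools[name](argument)


def run_agent(env: Environment, start: float) -> float:
    """Trivial 'agent': applies every available tool once, in name order."""
    value = start
    for tool in env.list_tools():
        value = env.call_tool(tool, value)
    return value


def score(answer: float, target: float) -> float:
    """Toy task scoring function: 1.0 for an exact match, else 0.0."""
    return 1.0 if answer == target else 0.0


result = run_agent(ToyLabEnvironment(), 3.0)
print(result, score(result, -6.0))
```

Because `run_agent` only relies on the `Environment` protocol, the toy environment could be swapped for an HTTP client exposing the same two methods without touching the agent code.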
### 🌍 Environments
| Environment | Description | 🔧 Tools | 📝 Tasks/scope | 🔭 Scopes | ⏱️ Avg. trace length |
|---|---|:---:|:---:|:---:|:---:|
| 🧫 **Inorganic Qualitative Analysis** | Identify unknown cations in solution through systematic wet-lab procedures (reagent addition, flame tests, pH measurement, centrifugation, etc.). Observations are computed from thermodynamic data. Three scopes progressively increase the number of candidate ions. | 14 | 10 | 3 | 39.4 |
| ⚡ **Circuit Inference** | Recover the topology and component values of a hidden resistor network from pairwise resistance measurements. Tools provide series/parallel calculations, delta-wye transforms, and circuit validation. | 9 | 6 | 1 | 15.0 |
| 🔭 **Spectroscopic Structure Elucidation** | Determine the molecular structure of an unknown compound by requesting and interpreting spectroscopic data (MS, NMR, HSQC, IR) alongside reference databases for chemical shifts and isotope distributions. | 16 | 20 | 2 | 15.1 |
| 🧬 **Retrosynthetic Planning** | Design multi-step synthetic routes to target molecules under cost, step-count, and commercial-availability constraints, using a template catalogue and functional-group detection tools. | 15 | 8 | 3 | 25.5 |
| 🤖 **ML-based Property Prediction** | Assemble a complete ML pipeline to predict formation energies of material polymorphs using data from the Materials Project, covering feature engineering, XGBoost training, and cross-validation. | 14 | 3 | 1 | 16.6 |
| 🔬 **AFM Experiment Execution** | Analyze and interpret atomic force microscopy data for nanoscale surface characterization, including topographical and mechanical property measurements. | 6 | 1 | 4 | 26.3 |
| ⚛️ **Molecular Simulation** | Design and execute molecular dynamics simulations with LAMMPS to predict materials properties, covering the full workflow from crystal structure retrieval to force-field queries and log analysis. | 8 | 2–3 | 2 | 30.4 |
| 🏗️ **Adsorption Surface Construction** | Build adsorbate–slab configurations from bulk crystal structures for heterogeneous catalysis studies, integrating Materials Project retrieval, slab generation, and adsorption-site enumeration. | 15 | 3 | 1 | 19.6 |
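
The table can be cross-checked against the headline counts with a few lines of arithmetic. The tools column is copied directly from the table. The task total rests on an interpretation that the card does not state: it assumes the total is the sum of tasks over all scopes, and that "2–3" for Molecular Simulation means one scope with 2 tasks and one with 3. Under those assumptions both totals agree with the 97 tools and 115 tasks quoted above.

```python
# (environment, tools, tasks per scope as a list with one entry per scope)
ENVIRONMENTS = [
    ("Inorganic Qualitative Analysis", 14, [10, 10, 10]),
    ("Circuit Inference", 9, [6]),
    ("Spectroscopic Structure Elucidation", 16, [20, 20]),
    ("Retrosynthetic Planning", 15, [8, 8, 8]),
    ("ML-based Property Prediction", 14, [3]),
    ("AFM Experiment Execution", 6, [1, 1, 1, 1]),
    ("Molecular Simulation", 8, [2, 3]),  # assumption: "2–3" read as 2 and 3
    ("Adsorption Surface Construction", 15, [3]),
]

total_tools = sum(tools for _, tools, _ in ENVIRONMENTS)
total_tasks = sum(sum(per_scope) for _, _, per_scope in ENVIRONMENTS)

print(f"environments: {len(ENVIRONMENTS)}")  # 8
print(f"tools: {total_tools}")               # 97
print(f"tasks: {total_tasks}")               # 115
```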
---
## 🗂️ Dataset Structure
### Configs
The dataset exposes **3 configs**, with **one config per model**. Each config groups the annotated trace files associated with that model.
### Data Splits
All configs expose a single `train` split.
### Data Instances
Each row corresponds to one **annotated trace file** associated with a specific model. These rows contain the trace together with its **LLM epistemic annotations** and reflect cases where the annotator judged that the agent did not reason scientifically.
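
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable: since this card does not list the config names or the column schema, the helper discovers the configs at runtime and prints whatever columns it finds. The helper names `load_train_splits` and `describe` are hypothetical, and the two injectable callables exist only so the logic can be tried offline with stand-ins. Calling `describe(load_train_splits())` downloads the data.

```python
from typing import Any, Callable, Dict, List, Optional

REPO_ID = "jablonkagroup/corral-reasoning-annotations"


def load_train_splits(
    repo_id: str = REPO_ID,
    list_configs: Optional[Callable[[str], List[str]]] = None,
    load: Optional[Callable[..., Any]] = None,
) -> Dict[str, Any]:
    """Load the `train` split of every config, keyed by config name.

    The two callables default to the Hugging Face `datasets` functions and
    exist only so the helper can be exercised offline with stand-ins.
    """
    if list_configs is None or load is None:
        # Imported lazily so this module imports without `datasets` installed.
        from datasets import get_dataset_config_names, load_dataset

        list_configs = list_configs or get_dataset_config_names
        load = load or load_dataset
    return {
        name: load(repo_id, name, split="train") for name in list_configs(repo_id)
    }


def describe(splits: Dict[str, Any]) -> None:
    """Print the row count and column names of each loaded split."""
    for name, split in splits.items():
        columns = getattr(split, "column_names", None)
        print(f"{name}: {len(split)} rows, columns={columns}")
```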
---
## 🏗️ Dataset Creation
### Curation Rationale
This dataset was created as part of *Corral* to support targeted inspection of **scientific reasoning failures** beyond end-task success. By releasing model-specific trace annotations generated by an automatic annotator, it provides a focused resource for analyzing epistemic failure patterns.
### Source Data
The traces were derived from *Corral* evaluation runs across environments and models. A downstream **LLM annotator** labeled the traces with epistemic annotations and identified cases suggesting that the agent did not reason scientifically. The resulting records were organized into model-specific configs, with each retained row corresponding to one annotated file.
---
## 🔗 Relation to Other Corral Artifacts
This dataset is one component of the broader *Corral* release and is best interpreted together with the matching task definitions, execution traces, reports, aggregate results, and reasoning annotations available in the [*Corral* collection](https://huggingface.co/collections/jablonkagroup/corral).
---
## 📄 Citation
```bibtex
@article{ríos-garcía2026ai,
  title   = {AI scientists produce results without reasoning scientifically},
  author  = {Martiño Ríos-García and Nawaf Alampara and Chandan Gupta and Indrajeet Mandal and Sajid Mannan and Ali Asghar Aghajani and N. M. Anoop Krishnan and Kevin Maik Jablonka},
  year    = {2026},
  journal = {arXiv preprint arXiv:2604.18805}
}
```
## 📜 License
This dataset is released under the [MIT License](https://opensource.org/licenses/MIT).
## 📝 Changelog
### 2026-04-22
- Initial release of the dataset card.



