jablonkagroup/corral-reasoning-annotations

Name: jablonkagroup/corral-reasoning-annotations
Creator: jablonkagroup
Published: 2026-04-22 12:40:49
License: MIT

Hugging Face · Updated 2026-04-22 · Indexed 2026-05-10

Download link:

https://hf-mirror.com/datasets/jablonkagroup/corral-reasoning-annotations


Resource description:

---
pretty_name: Corral – Reasoning Annotations
language:
- en
license:
- mit
multilinguality:
- monolingual
source_datasets:
- original
task_categories:
- text-classification
annotations_creators:
- machine-generated
language_creators:
- expert-generated
- machine-generated
tags:
- corral
- benchmark
- llm-agents
- scientific-agents
- traces
- annotations
- llm-annotation
- evaluation
- chemistry
- materials-science
- knowledge
- reasoning
- reasoning-annotations
- epistemic-patterns
- process-evaluation
dataset_version: 0.0.1
dataset_release_date: '2026-04-22'
---

# *Corral* – Reasoning Annotations

<div align="center">

![Corral Logo](corral_logo_final.png)

[![Website](https://img.shields.io/badge/🌐-Website-green)](https://lamalab-org.github.io/corral/)
[![Docs](https://img.shields.io/badge/📚-Docs-blue)](https://lamalab-org.github.io/corral/docs/)
[![GitHub](https://img.shields.io/badge/💻-Code-black?logo=github)](https://github.com/lamalab-org/corral)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2604.18805)
[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/jablonkagroup/corral-reasoning-annotations)

LLM epistemic annotations over Corral traces where the annotator judged that the agents do not reason scientifically

</div>

---

## 📋 Dataset Summary

This dataset is part of the *Corral* collection accompanying the paper [*AI scientists produce results without reasoning scientifically*](https://arxiv.org/abs/2604.18805). It contains **annotated evaluation traces** with **LLM-generated epistemic annotations** across the *Corral* benchmark. The dataset is organized into **3 configurations**, with **one config per model** (the model that produced the traces).
Within each configuration, **each row corresponds to one file to annotate**, represented as an annotated trace instance from that model. The included annotations were produced by an **LLM annotator** that identified these cases as ones where the agents **do not reason scientifically**. This resource is intended for auditing, qualitative analysis, and process-level study of scientific-agent behaviour rather than for general-purpose model pre-training.

### 🎯 Supported Uses

- 🧠 Studying LLM-generated epistemic annotations over scientific-agent traces
- 📊 Auditing cases where an automatic annotator flagged non-scientific reasoning
- 📐 Comparing annotated reasoning failures across models and trace files
- 🔁 Building qualitative analysis sets for reasoning-process studies in scientific agents

---

## 🧪 About *Corral*

[*Corral*](https://lamalab-org.github.io/corral/) is a framework for the *science of agents and agents for science*. It provides a microservice architecture that **decouples agents from environments** via a client–server design (REST API), ensuring flexibility, reproducibility, and robust isolation.

- 🌍 **Environments** define the task space, available tools, and observable feedback, from chemistry labs to HPC clusters.
- 🤖 **Agents** are modular LLM-based entities supporting scaffolds such as ReAct, ToolCalling, LLMPlanner, and Reflection.
- 📝 **Tasks** define problems to solve, complete with scoring functions. Tasks can be chained into TaskGroups for complex multi-stage challenges.

*Corral* currently ships **8 environments**, **97 tools**, **115 tasks**, and **786 subtasks** spanning chemistry, physics, and materials science.

### 🌍 Environments

| Environment | Description | 🔧 Tools | 📝 Tasks/scope | 🔭 Scopes | ⏱️ Avg. trace length |
|---|---|:---:|:---:|:---:|:---:|
| 🧫 **Inorganic Qualitative Analysis** | Identify unknown cations in solution through systematic wet-lab procedures (reagent addition, flame tests, pH measurement, centrifugation, etc.). Observations are computed from thermodynamic data. Three scopes progressively increase the number of candidate ions. | 14 | 10 | 3 | 39.4 |
| ⚡ **Circuit Inference** | Recover the topology and component values of a hidden resistor network from pairwise resistance measurements. Tools provide series/parallel calculations, delta-wye transforms, and circuit validation. | 9 | 6 | 1 | 15.0 |
| 🔭 **Spectroscopic Structure Elucidation** | Determine the molecular structure of an unknown compound by requesting and interpreting spectroscopic data (MS, NMR, HSQC, IR) alongside reference databases for chemical shifts and isotope distributions. | 16 | 20 | 2 | 15.1 |
| 🧬 **Retrosynthetic Planning** | Design multi-step synthetic routes to target molecules under cost, step-count, and commercial-availability constraints, using a template catalogue and functional-group detection tools. | 15 | 8 | 3 | 25.5 |
| 🤖 **ML-based Property Prediction** | Assemble a complete ML pipeline to predict formation energies of material polymorphs using data from the Materials Project, covering feature engineering, XGBoost training, and cross-validation. | 14 | 3 | 1 | 16.6 |
| 🔬 **AFM Experiment Execution** | Analyze and interpret atomic force microscopy data for nanoscale surface characterization, including topographical and mechanical property measurements. | 6 | 1 | 4 | 26.3 |
| ⚛️ **Molecular Simulation** | Design and execute molecular dynamics simulations with LAMMPS to predict materials properties, covering the full workflow from crystal structure retrieval to force-field queries and log analysis. | 8 | 2–3 | 2 | 30.4 |
| 🏗️ **Adsorption Surface Construction** | Build adsorbate–slab configurations from bulk crystal structures for heterogeneous catalysis studies, integrating Materials Project retrieval, slab generation, and adsorption-site enumeration. | 15 | 3 | 1 | 19.6 |

---

## 🗂️ Dataset Structure

### Configs

The dataset exposes **3 configs**, with **one config per model**.
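
As a minimal sketch of how the per-model configs might be explored, the snippet below uses the Hugging Face `datasets` library (assumed to be installed, with network access available). The config names are not stated in this card, so they are discovered at runtime rather than hard-coded; the helper name `load_all_configs` is hypothetical.

```python
"""Sketch: enumerate the per-model configs and load each `train` split."""

REPO_ID = "jablonkagroup/corral-reasoning-annotations"
SPLIT = "train"  # the card states that every config has a single `train` split


def load_all_configs(repo_id: str = REPO_ID, split: str = SPLIT) -> dict:
    """Return {config_name: Dataset} for every config of the repository.

    The `datasets` import is done lazily, so defining this helper has no
    third-party dependency; calling it needs `datasets` and network access.
    """
    from datasets import get_dataset_config_names, load_dataset

    return {
        name: load_dataset(repo_id, name, split=split)
        for name in get_dataset_config_names(repo_id)
    }
```

Calling `load_all_configs()` should return one entry per model config. Since the card does not document the column schema, inspecting `ds.column_names` and `ds.num_rows` on each returned dataset is a safer first step than assuming field names.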
Each config groups the annotated trace files associated with that model.

### Data Splits

All configs expose a single `train` split.

### Data Instances

Each row corresponds to one **annotated trace file** associated with a specific model. These rows contain the trace together with its **LLM epistemic annotations** and reflect cases where the annotator judged that the agent did not reason scientifically.

---

## 🏗️ Dataset Creation

### Curation Rationale

This dataset was created as part of *Corral* to support targeted inspection of **scientific reasoning failures** beyond end-task success. By releasing model-specific trace annotations generated by an automatic annotator, it provides a focused resource for analyzing epistemic failure patterns.

### Source Data

The traces were derived from *Corral* evaluation runs across environments and models. A downstream **LLM annotator** labeled the traces with epistemic annotations and identified cases suggesting that the agent did not reason scientifically. The resulting records were organized into model-specific configs, with each retained row corresponding to one annotated file.

---

## 🔗 Relation to Other Corral Artifacts

This dataset is one component of the broader *Corral* release and is best interpreted together with the matching task definitions, execution traces, reports, aggregate results, and reasoning annotations available in the [*Corral* collection](https://huggingface.co/collections/jablonkagroup/corral).

---

## 📄 Citation

```bibtex
@article{ríos-garcía2026ai,
  title   = {AI scientists produce results without reasoning scientifically},
  author  = {Martiño Ríos-García and Nawaf Alampara and Chandan Gupta and Indrajeet Mandal and Sajid Mannan and Ali Asghar Aghajani and N. M. Anoop Krishnan and Kevin Maik Jablonka},
  year    = {2026},
  journal = {arXiv preprint arXiv:2604.18805}
}
```

## 📜 License

This dataset is released under the [MIT License](https://opensource.org/licenses/MIT).
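
The headline counts quoted in the About *Corral* section can be cross-checked against the Environments table with a little arithmetic. The sketch below transcribes the tools, tasks-per-scope, and scopes columns (the tuple layout and variable names are illustrative, not part of the dataset) and assumes that an environment's task total is its tasks per scope multiplied by its number of scopes. Because Molecular Simulation lists 2–3 tasks per scope, the task total is computed as a range.

```python
# (environment, tools, (min, max) tasks per scope, scopes), copied from the table.
ENVIRONMENTS = [
    ("Inorganic Qualitative Analysis", 14, (10, 10), 3),
    ("Circuit Inference", 9, (6, 6), 1),
    ("Spectroscopic Structure Elucidation", 16, (20, 20), 2),
    ("Retrosynthetic Planning", 15, (8, 8), 3),
    ("ML-based Property Prediction", 14, (3, 3), 1),
    ("AFM Experiment Execution", 6, (1, 1), 4),
    ("Molecular Simulation", 8, (2, 3), 2),
    ("Adsorption Surface Construction", 15, (3, 3), 1),
]

total_tools = sum(tools for _, tools, _, _ in ENVIRONMENTS)

# Assumption: total tasks per environment = tasks per scope * number of scopes.
min_tasks = sum(lo * scopes for _, _, (lo, _), scopes in ENVIRONMENTS)
max_tasks = sum(hi * scopes for _, _, (_, hi), scopes in ENVIRONMENTS)

print(len(ENVIRONMENTS))     # 8
print(total_tools)           # 97
print(min_tasks, max_tasks)  # 114 116
```

Under that assumption the tool column sums to the stated 97, and the stated 115 tasks falls inside the 114–116 range, which would be consistent with Molecular Simulation having 2 tasks in one scope and 3 in the other. The 786 subtasks cannot be checked this way because the table has no subtask column.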
## Changelog

### 2026-04-22

- Initial release of the dataset card.
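
To make the agent/environment decoupling described in the About *Corral* section more concrete, here is a toy sketch. It is not *Corral*'s actual API: every class, method, and tool name below is hypothetical, and the REST boundary is replaced by an in-process call for brevity. The point it illustrates is only that the agent depends on a narrow client interface, while the task carries its own scoring function.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class EnvironmentClient(Protocol):
    """All an agent is allowed to see: tool names and a way to call them."""

    def list_tools(self) -> list[str]: ...

    def call_tool(self, name: str, **kwargs: float) -> float: ...


class ToyCircuitEnvironment:
    """In-process stand-in for an environment server (hypothetical)."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., float]] = {
            "series": lambda r1, r2: r1 + r2,
            "parallel": lambda r1, r2: (r1 * r2) / (r1 + r2),
        }

    def list_tools(self) -> list[str]:
        return sorted(self._tools)

    def call_tool(self, name: str, **kwargs: float) -> float:
        return self._tools[name](**kwargs)


@dataclass
class Task:
    prompt: str
    score: Callable[[float], float]  # maps an answer to a score in [0, 1]


def scripted_agent(env: EnvironmentClient) -> float:
    """A fixed tool-calling policy; a real agent would let an LLM choose."""
    branch = env.call_tool("parallel", r1=100.0, r2=100.0)  # 50 ohm
    return env.call_tool("series", r1=branch, r2=25.0)  # 75 ohm


task = Task(
    prompt="Two 100 ohm resistors in parallel, in series with 25 ohm.",
    score=lambda answer: 1.0 if abs(answer - 75.0) < 1e-9 else 0.0,
)
answer = scripted_agent(ToyCircuitEnvironment())
print(answer, task.score(answer))  # 75.0 1.0
```

Because `scripted_agent` only relies on the `EnvironmentClient` protocol, the toy environment could be swapped for a client that forwards the same calls over HTTP without changing the agent code, which is the kind of isolation the client–server design aims for.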

