jablonkagroup/questions4manual_annotation
Source: Hugging Face. Updated 2026-04-22; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/jablonkagroup/questions4manual_annotation
Resource description:
---
pretty_name: Corral – Traces for Manual Annotation
language:
- en
license:
- mit
multilinguality:
- monolingual
source_datasets:
- original
task_categories:
- text-classification
annotations_creators:
- machine-generated
language_creators:
- expert-generated
- machine-generated
tags:
- corral
- benchmark
- llm-agents
- scientific-agents
- traces
- evaluation
- chemistry
- materials-science
- knowledge
- reasoning
- manual-annotation
- epistemic-patterns
- process-evaluation
dataset_version: 0.0.1
dataset_release_date: '2026-04-22'
---
# *Corral* – Traces for Manual Annotation
<div align="center">

[Website](https://lamalab-org.github.io/corral/)
[Documentation](https://lamalab-org.github.io/corral/docs/)
[GitHub](https://github.com/lamalab-org/corral)
[License: MIT](https://opensource.org/licenses/MIT)
[arXiv:2604.18805](https://arxiv.org/abs/2604.18805)
[Dataset on Hugging Face](https://huggingface.co/datasets/jablonkagroup/questions4manual_annotation)
Selected Corral traces for manual annotation of epistemic patterns across environments, models, and QA dimensions
</div>
---
## 📋 Dataset Summary
This dataset is part of the *Corral* collection accompanying the paper [*AI scientists produce results without reasoning scientifically*](https://arxiv.org/abs/2604.18805). It contains the **evaluation traces** selected for **manual annotation of epistemic patterns** across the *Corral* benchmark.
The dataset has a single configuration, `default`. Each row holds one selected trace for a given combination of **environment × model** (and QA dimension), sampled in a stratified manner from the full *Corral* trace collection.
The included traces were selected because an **LLM annotator** identified them as cases where the agents **do not reason scientifically**, making them a targeted subset for downstream human review and epistemic-pattern analysis. This resource is intended for annotation, qualitative analysis, and process-level study of scientific-agent behaviour rather than for general-purpose model pre-training.
### 🎯 Supported Uses
- 🧠 Manually annotating epistemic patterns in scientific-agent traces
- 📊 Auditing cases where an automatic annotator flagged non-scientific reasoning
- 📐 Comparing trace-level failure modes across environments, models, and QA dimensions
- 🔁 Building qualitative analysis sets for reasoning-process studies in scientific agents
---
## 🧪 About *Corral*
[*Corral*](https://lamalab-org.github.io/corral/) is a framework for the *science of agents and agents for science*. It provides a microservice architecture that **decouples agents from environments** via a client–server design (REST API), ensuring flexibility, reproducibility, and robust isolation.
- 🌍 **Environments** define the task space, available tools, and observable feedback — from chemistry labs to HPC clusters.
- 🤖 **Agents** are modular LLM-based entities supporting scaffolds such as ReAct, ToolCalling, LLMPlanner, and Reflection.
- 📝 **Tasks** define problems to solve, complete with scoring functions. Tasks can be chained into TaskGroups for complex multi-stage challenges.
*Corral* currently ships **8 environments**, **97 tools**, **115 tasks**, and **786 subtasks** spanning chemistry, physics, and materials science.
### 🌍 Environments
| Environment | Description | 🔧 Tools | 📝 Tasks/scope | 🔭 Scopes | ⏱️ Avg. trace length |
|---|---|:---:|:---:|:---:|:---:|
| 🧫 **Inorganic Qualitative Analysis** | Identify unknown cations in solution through systematic wet-lab procedures (reagent addition, flame tests, pH measurement, centrifugation, etc.). Observations are computed from thermodynamic data. Three scopes progressively increase the number of candidate ions. | 14 | 10 | 3 | 39.4 |
| ⚡ **Circuit Inference** | Recover the topology and component values of a hidden resistor network from pairwise resistance measurements. Tools provide series/parallel calculations, delta-wye transforms, and circuit validation. | 9 | 6 | 1 | 15.0 |
| 🔭 **Spectroscopic Structure Elucidation** | Determine the molecular structure of an unknown compound by requesting and interpreting spectroscopic data (MS, NMR, HSQC, IR) alongside reference databases for chemical shifts and isotope distributions. | 16 | 20 | 2 | 15.1 |
| 🧬 **Retrosynthetic Planning** | Design multi-step synthetic routes to target molecules under cost, step-count, and commercial-availability constraints, using a template catalogue and functional-group detection tools. | 15 | 8 | 3 | 25.5 |
| 🤖 **ML-based Property Prediction** | Assemble a complete ML pipeline to predict formation energies of material polymorphs using data from the Materials Project, covering feature engineering, XGBoost training, and cross-validation. | 14 | 3 | 1 | 16.6 |
| 🔬 **AFM Experiment Execution** | Analyze and interpret atomic force microscopy data for nanoscale surface characterization, including topographical and mechanical property measurements. | 6 | 1 | 4 | 26.3 |
| ⚛️ **Molecular Simulation** | Design and execute molecular dynamics simulations with LAMMPS to predict materials properties, covering the full workflow from crystal structure retrieval to force-field queries and log analysis. | 8 | 2–3 | 2 | 30.4 |
| 🏗️ **Adsorption Surface Construction** | Build adsorbate–slab configurations from bulk crystal structures for heterogeneous catalysis studies, integrating Materials Project retrieval, slab generation, and adsorption-site enumeration. | 15 | 3 | 1 | 19.6 |
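As a quick consistency check, the headline counts can be recomputed from the table. The sketch below copies the per-environment figures from the table and assumes that an environment's task count is its tasks per scope multiplied by its number of scopes. That is a reading of the column headers, not something the card states. Because Molecular Simulation lists "2–3" tasks per scope, the task total is computed as a range.

```python
# Per-environment figures copied from the Environments table:
# name -> (tools, (min tasks per scope, max tasks per scope), scopes)
ENVIRONMENTS = {
    "Inorganic Qualitative Analysis": (14, (10, 10), 3),
    "Circuit Inference": (9, (6, 6), 1),
    "Spectroscopic Structure Elucidation": (16, (20, 20), 2),
    "Retrosynthetic Planning": (15, (8, 8), 3),
    "ML-based Property Prediction": (14, (3, 3), 1),
    "AFM Experiment Execution": (6, (1, 1), 4),
    "Molecular Simulation": (8, (2, 3), 2),  # the table lists "2–3" tasks per scope
    "Adsorption Surface Construction": (15, (3, 3), 1),
}

total_tools = sum(tools for tools, _, _ in ENVIRONMENTS.values())

# Assumption: tasks per environment = tasks per scope * number of scopes.
min_tasks = sum(lo * scopes for _, (lo, _), scopes in ENVIRONMENTS.values())
max_tasks = sum(hi * scopes for _, (_, hi), scopes in ENVIRONMENTS.values())

print(len(ENVIRONMENTS), total_tools, min_tasks, max_tasks)  # 8 97 114 116
```

Under that assumption the table gives 8 environments and 97 tools, and the stated total of 115 tasks falls inside the computed range of 114 to 116.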
---
## 🗂️ Dataset Structure
### Configs
Only the `default` config is available. It contains the traces selected for manual annotation across all included environment × model combinations.
### Data Splits
The `default` config exposes a single `train` split.
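A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable. The repository ID, config name, and split name come from this card. The helper name `load_traces` is only illustrative.

```python
REPO_ID = "jablonkagroup/questions4manual_annotation"
CONFIG = "default"   # the only config listed on this card
SPLIT = "train"      # the only split listed on this card


def load_traces():
    """Download (or reuse the local cache of) the train split of the default config."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, CONFIG, split=SPLIT)
```

Calling `load_traces()` returns a `datasets.Dataset`. Its `column_names` and `features` attributes show the columns as published, which is the safest way to find the actual field names, since this card does not list them.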
### Data Instances
Each row corresponds to one **selected trace** associated with a specific combination of Corral **environment**, **model**, and QA dimension (`knowledge` or `reasoning`). These rows are intended as annotation units for epistemic-pattern review.
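To see how annotation units spread across combinations, rows can be counted per (environment, model, QA dimension) triple. The sketch below uses placeholder rows. The keys `environment`, `model`, and `dimension` are assumed column names, and the values are invented for illustration, so check the real schema before reusing it.

```python
from collections import Counter

# Placeholder rows: the keys are assumed column names and the values are
# invented for illustration only.
rows = [
    {"environment": "circuit_inference", "model": "model-a", "dimension": "knowledge"},
    {"environment": "circuit_inference", "model": "model-a", "dimension": "reasoning"},
    {"environment": "retrosynthesis", "model": "model-b", "dimension": "reasoning"},
]

# The two QA dimensions named on this card.
VALID_DIMENSIONS = {"knowledge", "reasoning"}


def coverage(rows):
    """Count annotation units per (environment, model, dimension) combination."""
    counts = Counter()
    for row in rows:
        if row["dimension"] not in VALID_DIMENSIONS:
            raise ValueError(f"unexpected QA dimension: {row['dimension']!r}")
        counts[(row["environment"], row["model"], row["dimension"])] += 1
    return counts


counts = coverage(rows)
for combo, n in sorted(counts.items()):
    print(combo, n)
```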
---
## 🏗️ Dataset Creation
### Curation Rationale
This dataset was created as part of *Corral* to support targeted inspection of **scientific reasoning failures** beyond end-task success. By collecting traces flagged as non-scientific by an automatic annotator, it provides a focused subset for manual annotation of epistemic patterns.
### Source Data
The traces were derived from *Corral* evaluation runs across environments and models. A downstream **LLM annotator** identified cases suggesting that the agent did not reason scientifically, and those traces were collected here for subsequent manual review. Each retained row corresponds to one environment-model-dimension combination, where the dimension reflects the associated **knowledge** or **reasoning** QA setting.
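The selection described above (keep traces flagged by the annotator, then sample per environment-model-dimension stratum) can be sketched as follows. This illustrates the general technique and is not the authors' pipeline. The `flagged` field, the other field names, the per-stratum sample size, and the toy traces are all hypothetical.

```python
import random
from collections import defaultdict


def select_for_annotation(traces, per_stratum=1, seed=0):
    """Keep flagged traces, then sample up to `per_stratum` from each stratum."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    strata = defaultdict(list)
    for trace in traces:
        if trace["flagged"]:  # hypothetical field: annotator flagged the trace
            key = (trace["environment"], trace["model"], trace["dimension"])
            strata[key].append(trace)
    selected = []
    for key in sorted(strata):  # sorted so the iteration order is deterministic
        pool = strata[key]
        selected.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return selected


# Toy input: 2 environments x 2 models x 2 dimensions, 3 traces each,
# of which the first in every group is left unflagged.
traces = [
    {"id": f"{env}-{model}-{dim}-{i}", "environment": env, "model": model,
     "dimension": dim, "flagged": i != 0}
    for env in ("env-a", "env-b")
    for model in ("model-x", "model-y")
    for dim in ("knowledge", "reasoning")
    for i in range(3)
]

selected = select_for_annotation(traces)
print(len(selected))  # 8: one flagged trace per stratum
```

Sampling within each stratum keeps every environment-model-dimension combination represented, even when some combinations produce far more flagged traces than others.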
---
## 🔗 Relation to Other Corral Artifacts
This dataset is one component of the broader *Corral* release and is best interpreted together with the matching task definitions, execution traces, reports, aggregate results, and reasoning annotations available in the [*Corral* collection](https://huggingface.co/collections/jablonkagroup/corral).
---
## 📄 Citation
```bibtex
@article{ríos-garcía2026ai,
title = {AI scientists produce results without reasoning scientifically},
author = {Martiño Ríos-García and Nawaf Alampara and Chandan Gupta and Indrajeet Mandal and Sajid Mannan and Ali Asghar Aghajani and N. M. Anoop Krishnan and Kevin Maik Jablonka},
year = {2026},
journal = {arXiv preprint arXiv:2604.18805}
}
```
## 📜 License
This dataset is released under the [MIT License](https://opensource.org/licenses/MIT).
## Changelog
### 2026-04-22
- Initial release of the dataset card.
Provider:
jablonkagroup



