jablonkagroup/corral-QAs-reports

Name: jablonkagroup/corral-QAs-reports
Creator: jablonkagroup
Published: 2026-04-22 12:15:46
License: MIT

Source: Hugging Face. Updated: 2026-04-22. Indexed: 2026-05-10.

Download link:

https://hf-mirror.com/datasets/jablonkagroup/corral-QAs-reports


Description:

---
pretty_name: Corral – QA Reports
language:
- en
license:
- mit
multilinguality:
- monolingual
source_datasets:
- original
task_categories:
- text-generation
- question-answering
annotations_creators:
- machine-generated
language_creators:
- expert-generated
- machine-generated
tags:
- corral
- benchmark
- llm-agents
- scientific-agents
- qa
- reports
- model-completions
- evaluation
- chemistry
- materials-science
- knowledge
- reasoning
- irt
- item-response-theory
dataset_version: 0.0.1
dataset_release_date: '2026-04-22'
---

# *Corral* – QA Reports

<div align="center">

![Corral Logo](corral_logo_final.png)

[![Website](https://img.shields.io/badge/🌐-Website-green)](https://lamalab-org.github.io/corral/)
[![Docs](https://img.shields.io/badge/📚-Docs-blue)](https://lamalab-org.github.io/corral/docs/)
[![GitHub](https://img.shields.io/badge/💻-Code-black?logo=github)](https://github.com/lamalab-org/corral)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2604.18805)
[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-yellow)](https://huggingface.co/datasets/jablonkagroup/corral-QAs-reports)

Model completions for question-answer evaluations probing factual knowledge and reasoning across Corral environments

</div>

---

## 📋 Dataset Summary

This dataset is part of the *Corral* collection accompanying the paper [*AI scientists produce results without reasoning scientifically*](https://arxiv.org/abs/2604.18805). It contains the **model completions and reports** for the **question-answer evaluations** used to test the **factual knowledge** and **reasoning ability** of models across *Corral* environments.

The dataset is organized into **50 configurations**, with one configuration for each available combination of **environment**, **model**, and evaluation dimension (**knowledge** or **reasoning**). For example, a config encodes the reports generated by one model on either the knowledge-focused or reasoning-focused QA set for a given Corral environment.

These completions correspond to the QA items used in the **Item Response Theory (IRT)** analyses reported in the *Corral* study, where the underlying knowledge and reasoning QAs serve as indicators for the latent **knowledge** and **reasoning** factors.

This resource is intended for evaluation, psychometric modeling, and analysis of scientific-agent capabilities rather than for general-purpose model pre-training.

### 🎯 Supported Uses

- 🧠 Analyzing model completions on factual-knowledge and scientific-reasoning QAs across Corral environments
- 📊 Reproducing and extending the IRT analyses reported in the paper
- 📐 Studying latent knowledge and reasoning factors through model response behavior
- 🔁 Building meta-evaluation datasets for model comparison, reporting, and capability analysis

---

## 🧪 About *Corral*

[*Corral*](https://lamalab-org.github.io/corral/) is a framework for the *science of agents and agents for science*. It provides a microservice architecture that **decouples agents from environments** via a client–server design (REST API), ensuring flexibility, reproducibility, and robust isolation.

- 🌍 **Environments** define the task space, available tools, and observable feedback, from chemistry labs to HPC clusters.
- 🤖 **Agents** are modular LLM-based entities supporting scaffolds such as ReAct, ToolCalling, LLMPlanner, and Reflection.
- 📝 **Tasks** define problems to solve, complete with scoring functions. Tasks can be chained into TaskGroups for complex multi-stage challenges.

*Corral* currently ships **8 environments**, **97 tools**, **115 tasks**, and **786 subtasks** spanning chemistry, physics, and materials science.

### 🌍 Environments

| Environment | Description | 🔧 Tools | 📝 Tasks/scope | 🔭 Scopes | ⏱️ Avg. trace length |
|---|---|:---:|:---:|:---:|:---:|
| 🧫 **Inorganic Qualitative Analysis** | Identify unknown cations in solution through systematic wet-lab procedures (reagent addition, flame tests, pH measurement, centrifugation, etc.). Observations are computed from thermodynamic data. Three scopes progressively increase the number of candidate ions. | 14 | 10 | 3 | 39.4 |
| ⚡ **Circuit Inference** | Recover the topology and component values of a hidden resistor network from pairwise resistance measurements. Tools provide series/parallel calculations, delta-wye transforms, and circuit validation. | 9 | 6 | 1 | 15.0 |
| 🔭 **Spectroscopic Structure Elucidation** | Determine the molecular structure of an unknown compound by requesting and interpreting spectroscopic data (MS, NMR, HSQC, IR) alongside reference databases for chemical shifts and isotope distributions. | 16 | 20 | 2 | 15.1 |
| 🧬 **Retrosynthetic Planning** | Design multi-step synthetic routes to target molecules under cost, step-count, and commercial-availability constraints, using a template catalogue and functional-group detection tools. | 15 | 8 | 3 | 25.5 |
| 🤖 **ML-based Property Prediction** | Assemble a complete ML pipeline to predict formation energies of material polymorphs using data from the Materials Project, covering feature engineering, XGBoost training, and cross-validation. | 14 | 3 | 1 | 16.6 |
| 🔬 **AFM Experiment Execution** | Analyze and interpret atomic force microscopy data for nanoscale surface characterization, including topographical and mechanical property measurements. | 6 | 1 | 4 | 26.3 |
| ⚛️ **Molecular Simulation** | Design and execute molecular dynamics simulations with LAMMPS to predict materials properties, covering the full workflow from crystal structure retrieval to force-field queries and log analysis. | 8 | 2–3 | 2 | 30.4 |
| 🏗️ **Adsorption Surface Construction** | Build adsorbate–slab configurations from bulk crystal structures for heterogeneous catalysis studies, integrating Materials Project retrieval, slab generation, and adsorption-site enumeration. | 15 | 3 | 1 | 19.6 |

---

## 🗂️ Dataset Structure

### Configs

Each config name encodes `{environment}_{model}_{dimension}`, where:

- `environment` is a short identifier for one of the 8 *Corral* environments (e.g., `afm`, `circuit_inference`, `spectroscopic`, `retrosynthesis`, `ml_property`, `molecular_simulation`, `adsorption`).
- `model` identifies the model whose completions are included in that configuration.
- `dimension` is either `knowledge` or `reasoning`.

This yields **50 total configs**, one for each available environment-model pair and knowledge/reasoning combination.

### Data Splits

All configs expose a single `train` split.

### Data Instances

Each row corresponds to one **model completion/report** for a specific question-answer item, associated with a particular Corral environment, model, and evaluation dimension: knowledge or reasoning.

---

## 🏗️ Dataset Creation

### Curation Rationale

This dataset was created as part of *Corral* to measure scientific-agent capabilities beyond end-task success by collecting model outputs on targeted QA items that separate **factual knowledge** from **reasoning ability** and support IRT-based analysis.

### Source Data

The underlying QAs were derived from the task content, domain knowledge, and reasoning demands of the *Corral* benchmark environments. The dataset released here contains the corresponding model completions/reports on those QAs, organized by environment, model, and evaluation dimension, and used in the analyses of latent knowledge and reasoning factors.
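
The `{environment}_{model}_{dimension}` naming scheme described under Dataset Structure can be unpacked programmatically, which may help when grouping the 50 configs by environment or dimension. The following is a minimal sketch, not part of the official *Corral* tooling. It assumes config names follow the pattern literally, and it only knows the seven environment identifiers the card lists as examples (the eighth is not given there). The helper `parse_config_name`, the `ConfigName` type, and the model name `example-model` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple

# Environment identifiers listed as examples in the card. The identifier of
# the eighth environment is not stated there, so it is deliberately omitted.
KNOWN_ENVIRONMENTS = (
    "afm",
    "circuit_inference",
    "spectroscopic",
    "retrosynthesis",
    "ml_property",
    "molecular_simulation",
    "adsorption",
)
DIMENSIONS = ("knowledge", "reasoning")


class ConfigName(NamedTuple):
    environment: str
    model: str
    dimension: str


def parse_config_name(name: str) -> ConfigName:
    """Split '{environment}_{model}_{dimension}' into its three parts."""
    # The dimension is the last underscore-separated token.
    head, sep, dimension = name.rpartition("_")
    if not sep or dimension not in DIMENSIONS:
        raise ValueError(f"unrecognised dimension in config name: {name!r}")
    # Environment identifiers may themselves contain underscores, so match
    # them as prefixes, trying longer identifiers first.
    for env in sorted(KNOWN_ENVIRONMENTS, key=len, reverse=True):
        prefix = env + "_"
        if head.startswith(prefix) and len(head) > len(prefix):
            return ConfigName(env, head[len(prefix):], dimension)
    raise ValueError(f"unrecognised environment in config name: {name!r}")


if __name__ == "__main__":
    # 'example-model' is a placeholder, not a model from the dataset.
    print(parse_config_name("circuit_inference_example-model_reasoning"))
```

Matching the environment as a prefix, rather than splitting on every underscore, keeps identifiers such as `circuit_inference` and model names that contain underscores intact. To load a given config, the Hugging Face `datasets` library's `load_dataset("jablonkagroup/corral-QAs-reports", config_name, split="train")` should work, since all configs expose a single `train` split; the actual config names are best checked on the dataset page rather than guessed.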
---

## 🔗 Relation to Other Corral Artifacts

This dataset is one component of the broader *Corral* release and is best interpreted together with the matching task definitions, execution traces, reports, aggregate results, and reasoning annotations available in the [*Corral* collection](https://huggingface.co/collections/jablonkagroup/corral).

---

## 📄 Citation

```bibtex
@article{ríos-garcía2026ai,
  title   = {AI scientists produce results without reasoning scientifically},
  author  = {Martiño Ríos-García and Nawaf Alampara and Chandan Gupta and Indrajeet Mandal and Sajid Mannan and Ali Asghar Aghajani and N. M. Anoop Krishnan and Kevin Maik Jablonka},
  year    = {2026},
  journal = {arXiv preprint arXiv: 2604.18805}
}
```

## 📜 License

This dataset is released under the [MIT License](https://opensource.org/licenses/MIT).

## Changelog

### 2026-04-22

- Initial release of the dataset card.

Provider:

jablonkagroup
