yongqiqng/CivilInstruct-Sample

Name: yongqiqng/CivilInstruct-Sample
Creator: yongqiqng
Published: 2026-04-10 05:02:32
License: no description provided

Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/yongqiqng/CivilInstruct-Sample


Resource description:

---
license: cc-by-nc-4.0
task_categories:
- text-generation
language:
- en
tags:
- code
- structural-engineering
- openseespy
- scientific-modeling
- instruction-tuning
- reinforcement-learning
size_categories:
- 1K<n<10K
configs:
- config_name: sft
  data_files:
  - split: train
    path: data/sft_train.parquet
  - split: validation
    path: data/sft_val.parquet
- config_name: rl
  data_files:
  - split: train
    path: data/rl_train.parquet
  - split: test
    path: data/rl_test.parquet
---

# CivilInstruct-Sample

A 10% sample of the **CivilInstruct** dataset from the paper *Rethinking Scientific Modeling: Toward Physically Consistent and Simulation-Executable Programmatic Generation*.

This sample is released for **demonstration and reproducibility** of the AutoBM training pipeline. The full dataset (10,912 samples) will be released upon paper publication.

## Overview

CivilInstruct is a domain-specific instruction dataset for training LLMs to generate executable, physically consistent **OpenSeesPy** structural modeling code from natural language building specifications. It integrates four data sources:

| Part | Description | Full | Sample (this repo) |
|------|-------------|------|--------------------|
| Part 1 | Fine-grained supervised data (OpenSeesPy API tutorials) | 3,881 | ~388 |
| Part 2 | Parameterized generated long code data | 3,100 | ~310 |
| Part 3 | Execution error-oriented debugging CoT data | 3,500 | ~350 |
| Part 4 | Physics-informed expert data (with ground-truth periods) | 512 | ~51 |

## Subsets

### `sft`: Supervised Fine-Tuning

Used for Stage I training (instruction tuning).

| Split | Samples | Fields |
|-------|---------|--------|
| `train` | 989 | `instruction`, `response` |
| `validation` | 20 | `instruction`, `response` |

### `rl`: Reinforcement Learning (SPC-GRPO)

Used for Stage II training with Multi-Granularity Hybrid Reward (MGHR).

| Split | Samples | Fields |
|-------|---------|--------|
| `train` | 45 | `data_source`, `prompt`, `ability`, `reward_model`, `extra_info` |
| `test` | 5 | `data_source`, `prompt`, `ability`, `reward_model`, `extra_info` |

The `reward_model.ground_truth` field contains the structural fundamental period (T1) used for physics-aware reward grading.

## Usage

```python
from datasets import load_dataset

# SFT data
sft_train = load_dataset("yongqiqng/CivilInstruct-Sample", "sft", split="train")
print(sft_train[0])

# RL data
rl_train = load_dataset("yongqiqng/CivilInstruct-Sample", "rl", split="train")
print(rl_train[0])
```

## Related

- Paper: [arXiv:2602.07083](https://arxiv.org/abs/2602.07083)
- Code: [github.com/Jovanqing/AutoBM](https://github.com/Jovanqing/AutoBM)
- Model: [Jovanqing/Seed-Coder-8B-Reasoning-SFT](https://huggingface.co/Jovanqing/Seed-Coder-8B-Reasoning-SFT)
- Full dataset: *Coming soon upon paper publication*

## Citation

```bibtex
@article{jiang2026rethinking,
  title={Rethinking Scientific Modeling: Toward Physically Consistent and Simulation-Executable Programmatic Generation},
  author={Jiang, Yongqing and Wang, Jianze and Shen, Zhiqi and Lin, Zhenghong and Wang, Jiayuan and Yang, Yijian and Dai, Kaoshan and Luo, Haoran},
  journal={arXiv preprint arXiv:2602.07083},
  year={2026}
}
```

## License

This sample is released under **CC BY-NC 4.0** (non-commercial). For commercial inquiries, please contact the authors.
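
## Example: reading the ground-truth period

The `rl` subset stores the fundamental period T1 in `reward_model.ground_truth`. The sketch below shows one way that value could be read from a record and compared with a period computed by generated code. It is an illustration only and is not the MGHR reward from the paper: the helper names `ground_truth_period` and `period_reward`, the `tolerance` parameter, and the linear decay are all assumptions, as is the idea that the stored value can be coerced with `float()`. The toy record contains made-up values so the snippet runs without downloading the dataset.

```python
def ground_truth_period(record: dict) -> float:
    """Read T1 from a record shaped like the documented `rl` fields.

    Assumption: the stored value is a number or a numeric string.
    """
    return float(record["reward_model"]["ground_truth"])


def period_reward(predicted_t1: float, ground_truth_t1: float,
                  tolerance: float = 0.1) -> float:
    """Hypothetical score in [0, 1] based on relative period error.

    Returns 1.0 for an exact match and decays linearly to 0.0 once the
    relative error reaches `tolerance`. This is NOT the paper's reward.
    """
    if ground_truth_t1 <= 0:
        raise ValueError("ground-truth period must be positive")
    relative_error = abs(predicted_t1 - ground_truth_t1) / ground_truth_t1
    return max(0.0, 1.0 - relative_error / tolerance)


# Toy record mimicking the `rl` schema; every value here is made up.
toy_record = {
    "data_source": "toy",
    "prompt": "placeholder prompt",
    "ability": "placeholder",
    "reward_model": {"ground_truth": "0.80"},
    "extra_info": {},
}

t1_reference = ground_truth_period(toy_record)
score = period_reward(predicted_t1=0.84, ground_truth_t1=t1_reference)
print(f"reference T1 = {t1_reference:.2f} s, score = {score:.2f}")
```

With a real record, `toy_record` would be replaced by an item such as `rl_train[0]` from the usage snippet; the actual type of `ground_truth` should be checked before relying on the `float()` coercion.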

Provided by:

yongqiqng
