
cmudrc/OpenSeeSimE-Fluid-Small

Hugging Face · Updated 2026-04-24 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/cmudrc/OpenSeeSimE-Fluid-Small
Description:
---
dataset_info:
  features:
  - name: file_name
    dtype: string
  - name: source_file
    dtype: string
  - name: question
    dtype: string
  - name: question_type
    dtype: string
  - name: question_id
    dtype: int32
  - name: answer
    dtype: string
  - name: answer_choices
    list: string
  - name: correct_choice_idx
    dtype: int32
  - name: image
    dtype: image
  - name: video
    dtype: video
  - name: media_type
    dtype: string
  splits:
  - name: test
    num_examples: 9881
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
license: mit
task_categories:
- visual-question-answering
language:
- en
size_categories:
- 1K<n<10K
tags:
- engineering
- simulation
- stratified-subset
---

# OpenSeeSimE-Fluid-Small

A **stratified 10% subset** of [`cmudrc/OpenSeeSimE-Fluid`](https://huggingface.co/datasets/cmudrc/OpenSeeSimE-Fluid) for evaluating vision-language models at a reduced compute footprint while preserving the joint distribution of simulation type, question type, media type, and question id.

## Subset Provenance

- **Parent dataset**: [`cmudrc/OpenSeeSimE-Fluid`](https://huggingface.co/datasets/cmudrc/OpenSeeSimE-Fluid) (98,326 rows total)
- **Rows in this subset**: **9,881** (10.05% of parent)
- **Source classes**: `Bent Pipe`, `Converging Nozzle`, `Heat Exchanger`, `Heat Sink`, `Mixing Pipe`
- **Parquet shards**: 19 | **Storage**: ~103.68 GB
- **Sampling**: per-stratum shuffle with `numpy.random.default_rng(42)`, then take `ceil(n * fraction)` rows from each stratum. Any non-empty stratum contributes at least 1 row.
- **Strata**: `(source_file, question_type, media_type, question_id)`, all four jointly.
- **Nesting**: the 1% subset is a literal subset of the 10% subset (the same shuffled prefix is taken for every fraction).

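
The sampling rule above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `stratified_indices`, the plain-dict row representation, the sorted stratum order, and the use of one generator shared across strata are all assumptions made for the example. Only the seed, the four stratum fields, and the `ceil(n * fraction)` rule come from the card.

```python
import math
from collections import defaultdict

import numpy as np

# The four fields that jointly define a stratum (taken from the card).
STRATUM_KEYS = ("source_file", "question_type", "media_type", "question_id")


def stratified_indices(rows, fraction, seed=42):
    """Return sorted row indices of a stratified subset (hypothetical helper)."""
    # Group row indices by their four-field stratum key.
    strata = defaultdict(list)
    for idx, row in enumerate(rows):
        strata[tuple(row[k] for k in STRATUM_KEYS)].append(idx)

    rng = np.random.default_rng(seed)
    selected = []
    # Visit strata in a fixed (sorted) order so that the generator produces
    # the same shuffle for every fraction. Assumption: the real script may
    # order strata differently.
    for key in sorted(strata):
        members = np.array(strata[key])
        rng.shuffle(members)
        # ceil guarantees at least one row from every non-empty stratum.
        take = math.ceil(len(members) * fraction)
        selected.extend(members[:take].tolist())
    return sorted(selected)
```

Because the shuffle does not depend on `fraction`, a smaller fraction always takes a prefix of what a larger fraction takes, which is what makes the 1% subset nest inside the 10% subset. Note that `ceil(n * fraction)` is evaluated in floating point in this sketch, so per-stratum counts can occasionally round up by one.
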
## Composition

### By `source_file`

| source_file       |   rows |   pct |
|:------------------|-------:|------:|
| Mixing Pipe       |   2070 | 20.95 |
| Heat Exchanger    |   2029 | 20.53 |
| Bent Pipe         |   1976 | 20.00 |
| Converging Nozzle |   1971 | 19.95 |
| Heat Sink         |   1835 | 18.57 |

### By `media_type`

| media_type   |   rows |
|:-------------|-------:|
| image        |   4948 |
| video        |   4933 |

### By `(source_file, question_type)`

| source_file       |   Binary |   Multiple Choice |   Spatial |   Total |
|:------------------|---------:|------------------:|----------:|--------:|
| Bent Pipe         |      792 |               796 |       388 |    1976 |
| Converging Nozzle |      791 |               789 |       391 |    1971 |
| Heat Exchanger    |      812 |               811 |       406 |    2029 |
| Heat Sink         |      719 |               710 |       406 |    1835 |
| Mixing Pipe       |      828 |               828 |       414 |    2070 |

## Feature Schema

Identical to the parent dataset. See [`cmudrc/OpenSeeSimE-Fluid`](https://huggingface.co/datasets/cmudrc/OpenSeeSimE-Fluid) for full documentation of simulation generation, ground-truth extraction, preprocessing, limitations, and intended use.

```python
{
    'file_name': str,             # Unique identifier
    'source_file': str,           # Base simulation model
    'question': str,              # Question text
    'question_type': str,         # 'Binary', 'Multiple Choice', 'Spatial'
    'question_id': int,           # Question identifier (1-20)
    'answer': str,                # Ground truth answer
    'answer_choices': list[str],  # Options
    'correct_choice_idx': int,    # Index of correct answer
    'image': Image,               # PIL Image (1920x1440) or null for video rows
    'video': Video,               # Video bytes or null for image rows
    'media_type': str,            # 'image' or 'video'
}
```

## Intended Use

- Benchmark evaluation of vision-language models on engineering simulation question answering at reduced compute cost
- Smoke-testing of evaluation pipelines before running the full benchmark
- Comparative studies where storage or bandwidth constraints matter

## License

MIT, same as parent. Free for academic and commercial use with attribution.

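
Returning to the feature schema above, the sketch below shows one way an evaluation loop might consume rows: pick the populated media field via `media_type`, and score predicted choice indices against `correct_choice_idx` per `question_type`. The helper names `media_of` and `accuracy_by_question_type` are hypothetical, and rows are assumed to be plain dicts with the schema's field names; nothing here is prescribed by the dataset authors.

```python
from collections import defaultdict


def media_of(row):
    """Return the populated media field of a row (hypothetical helper).

    Relies on media_type being 'image' or 'video', which matches the names
    of the two media columns in the schema.
    """
    return row[row["media_type"]]


def accuracy_by_question_type(rows, predicted_idx):
    """Compute accuracy per question_type (hypothetical helper).

    `predicted_idx` holds one predicted choice index per row, compared
    against the row's correct_choice_idx.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for row, pred in zip(rows, predicted_idx):
        qtype = row["question_type"]
        total[qtype] += 1
        correct[qtype] += int(pred == row["correct_choice_idx"])
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```

With the Hugging Face `datasets` library installed, rows in this shape could presumably be obtained with `load_dataset("cmudrc/OpenSeeSimE-Fluid-Small", split="test")`; whether the `video` column decodes out of the box depends on the installed library version, so treat that call as an assumption to verify locally.
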
## Citation

```bibtex
@article{ezemba2024opensesime,
  title={OpenSeeSimE: A Large-Scale Benchmark to Assess Vision-Language Model Question Answering Capabilities in Engineering Simulations},
  author={Ezemba, Jessica and Pohl, Jason and Tucker, Conrad and McComb, Christopher},
  year={2025}
}
```

## Contact

**Jessica Ezemba** (jezemba@andrew.cmu.edu), Department of Mechanical Engineering, Carnegie Mellon University

Provider:
cmudrc