mazri24/ifc_ner_dataset
Source: Hugging Face (updated 2026-03-05, indexed 2026-03-29)
Download link: https://hf-mirror.com/datasets/mazri24/ifc_ner_dataset
Description:
---
language:
- en
license: mit
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: IFC NER Dataset
size_categories:
- 100K<n<1M
tags:
- bim
- ifc
- aec
- openbim
- named-entity-recognition
- ner
- synthetic-data
- ai
---
# IFC NER Dataset
A synthetic Natural Language Processing (NLP) dataset for Named Entity Recognition (NER) in the Architecture, Engineering, and Construction (AEC) domain, generated from Industry Foundation Classes (IFC) models.
This dataset enables training models to extract structured BIM information from natural language sentences.
⸻
## 📌 Overview
The dataset consists of automatically generated sentences derived from IFC files.
Each sentence is:
- Tokenized
- Labeled using the BIO tagging scheme
The dataset supports extraction of:
- IFC entities (e.g., IfcWall, IfcDoor)
- Relations (e.g., has, contains, part_of)
- Properties (e.g., Height, FireRating)
- Materials
- Property values
⸻
## 📂 Dataset Structure
The dataset follows the Hugging Face DatasetDict format:
- train
- validation
- test
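As a minimal sketch of accessing these splits, assuming the repository loads with the Hugging Face `datasets` library under its default configuration. The repository id is taken from the dataset URL; `load_ifc_ner` and `missing_splits` are hypothetical helper names, not part of the dataset.

```python
EXPECTED_SPLITS = ("train", "validation", "test")


def load_ifc_ner(repo_id="mazri24/ifc_ner_dataset"):
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helper below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id)


def missing_splits(dataset_dict, expected=EXPECTED_SPLITS):
    """Return the expected split names absent from a DatasetDict-like mapping."""
    return [name for name in expected if name not in dataset_dict]


# Usage (left commented out because it downloads data):
# ds = load_ifc_ner()
# assert not missing_splits(ds)
# print(ds["train"][0])
```

Checking the split names up front is a cheap guard in case the repository layout differs from what this card describes.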
Each example looks like:
``` python
{
"tokens": ["The", "wall", "has", "height", "3.0", "meters"],
"bio": ["O", "B-IFC_ENTITY", "B-RELATION", "B-PROPERTY", "O", "O"]
}
```
Where:
- `tokens`: the tokenized sentence
- `bio`: BIO-formatted NER tags, one per token, aligned with `tokens`
BIO tag meanings:
- `B-XXX` → first token of an entity of type `XXX`
- `I-XXX` → subsequent token inside an entity of type `XXX`
- `O` → token outside any entity
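To make the tagging scheme concrete, the following sketch groups BIO tags back into entity spans. `bio_to_spans` is a hypothetical helper written for illustration. Its handling of a stray `I-` tag (one that does not continue an open span of the same type) is an assumption: such tokens are treated as outside any entity.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continue the open span of the same type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


example = {
    "tokens": ["The", "wall", "has", "height", "3.0", "meters"],
    "bio": ["O", "B-IFC_ENTITY", "B-RELATION", "B-PROPERTY", "O", "O"],
}
print(bio_to_spans(example["tokens"], example["bio"]))
# → [('IFC_ENTITY', 'wall'), ('RELATION', 'has'), ('PROPERTY', 'height')]
```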
⸻
## 🏗 Data Generation
Sentences are programmatically generated from real IFC models.
The generation includes:
- Entity mention templates
- Property description templates
- Relationship-based sentences
- Passive voice variants
- Question-style sentences
- Multi-property sentences
This keeps the sentences structurally consistent with real BIM data while introducing controlled linguistic variation.
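The card does not publish the generator itself, so the following is only a sketch of how template-based generation with aligned BIO tags could work. The template format, the `fill_template` helper, and the slot names are all hypothetical. Only the label names `IFC_ENTITY`, `RELATION`, and `PROPERTY` come from the example in this card; the value and unit slots are tagged `O` to match that example.

```python
def fill_template(template, slots):
    """Expand a template into parallel token and BIO-tag lists.

    A plain string in `template` is a literal token tagged "O".
    A (slot_name, label) tuple is replaced by the slot's value; its words are
    tagged B-label / I-label, or "O" when label is None.
    """
    tokens, tags = [], []
    for item in template:
        if isinstance(item, str):
            tokens.append(item)
            tags.append("O")
            continue
        slot_name, label = item
        for i, word in enumerate(slots[slot_name].split()):
            tokens.append(word)
            if label is None:
                tags.append("O")
            else:
                tags.append(("B-" if i == 0 else "I-") + label)
    return {"tokens": tokens, "bio": tags}


# Hypothetical property-description template.
PROPERTY_TEMPLATE = [
    "The",
    ("entity", "IFC_ENTITY"),
    ("relation", "RELATION"),
    ("property", "PROPERTY"),
    ("value", None),
    ("unit", None),
]

sample = fill_template(
    PROPERTY_TEMPLATE,
    {"entity": "wall", "relation": "has", "property": "height",
     "value": "3.0", "unit": "meters"},
)
print(sample)
```

Because tokens and tags are produced in the same loop, they stay aligned by construction, which is the main practical advantage of generating labels from templates rather than annotating text afterwards.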
⸻
## 🎯 Intended Use
This dataset is suitable for:
- Token classification (NER)
- Domain adaptation for AEC NLP models
- Text → structured BIM workflows
- Text → IDS generation pipelines
- BIM-aware information extraction systems
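For the token classification use case, string tags usually need to be mapped to integer ids before training. The sketch below derives that mapping from the data rather than assuming a fixed label list, since this card does not enumerate the full tag set. `build_label_maps` and `encode_tags` are hypothetical helper names, and placing `O` at id 0 is a convention chosen here, not something the dataset prescribes.

```python
def build_label_maps(examples, tag_key="bio"):
    """Collect the tag set from examples and map each tag to an integer id."""
    labels = sorted({tag for ex in examples for tag in ex[tag_key]})
    # Put "O" first so that id 0 means "outside any entity".
    if "O" in labels:
        labels.remove("O")
        labels.insert(0, "O")
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def encode_tags(example, label2id, tag_key="bio"):
    """Convert one example's string tags to ids, checking alignment first."""
    if len(example["tokens"]) != len(example[tag_key]):
        raise ValueError("tokens and tags must have the same length")
    return [label2id[tag] for tag in example[tag_key]]


examples = [
    {
        "tokens": ["The", "wall", "has", "height", "3.0", "meters"],
        "bio": ["O", "B-IFC_ENTITY", "B-RELATION", "B-PROPERTY", "O", "O"],
    }
]
label2id, id2label = build_label_maps(examples)
print(encode_tags(examples[0], label2id))
# → [0, 1, 3, 2, 0, 0]
```

When fine-tuning a subword-based model, these word-level ids still have to be aligned to subword tokens; that step depends on the tokenizer and is not shown here.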
⸻
## 📊 Scale
- Generated from real IFC models
- 300k+ labeled sentences
- BIO token-level annotation
⸻
## ⚠️ Limitations
- Sentences are synthetically generated.
- Linguistic diversity depends on template variety.
- Does not cover the full IFC schema.
- Distribution reflects the source IFC models used.
Provider: mazri24



