DFKI-SLT/fabner

Name: DFKI-SLT/fabner
Creator: DFKI-SLT
Published: 2024-05-15 13:18:00
License: no description available

Hugging Face | Updated 2024-05-15 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/DFKI-SLT/fabner


Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- en
license:
- other
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets: []
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: FabNER is a manufacturing text dataset for Named Entity Recognition.
tags:
- manufacturing
- 2000-2020
dataset_info:
- config_name: fabner
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B-MATE
          '2': I-MATE
          '3': E-MATE
          '4': S-MATE
          '5': B-MANP
          '6': I-MANP
          '7': E-MANP
          '8': S-MANP
          '9': B-MACEQ
          '10': I-MACEQ
          '11': E-MACEQ
          '12': S-MACEQ
          '13': B-APPL
          '14': I-APPL
          '15': E-APPL
          '16': S-APPL
          '17': B-FEAT
          '18': I-FEAT
          '19': E-FEAT
          '20': S-FEAT
          '21': B-PRO
          '22': I-PRO
          '23': E-PRO
          '24': S-PRO
          '25': B-CHAR
          '26': I-CHAR
          '27': E-CHAR
          '28': S-CHAR
          '29': B-PARA
          '30': I-PARA
          '31': E-PARA
          '32': S-PARA
          '33': B-ENAT
          '34': I-ENAT
          '35': E-ENAT
          '36': S-ENAT
          '37': B-CONPRI
          '38': I-CONPRI
          '39': E-CONPRI
          '40': S-CONPRI
          '41': B-MANS
          '42': I-MANS
          '43': E-MANS
          '44': S-MANS
          '45': B-BIOP
          '46': I-BIOP
          '47': E-BIOP
          '48': S-BIOP
  splits:
  - name: train
    num_bytes: 4394010
    num_examples: 9435
  - name: validation
    num_bytes: 934347
    num_examples: 2183
  - name: test
    num_bytes: 940136
    num_examples: 2064
  download_size: 1265830
  dataset_size: 6268493
- config_name: fabner_bio
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B-MATE
          '2': I-MATE
          '3': B-MANP
          '4': I-MANP
          '5': B-MACEQ
          '6': I-MACEQ
          '7': B-APPL
          '8': I-APPL
          '9': B-FEAT
          '10': I-FEAT
          '11': B-PRO
          '12': I-PRO
          '13': B-CHAR
          '14': I-CHAR
          '15': B-PARA
          '16': I-PARA
          '17': B-ENAT
          '18': I-ENAT
          '19': B-CONPRI
          '20': I-CONPRI
          '21': B-MANS
          '22': I-MANS
          '23': B-BIOP
          '24': I-BIOP
  splits:
  - name: train
    num_bytes: 4394010
    num_examples: 9435
  - name: validation
    num_bytes: 934347
    num_examples: 2183
  - name: test
    num_bytes: 940136
    num_examples: 2064
  download_size: 1258672
  dataset_size: 6268493
- config_name: fabner_simple
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': MATE
          '2': MANP
          '3': MACEQ
          '4': APPL
          '5': FEAT
          '6': PRO
          '7': CHAR
          '8': PARA
          '9': ENAT
          '10': CONPRI
          '11': MANS
          '12': BIOP
  splits:
  - name: train
    num_bytes: 4394010
    num_examples: 9435
  - name: validation
    num_bytes: 934347
    num_examples: 2183
  - name: test
    num_bytes: 940136
    num_examples: 2064
  download_size: 1233960
  dataset_size: 6268493
- config_name: text2tech
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': Technological System
          '2': Method
          '3': Material
          '4': Technical Field
  splits:
  - name: train
    num_bytes: 4394010
    num_examples: 9435
  - name: validation
    num_bytes: 934347
    num_examples: 2183
  - name: test
    num_bytes: 940136
    num_examples: 2064
  download_size: 1192966
  dataset_size: 6268493
configs:
- config_name: fabner
  data_files:
  - split: train
    path: fabner/train-*
  - split: validation
    path: fabner/validation-*
  - split: test
    path: fabner/test-*
  default: true
- config_name: fabner_bio
  data_files:
  - split: train
    path: fabner_bio/train-*
  - split: validation
    path: fabner_bio/validation-*
  - split: test
    path: fabner_bio/test-*
- config_name: fabner_simple
  data_files:
  - split: train
    path: fabner_simple/train-*
  - split: validation
    path: fabner_simple/validation-*
  - split: test
    path: fabner_simple/test-*
- config_name: text2tech
  data_files:
  - split: train
    path: text2tech/train-*
  - split: validation
    path: text2tech/validation-*
  - split: test
    path: text2tech/test-*
---

# Dataset Card for FabNER

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://figshare.com/articles/dataset/Dataset_NER_Manufacturing_-_FabNER_Information_Extraction_from_Manufacturing_Process_Science_Domain_Literature_Using_Named_Entity_Recognition/14782407](https://figshare.com/articles/dataset/Dataset_NER_Manufacturing_-_FabNER_Information_Extraction_from_Manufacturing_Process_Science_Domain_Literature_Using_Named_Entity_Recognition/14782407)
- **Paper:** ["FabNER": information extraction from manufacturing process science domain literature using named entity recognition](https://par.nsf.gov/servlets/purl/10290810)
- **Size of downloaded dataset files:** 3.79 MB
- **Size of the generated dataset:** 6.27 MB

### Dataset Summary

FabNER is a manufacturing text corpus of 350,000+ words for Named Entity Recognition.
It is a collection of abstracts obtained from Web of Science through known journals available in manufacturing process science research.
For every word, there were categories/entity labels defined, namely Material (MATE), Manufacturing Process (MANP), Machine/Equipment (MACEQ), Application (APPL), Features (FEAT), Mechanical Properties (PRO), Characterization (CHAR), Parameters (PARA), Enabling Technology (ENAT), Concept/Principles (CONPRI), Manufacturing Standards (MANS) and BioMedical (BIOP).
Annotation was performed in all categories along with the output tag in 'BIOES' format: B=Beginning, I=Intermediate, O=Outside, E=End, S=Single.
For details about the dataset, please refer to the paper: ["FabNER": information extraction from manufacturing process science domain literature using named entity recognition](https://par.nsf.gov/servlets/purl/10290810)

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

The language in the dataset is English.

## Dataset Structure

### Data Instances

- **Size of downloaded dataset files:** 3.79 MB
- **Size of the generated dataset:** 6.27 MB

An example of 'train' looks as follows:

```json
{
  "id": "0",
  "tokens": ["Revealed", "the", "location-specific", "flow", "patterns", "and", "quantified", "the", "speeds", "of", "various", "types", "of", "flow", "."],
  "ner_tags": [0, 0, 0, 46, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
```

### Data Fields

#### fabner

- `id`: the instance id of this sentence, a `string` feature.
- `tokens`: the list of tokens of this sentence, a `list` of `string` features.
- `ner_tags`: the list of entity tags, a `list` of classification labels.

```json
{"O": 0, "B-MATE": 1, "I-MATE": 2, "O-MATE": 3, "E-MATE": 4, "S-MATE": 5, "B-MANP": 6, "I-MANP": 7, "O-MANP": 8, "E-MANP": 9, "S-MANP": 10, "B-MACEQ": 11, "I-MACEQ": 12, "O-MACEQ": 13, "E-MACEQ": 14, "S-MACEQ": 15, "B-APPL": 16, "I-APPL": 17, "O-APPL": 18, "E-APPL": 19, "S-APPL": 20, "B-FEAT": 21, "I-FEAT": 22, "O-FEAT": 23, "E-FEAT": 24, "S-FEAT": 25, "B-PRO": 26, "I-PRO": 27, "O-PRO": 28, "E-PRO": 29, "S-PRO": 30, "B-CHAR": 31, "I-CHAR": 32, "O-CHAR": 33, "E-CHAR": 34, "S-CHAR": 35, "B-PARA": 36, "I-PARA": 37, "O-PARA": 38, "E-PARA": 39, "S-PARA": 40, "B-ENAT": 41, "I-ENAT": 42, "O-ENAT": 43, "E-ENAT": 44, "S-ENAT": 45, "B-CONPRI": 46, "I-CONPRI": 47, "O-CONPRI": 48, "E-CONPRI": 49, "S-CONPRI": 50, "B-MANS": 51, "I-MANS": 52, "O-MANS": 53, "E-MANS": 54, "S-MANS": 55, "B-BIOP": 56, "I-BIOP": 57, "O-BIOP": 58, "E-BIOP": 59, "S-BIOP": 60}
```

#### fabner_bio

- `id`: the instance id of this sentence, a `string` feature.
- `tokens`: the list of tokens of this sentence, a `list` of `string` features.
- `ner_tags`: the list of entity tags, a `list` of classification labels.

```json
{"O": 0, "B-MATE": 1, "I-MATE": 2, "B-MANP": 3, "I-MANP": 4, "B-MACEQ": 5, "I-MACEQ": 6, "B-APPL": 7, "I-APPL": 8, "B-FEAT": 9, "I-FEAT": 10, "B-PRO": 11, "I-PRO": 12, "B-CHAR": 13, "I-CHAR": 14, "B-PARA": 15, "I-PARA": 16, "B-ENAT": 17, "I-ENAT": 18, "B-CONPRI": 19, "I-CONPRI": 20, "B-MANS": 21, "I-MANS": 22, "B-BIOP": 23, "I-BIOP": 24}
```

#### fabner_simple

- `id`: the instance id of this sentence, a `string` feature.
- `tokens`: the list of tokens of this sentence, a `list` of `string` features.
- `ner_tags`: the list of entity tags, a `list` of classification labels.

```json
{"O": 0, "MATE": 1, "MANP": 2, "MACEQ": 3, "APPL": 4, "FEAT": 5, "PRO": 6, "CHAR": 7, "PARA": 8, "ENAT": 9, "CONPRI": 10, "MANS": 11, "BIOP": 12}
```

#### text2tech

- `id`: the instance id of this sentence, a `string` feature.
- `tokens`: the list of tokens of this sentence, a `list` of `string` features.
- `ner_tags`: the list of entity tags, a `list` of classification labels.

```json
{"O": 0, "Technological System": 1, "Method": 2, "Material": 3, "Technical Field": 4}
```

### Data Splits

|        | Train | Dev  | Test |
|--------|-------|------|------|
| fabner | 9435  | 2183 | 2064 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{DBLP:journals/jim/KumarS22,
  author    = {Aman Kumar and Binil Starly},
  title     = {"FabNER": information extraction from manufacturing process science domain literature using named entity recognition},
  journal   = {J. Intell. Manuf.},
  volume    = {33},
  number    = {8},
  pages     = {2393--2407},
  year      = {2022},
  url       = {https://doi.org/10.1007/s10845-021-01807-x},
  doi       = {10.1007/s10845-021-01807-x},
  timestamp = {Sun, 13 Nov 2022 17:52:57 +0100},
  biburl    = {https://dblp.org/rec/journals/jim/KumarS22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

### Contributions

Thanks to [@phucdev](https://github.com/phucdev) for adding this dataset.


Provided by:

DFKI-SLT

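Summary of the Original Information

Dataset Card Overview

Dataset Description

Dataset Summary

FabNER is a manufacturing text dataset of more than 350,000 words for Named Entity Recognition (NER). It consists of abstracts collected from known journals in Web of Science that cover manufacturing process science research. Every word carries a category/entity label from the following set: Material (MATE), Manufacturing Process (MANP), Machine/Equipment (MACEQ), Application (APPL), Features (FEAT), Mechanical Properties (PRO), Characterization (CHAR), Parameters (PARA), Enabling Technology (ENAT), Concept/Principles (CONPRI), Manufacturing Standards (MANS), and BioMedical (BIOP). Annotations use the BIOES format: B=Beginning, I=Intermediate, O=Outside, E=End, S=Single.

Supported Tasks and Leaderboards

The dataset supports the Named Entity Recognition (NER) task.

Languages

The language in the dataset is English.

Dataset Structure

Data Instances

An example training instance looks as follows:

```json
{
  "id": "0",
  "tokens": ["Revealed", "the", "location-specific", "flow", "patterns", "and", "quantified", "the", "speeds", "of", "various", "types", "of", "flow", "."],
  "ner_tags": [0, 0, 0, 46, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
```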

Data Fields

fabner

id: the ID of the sentence instance, a string.
tokens: the list of words in the sentence, a list of strings.
ner_tags: the list of entity tags, a list of classification labels.

The label mapping is as follows:

```json
{"O": 0, "B-MATE": 1, "I-MATE": 2, "O-MATE": 3, "E-MATE": 4, "S-MATE": 5, "B-MANP": 6, "I-MANP": 7, "O-MANP": 8, "E-MANP": 9, "S-MANP": 10, "B-MACEQ": 11, "I-MACEQ": 12, "O-MACEQ": 13, "E-MACEQ": 14, "S-MACEQ": 15, "B-APPL": 16, "I-APPL": 17, "O-APPL": 18, "E-APPL": 19, "S-APPL": 20, "B-FEAT": 21, "I-FEAT": 22, "O-FEAT": 23, "E-FEAT": 24, "S-FEAT": 25, "B-PRO": 26, "I-PRO": 27, "O-PRO": 28, "E-PRO": 29, "S-PRO": 30, "B-CHAR": 31, "I-CHAR": 32, "O-CHAR": 33, "E-CHAR": 34, "S-CHAR": 35, "B-PARA": 36, "I-PARA": 37, "O-PARA": 38, "E-PARA": 39, "S-PARA": 40, "B-ENAT": 41, "I-ENAT": 42, "O-ENAT": 43, "E-ENAT": 44, "S-ENAT": 45, "B-CONPRI": 46, "I-CONPRI": 47, "O-CONPRI": 48, "E-CONPRI": 49, "S-CONPRI": 50, "B-MANS": 51, "I-MANS": 52, "O-MANS": 53, "E-MANS": 54, "S-MANS": 55, "B-BIOP": 56, "I-BIOP": 57, "O-BIOP": 58, "E-BIOP": 59, "S-BIOP": 60}
```
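The numbering in this mapping follows a regular pattern: `O` is 0, and each entity type then takes five consecutive ids in the order B, I, O, E, S. The sketch below rebuilds the table from that pattern and decodes the training example from Data Instances into entity spans. The names `ENTITY_TYPES`, `ID2LABEL`, and `decode_spans` are illustrative and not part of the dataset, and treating the `O-<type>` labels as "outside" is an assumption. The card's YAML metadata lists a different, 49-label numbering, so in practice the label names are better read from the loaded dataset than hard-coded.

```python
ENTITY_TYPES = ["MATE", "MANP", "MACEQ", "APPL", "FEAT", "PRO",
                "CHAR", "PARA", "ENAT", "CONPRI", "MANS", "BIOP"]

# Rebuild the id -> label table: "O" first, then B/I/O/E/S for each entity type.
ID2LABEL = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in "BIOES"]


def decode_spans(tokens, tag_ids):
    """Return (text, entity_type) pairs for each B..E span and each S token."""
    spans, start = [], None
    for i, tag_id in enumerate(tag_ids):
        prefix, _, etype = ID2LABEL[tag_id].partition("-")
        if prefix == "O":        # plain "O" and "O-<type>" (assumed to mean outside)
            start = None
        elif prefix == "S":      # single-token entity
            spans.append((tokens[i], etype))
            start = None
        elif prefix == "B":      # entity opens here
            start = i
        elif prefix == "E" and start is not None:  # entity closes here
            spans.append((" ".join(tokens[start:i + 1]), etype))
            start = None
        # "I" tags continue the open entity; no type-consistency check is done.
    return spans


tokens = ["Revealed", "the", "location-specific", "flow", "patterns", "and",
          "quantified", "the", "speeds", "of", "various", "types", "of", "flow", "."]
ner_tags = [0, 0, 0, 46, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(decode_spans(tokens, ner_tags))
# [('flow patterns', 'CONPRI')]
```

Under this numbering, ids 46 and 49 are `B-CONPRI` and `E-CONPRI`, so the example marks "flow patterns" as a Concept/Principles entity.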

fabner_bio

id: the ID of the sentence instance, a string.
tokens: the list of words in the sentence, a list of strings.
ner_tags: the list of entity tags, a list of classification labels.

The label mapping is as follows:

```json
{"O": 0, "B-MATE": 1, "I-MATE": 2, "B-MANP": 3, "I-MANP": 4, "B-MACEQ": 5, "I-MACEQ": 6, "B-APPL": 7, "I-APPL": 8, "B-FEAT": 9, "I-FEAT": 10, "B-PRO": 11, "I-PRO": 12, "B-CHAR": 13, "I-CHAR": 14, "B-PARA": 15, "I-PARA": 16, "B-ENAT": 17, "I-ENAT": 18, "B-CONPRI": 19, "I-CONPRI": 20, "B-MANS": 21, "I-MANS": 22, "B-BIOP": 23, "I-BIOP": 24}
```
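The fabner_bio label set has no E or S prefixes. The source does not say how this configuration was derived. One common way to collapse BIOES tags to BIO, shown here purely as an assumption, is to rewrite `E-X` as `I-X` and `S-X` as `B-X`. `bioes_to_bio` is a hypothetical helper name.

```python
def bioes_to_bio(labels):
    """Collapse BIOES string labels to BIO: E-X -> I-X, S-X -> B-X."""
    remap = {"E": "I", "S": "B"}
    out = []
    for label in labels:
        prefix, sep, etype = label.partition("-")
        if sep and prefix in remap:
            out.append(f"{remap[prefix]}-{etype}")
        else:
            out.append(label)  # "O", B-X and I-X pass through unchanged
    return out


print(bioes_to_bio(["O", "B-CONPRI", "E-CONPRI", "S-MATE"]))
# ['O', 'B-CONPRI', 'I-CONPRI', 'B-MATE']
```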

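fabner_simple

id: the ID of the sentence instance, a string.
tokens: the list of words in the sentence, a list of strings.
ner_tags: the list of entity tags, a list of classification labels.

The label mapping is as follows:

```json
{"O": 0, "MATE": 1, "MANP": 2, "MACEQ": 3, "APPL": 4, "FEAT": 5, "PRO": 6, "CHAR": 7, "PARA": 8, "ENAT": 9, "CONPRI": 10, "MANS": 11, "BIOP": 12}
```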
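The fabner_simple labels keep only the entity type. A minimal sketch of that projection follows, assuming it amounts to stripping the positional prefix from a BIOES or BIO label (the source does not confirm the procedure). `SIMPLE_LABEL2ID` mirrors the mapping listed for this configuration, and `to_simple` is a hypothetical helper.

```python
SIMPLE_LABELS = ["O", "MATE", "MANP", "MACEQ", "APPL", "FEAT", "PRO",
                 "CHAR", "PARA", "ENAT", "CONPRI", "MANS", "BIOP"]
SIMPLE_LABEL2ID = {name: i for i, name in enumerate(SIMPLE_LABELS)}


def to_simple(label):
    """Drop a B-/I-/E-/S- prefix and return the fabner_simple id."""
    _, sep, etype = label.partition("-")
    return SIMPLE_LABEL2ID[etype if sep else label]


print([to_simple(label) for label in ["O", "B-CONPRI", "E-CONPRI", "S-MATE"]])
# [0, 10, 10, 1]
```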

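text2tech

id: the ID of the sentence instance, a string.
tokens: the list of words in the sentence, a list of strings.
ner_tags: the list of entity tags, a list of classification labels.

The label mapping is as follows:

```json
{"O": 0, "Technological System": 1, "Method": 2, "Material": 3, "Technical Field": 4}
```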

Data Splits

|        | Train | Validation | Test |
|--------|-------|------------|------|
| fabner | 9435  | 2183       | 2064 |
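As a quick check on these counts, the short sketch below sums the three splits and prints each split's share of the total. It uses only the numbers from the table; the variable names are illustrative.

```python
splits = {"train": 9435, "validation": 2183, "test": 2064}

total = sum(splits.values())  # 13682 sentences overall
for name, count in splits.items():
    # Share of each split, formatted as a percentage with one decimal place.
    print(f"{name}: {count} ({count / total:.1%})")
```

The split is roughly 69% train, 16% validation, and 15% test.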

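Dataset Creation

Source Data

The dataset consists of abstracts collected from known journals in Web of Science that cover manufacturing process science research.

Annotation Process

The annotations were produced by experts in the BIOES format: B=Beginning, I=Intermediate, O=Outside, E=End, S=Single.

Considerations for Using the Data

Social Impact of Dataset

The dataset is mainly intended to improve the accuracy of information extraction and named entity recognition in the manufacturing domain.

Discussion of Biases

The dataset may contain domain-specific biases, which should be assessed and adjusted for before use.

Other Known Limitations

The dataset may be constrained by domain-specific terminology and phrasing and may not suit all general-purpose scenarios.

Additional Information

Licensing Information

The dataset's license type is "other".

Citation Information

```
@article{DBLP:journals/jim/KumarS22,
  author    = {Aman Kumar and Binil Starly},
  title     = {"FabNER": information extraction from manufacturing process science domain literature using named entity recognition},
  journal   = {J. Intell. Manuf.},
  volume    = {33},
  number    = {8},
  pages     = {2393--2407},
  year      = {2022},
  url       = {https://doi.org/10.1007/s10845-021-01807-x},
  doi       = {10.1007/s10845-021-01807-x},
  timestamp = {Sun, 13 Nov 2022 17:52:57 +0100},
  biburl    = {https://dblp.org/rec/journals/jim/KumarS22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

Contributions

Thanks to @phucdev for adding this dataset.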

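Collected Overview

Dataset Introduction

Construction

FabNER was built from annotations by domain experts. It collects abstracts from the manufacturing process science literature and defines detailed entity categories and labels for recognizing named entities in the text. The dataset is divided into training, validation, and test sets. Each set contains tokenized sentences with their entity labels, which follow the 'BIOES' format to mark the beginning, inside, and end of an entity, or a single-token entity.

Usage

Researchers can choose the configuration that fits their needs. The dataset can be loaded with the Hugging Face datasets library, whose splits give access to the training, validation, and test sets. Each instance consists of a unique identifier, a list of tokens, and the corresponding entity tags, which can be used for model training, evaluation, and testing.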
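A minimal sketch of loading one configuration follows, assuming the Hugging Face `datasets` package is installed and the Hub is reachable. The repository id `DFKI-SLT/fabner` and the four configuration names come from the dataset card; `load_fabner` and `label_names` are illustrative helpers, not part of any library.

```python
CONFIGS = ("fabner", "fabner_bio", "fabner_simple", "text2tech")


def load_fabner(config="fabner"):
    """Load one FabNER configuration as a DatasetDict with train/validation/test."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check above works without the package installed.
    from datasets import load_dataset
    return load_dataset("DFKI-SLT/fabner", config)


def label_names(dataset_dict):
    """Read the tag names from the dataset itself instead of hard-coding them."""
    return dataset_dict["train"].features["ner_tags"].feature.names


# Example (needs network access, so it is left commented out):
# ds = load_fabner("fabner_bio")
# names = label_names(ds)
# first = ds["train"][0]
# print([(tok, names[tag]) for tok, tag in zip(first["tokens"], first["ner_tags"])])
```

Reading the names through `features` keeps the code independent of whichever label numbering a given configuration uses.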

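Background and Challenges

Background Overview

FabNER is an expert-annotated named entity recognition dataset for manufacturing text. It was introduced by Aman Kumar and Binil Starly (J. Intell. Manuf., 2022), is published on Hugging Face under the DFKI-SLT organization, and is tagged with the period 2000-2020. It aims to extract information from the manufacturing process science literature. It contains a collection of abstracts of more than 350,000 words obtained from Web of Science and covers twelve entity categories: materials, manufacturing processes, machines/equipment, applications, features, mechanical properties, characterization, parameters, enabling technologies, concepts/principles, manufacturing standards, and biomedical. The dataset is a resource for research on information extraction and entity recognition in manufacturing text.

Current Challenges

Building FabNER posed several challenges. First, manufacturing text is specialized and complex, which makes entity recognition difficult. Second, keeping annotation quality and consistency high requires domain expertise and careful work. Third, the variety and size of the data place particular demands on annotation and processing. As a domain task, the challenge FabNER poses is to accurately identify and classify the various entity types in manufacturing literature so as to support effective information extraction.

Common Use Cases

Typical Use Case

The typical use of FabNER is named entity recognition on manufacturing text. The dataset contains abstracts from the manufacturing process science literature, annotated with entity categories such as materials, manufacturing processes, equipment, and applications, and gives researchers a means of automatically extracting key information from text.

Research Problems Addressed

The dataset addresses the research problem of extracting key information from manufacturing literature, such as automatically identifying materials, equipment, and manufacturing standards. This can improve the efficiency and quality of information retrieval and supports knowledge management and technical innovation in manufacturing.

Practical Applications

In practice, FabNER can be used to build intelligent information retrieval systems that help engineers quickly locate key technologies and parameters of manufacturing processes, which in turn supports optimizing production workflows and making manufacturing more intelligent.

Recent Research on the Dataset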