boschresearch/sofc_materials_articles
Source: Hugging Face · Updated 2024-01-18 · Indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/boschresearch/sofc_materials_articles
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- n<1K
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
- token-classification
- text-classification
task_ids:
- named-entity-recognition
- slot-filling
- topic-classification
pretty_name: SofcMaterialsArticles
dataset_info:
  features:
  - name: text
    dtype: string
  - name: sentence_offsets
    sequence:
    - name: begin_char_offset
      dtype: int64
    - name: end_char_offset
      dtype: int64
  - name: sentences
    sequence: string
  - name: sentence_labels
    sequence: int64
  - name: token_offsets
    sequence:
    - name: offsets
      sequence:
      - name: begin_char_offset
        dtype: int64
      - name: end_char_offset
        dtype: int64
  - name: tokens
    sequence:
      sequence: string
  - name: entity_labels
    sequence:
      sequence:
        class_label:
          names:
            '0': B-DEVICE
            '1': B-EXPERIMENT
            '2': B-MATERIAL
            '3': B-VALUE
            '4': I-DEVICE
            '5': I-EXPERIMENT
            '6': I-MATERIAL
            '7': I-VALUE
            '8': O
  - name: slot_labels
    sequence:
      sequence:
        class_label:
          names:
            '0': B-anode_material
            '1': B-cathode_material
            '2': B-conductivity
            '3': B-current_density
            '4': B-degradation_rate
            '5': B-device
            '6': B-electrolyte_material
            '7': B-experiment_evoking_word
            '8': B-fuel_used
            '9': B-interlayer_material
            '10': B-interconnect_material
            '11': B-open_circuit_voltage
            '12': B-power_density
            '13': B-resistance
            '14': B-support_material
            '15': B-thickness
            '16': B-time_of_operation
            '17': B-voltage
            '18': B-working_temperature
            '19': I-anode_material
            '20': I-cathode_material
            '21': I-conductivity
            '22': I-current_density
            '23': I-degradation_rate
            '24': I-device
            '25': I-electrolyte_material
            '26': I-experiment_evoking_word
            '27': I-fuel_used
            '28': I-interlayer_material
            '29': I-interconnect_material
            '30': I-open_circuit_voltage
            '31': I-power_density
            '32': I-resistance
            '33': I-support_material
            '34': I-thickness
            '35': I-time_of_operation
            '36': I-voltage
            '37': I-working_temperature
            '38': O
  - name: links
    sequence:
    - name: relation_label
      dtype:
        class_label:
          names:
            '0': coreference
            '1': experiment_variation
            '2': same_experiment
            '3': thickness
    - name: start_span_id
      dtype: int64
    - name: end_span_id
      dtype: int64
  - name: slots
    sequence:
    - name: frame_participant_label
      dtype:
        class_label:
          names:
            '0': anode_material
            '1': cathode_material
            '2': current_density
            '3': degradation_rate
            '4': device
            '5': electrolyte_material
            '6': fuel_used
            '7': interlayer_material
            '8': open_circuit_voltage
            '9': power_density
            '10': resistance
            '11': support_material
            '12': time_of_operation
            '13': voltage
            '14': working_temperature
    - name: slot_id
      dtype: int64
  - name: spans
    sequence:
    - name: span_id
      dtype: int64
    - name: entity_label
      dtype:
        class_label:
          names:
            '0': ''
            '1': DEVICE
            '2': MATERIAL
            '3': VALUE
    - name: sentence_id
      dtype: int64
    - name: experiment_mention_type
      dtype:
        class_label:
          names:
            '0': ''
            '1': current_exp
            '2': future_work
            '3': general_info
            '4': previous_work
    - name: begin_char_offset
      dtype: int64
    - name: end_char_offset
      dtype: int64
  - name: experiments
    sequence:
    - name: experiment_id
      dtype: int64
    - name: span_id
      dtype: int64
    - name: slots
      sequence:
      - name: frame_participant_label
        dtype:
          class_label:
            names:
              '0': anode_material
              '1': cathode_material
              '2': current_density
              '3': degradation_rate
              '4': conductivity
              '5': device
              '6': electrolyte_material
              '7': fuel_used
              '8': interlayer_material
              '9': open_circuit_voltage
              '10': power_density
              '11': resistance
              '12': support_material
              '13': time_of_operation
              '14': voltage
              '15': working_temperature
      - name: slot_id
        dtype: int64
  splits:
  - name: train
    num_bytes: 7402373
    num_examples: 26
  - name: test
    num_bytes: 2650700
    num_examples: 11
  - name: validation
    num_bytes: 1993857
    num_examples: 8
  download_size: 3733137
  dataset_size: 12046930
---
# Dataset Card for SofcMaterialsArticles
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [boschresearch/sofc-exp_textmining_resources](https://github.com/boschresearch/sofc-exp_textmining_resources)
- **Repository:** [boschresearch/sofc-exp_textmining_resources](https://github.com/boschresearch/sofc-exp_textmining_resources)
- **Paper:** [The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain](https://arxiv.org/abs/2006.03039)
- **Leaderboard:**
- **Point of Contact:** [Annemarie Friedrich](mailto:annemarie.friedrich@de.bosch.com)
### Dataset Summary
> The SOFC-Exp corpus contains 45 scientific publications about solid oxide fuel cells (SOFCs), published between 2013 and 2019 as open-access articles all with a CC-BY license. The dataset was manually annotated by domain experts with the following information:
>
> * Mentions of relevant experiments have been marked using a graph structure corresponding to instances of an Experiment frame (similar to the ones used in FrameNet). We assume that an Experiment frame is introduced to the discourse by mentions of words such as report, test or measure (also called the frame-evoking elements). The nodes corresponding to the respective tokens are the heads of the graphs representing the Experiment frame.
> * The Experiment frame related to SOFC-Experiments defines a set of 16 possible participant slots. Participants are annotated as dependents of links between the frame-evoking element and the participant node.
> * In addition, we provide coarse-grained entity/concept types for all frame participants, i.e., MATERIAL, VALUE or DEVICE. Note that this annotation has not been performed on the full texts but only on sentences containing information about relevant experiments, and a few sentences in addition. In the paper, we run experiments for both tasks only on the set of sentences marked as experiment-describing in the gold standard, which is admittedly a slightly simplified setting. Entity types are only partially annotated on other sentences. Slot filling could of course also be evaluated in a fully automatic setting with automatic experiment sentence detection as a first step.
### Supported Tasks and Leaderboards
- `topic-classification`: The dataset can be used to train a model for topic-classification, to identify sentences that mention SOFC-related experiments.
- `named-entity-recognition`: The dataset can be used to train a named entity recognition model to detect `MATERIAL`, `VALUE`, `DEVICE`, and `EXPERIMENT` entities.
- `slot-filling`: The slot-filling task is approached as fine-grained entity-typing-in-context, assuming that each sentence represents a single experiment frame. Sequence-tagging architectures tag the tokens of each experiment-describing sentence with the set of slot types.

The paper experiments with BiLSTM architectures using `BERT`- and `SciBERT`-generated token embeddings, as well as with `BERT` and `SciBERT` applied directly to the modeling task. A simple CRF architecture serves as a baseline for the sequence-tagging tasks. Implementations of the transformer-based architectures can be found in the `huggingface/transformers` library: [BERT](https://huggingface.co/bert-base-uncased), [SciBERT](https://huggingface.co/allenai/scibert_scivocab_uncased).
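
The slot-filling setting described above, where only experiment-describing sentences are tagged, can be sketched as follows. This is a minimal illustration rather than the authors' code: `experiment_sentences` and `toy_record` are hypothetical names, the record layout follows the Data Fields section of this card, and treating a `sentence_labels` value of 1 as "experiment-describing" is an assumption.

```python
def experiment_sentences(record):
    """Yield (tokens, slot_label_ids) for sentences flagged as experiment-describing.

    Assumes a sentence_labels value of 1 marks an experiment-describing sentence.
    """
    for is_experiment, tokens, slot_ids in zip(
        record["sentence_labels"], record["tokens"], record["slot_labels"]
    ):
        if is_experiment == 1:
            yield tokens, slot_ids


# Hypothetical toy record, not taken from the corpus. Slot ids follow the card
# metadata: 5 = B-device, 7 = B-experiment_evoking_word,
# 18 = B-working_temperature, 37 = I-working_temperature, 38 = O.
toy_record = {
    "sentence_labels": [0, 1],
    "tokens": [
        ["Fuel", "cells", "are", "promising", "."],
        ["The", "cell", "was", "tested", "at", "800", "°C", "."],
    ],
    "slot_labels": [
        [38, 38, 38, 38, 38],
        [38, 5, 38, 7, 38, 18, 37, 38],
    ],
}

for tokens, slot_ids in experiment_sentences(toy_record):
    print(tokens, slot_ids)
```

Filtering on the gold `sentence_labels` mirrors the simplified setting mentioned in the Dataset Summary; a fully automatic pipeline would replace the gold labels with the output of a sentence classifier.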
### Languages
This corpus is in English.
## Dataset Structure
### Data Instances
Because each example is the full text of an academic paper plus its annotations, a JSON-formatted example is too large to include in this README.
### Data Fields
- `text`: The full text of the paper.
- `sentence_offsets`: Start and end character offsets for each sentence in the text.
  - `begin_char_offset`: an `int64` feature.
  - `end_char_offset`: an `int64` feature.
- `sentences`: A sequence of the sentences in the text (using `sentence_offsets`).
- `sentence_labels`: Sequence of binary labels for whether a sentence contains information of interest.
- `token_offsets`: Sequence of sequences containing start and end character offsets for each token in each sentence in the text.
  - `offsets`: a dictionary feature containing:
    - `begin_char_offset`: an `int64` feature.
    - `end_char_offset`: an `int64` feature.
- `tokens`: Sequence of sequences containing the tokens for each sentence in the text.
  - `feature`: a `string` feature.
- `entity_labels`: a dictionary feature containing:
  - `feature`: a classification label, with possible values including `B-DEVICE`, `B-EXPERIMENT`, `B-MATERIAL`, `B-VALUE`, `I-DEVICE`.
- `slot_labels`: a dictionary feature containing:
  - `feature`: a classification label, with possible values including `B-anode_material`, `B-cathode_material`, `B-conductivity`, `B-current_density`, `B-degradation_rate`.
- `links`: a dictionary feature containing:
  - `relation_label`: a classification label, with possible values including `coreference`, `experiment_variation`, `same_experiment`, `thickness`.
  - `start_span_id`: an `int64` feature.
  - `end_span_id`: an `int64` feature.
- `slots`: a dictionary feature containing:
  - `frame_participant_label`: a classification label, with possible values including `anode_material`, `cathode_material`, `current_density`, `degradation_rate`, `device`.
  - `slot_id`: an `int64` feature.
- `spans`: a dictionary feature containing:
  - `span_id`: an `int64` feature.
  - `entity_label`: a classification label, with possible values including the empty string, `DEVICE`, `MATERIAL`, `VALUE`.
  - `sentence_id`: an `int64` feature.
  - `experiment_mention_type`: a classification label, with possible values including the empty string, `current_exp`, `future_work`, `general_info`, `previous_work`.
  - `begin_char_offset`: an `int64` feature.
  - `end_char_offset`: an `int64` feature.
- `experiments`: a dictionary feature containing:
  - `experiment_id`: an `int64` feature.
  - `span_id`: an `int64` feature.
  - `slots`: a dictionary feature containing:
    - `frame_participant_label`: a classification label, with possible values including `anode_material`, `cathode_material`, `current_density`, `degradation_rate`, `conductivity`.
    - `slot_id`: an `int64` feature.

Detailed information on each field can be found in the [corpus file formats section](https://github.com/boschresearch/sofc-exp_textmining_resources#corpus-file-formats) of the associated dataset repository.
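
To illustrate how the per-token `entity_labels` relate to `tokens`, here is a minimal sketch that decodes one sentence's BIO label ids into entity spans. The label order in `ENTITY_NAMES` is copied from the card metadata; the function name `bio_spans`, the example sentence, and the choice to treat an `I-` tag without a matching open span as the start of a new span are assumptions made for illustration.

```python
# Label order as listed in the card metadata for `entity_labels`.
ENTITY_NAMES = [
    "B-DEVICE", "B-EXPERIMENT", "B-MATERIAL", "B-VALUE",
    "I-DEVICE", "I-EXPERIMENT", "I-MATERIAL", "I-VALUE", "O",
]


def bio_spans(tokens, label_ids, names=ENTITY_NAMES):
    """Group one sentence's tokens into (entity_type, text) spans from BIO label ids."""
    spans = []
    current_type, current_tokens = None, []
    for token, label_id in zip(tokens, label_ids):
        prefix, _, entity_type = names[label_id].partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the currently open span.
            current_tokens.append(token)
            continue
        if current_type is not None:
            # Any other tag closes the open span.
            spans.append((current_type, " ".join(current_tokens)))
        if prefix == "O":
            current_type, current_tokens = None, []
        else:
            # "B-" tag, or an "I-" tag with no matching open span (assumption:
            # treat it as the start of a new span).
            current_type, current_tokens = entity_type, [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical sentence, not taken from the corpus.
tokens = ["The", "SOFC", "was", "tested", "at", "800", "°C"]
label_ids = [8, 0, 8, 1, 8, 3, 7]
print(bio_spans(tokens, label_ids))
```

The same function can decode `slot_labels` if the 39 slot label names from the card metadata are passed as `names`.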
### Data Splits
This dataset consists of three splits:
| | Train | Valid | Test |
| ----- | ------ | ----- | ---- |
| Input Examples | 26 | 8 | 11 |

The authors propose using the training data in a 5-fold cross-validation setting for development and tuning, and then applying the final model(s) to the independent test set.
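
The 5-fold cross-validation over the 26 training documents could be set up along the following lines. This is a sketch under stated assumptions: the helper `k_fold_indices`, the random shuffling, and the seed are hypothetical choices for illustration, and the fold assignment the authors actually used is not described in this card.

```python
import random


def k_fold_indices(n_examples, k=5, seed=0):
    """Yield (train_indices, dev_indices) pairs for k-fold cross-validation.

    Documents are shuffled once with a fixed seed and dealt round-robin into
    k folds; each fold serves as the development set exactly once.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        dev = sorted(folds[held_out])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != held_out for idx in fold
        )
        yield train, dev


# 26 training documents, as listed in the splits table.
for fold_number, (train, dev) in enumerate(k_fold_indices(26), start=1):
    print(f"fold {fold_number}: {len(train)} train docs, {len(dev)} dev docs")
```

Splitting at the document level rather than the sentence level keeps all sentences of a paper in the same fold, which avoids leaking a paper's content between training and development data.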
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
The corpus consists of 45 open-access scientific publications about SOFCs and related research, annotated by domain experts.
### Annotations
#### Annotation process
For manual annotation, the authors used the INCEpTION annotation tool (Klie et al., 2018).
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
The manual annotations created for the SOFC-Exp corpus are licensed under a [Creative Commons Attribution 4.0 International License (CC-BY-4.0)](https://creativecommons.org/licenses/by/4.0/).
### Citation Information
```
@misc{friedrich2020sofcexp,
title={The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain},
author={Annemarie Friedrich and Heike Adel and Federico Tomazic and Johannes Hingerl and Renou Benteau and Anika Maruscyk and Lukas Lange},
year={2020},
eprint={2006.03039},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Contributions
Thanks to [@ZacharySBrown](https://github.com/ZacharySBrown) for adding this dataset.
Provider:
boschresearch
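Collected summary
Dataset introduction
Construction
The dataset consists of 45 scientific publications about solid oxide fuel cells (SOFCs), all published between 2013 and 2019 as open-access articles under a CC-BY license. Domain experts annotated the publications manually, marking experiment mentions, the participant slots of the Experiment frame, and entity types. Annotation was done with the INCEpTION tool to support precise and consistent labels.
Usage
The dataset supports several natural language processing tasks, including topic classification, named entity recognition, and slot filling. For topic classification, a model identifies sentences that describe SOFC-related experiments. For named entity recognition, the entity annotations are used to detect materials, devices, values, and other entities. For slot filling, fine-grained entity types label the participants of the Experiment frame. The dataset can be loaded and processed through the Hugging Face platform and used with pre-trained models such as BERT and SciBERT.
Background and challenges
Background overview
The SOFC-Exp dataset was created in 2020 by a research team at Bosch Research to provide high-quality annotated data for information extraction in the solid oxide fuel cell (SOFC) domain. It contains 45 open-access scientific publications published between 2013 and 2019 and covers key information about SOFC experiments, materials, and devices. Through manual expert annotation, the dataset marks the key entities of the Experiment frame (materials, devices, and values) and also provides the frame's participant slots. It serves as a benchmark for information extraction in materials science and promotes the use of natural language processing in the analysis of scientific literature.
Current challenges
The SOFC-Exp dataset faces several challenges in construction and use. First, the linguistic complexity of scientific literature makes entity recognition and slot filling difficult, especially with specialized terminology and complex experiment descriptions. Second, the dataset is relatively small (only 45 publications), which may limit how well models generalize. Third, annotation depends heavily on domain experts, which makes it costly and hard to scale. Finally, although the annotations are rich, they cover only experiment-describing sentences; entity types on other sentences are incompletely annotated, which may affect model performance on broader text.
Common scenarios
Typical use case
In materials science, the SOFC-Exp dataset is widely used to train and evaluate natural language processing models, particularly for information extraction from literature on solid oxide fuel cells (SOFCs). Its expert-annotated Experiment frames and entity types give models rich context for identifying and classifying key elements of experiment descriptions, such as materials, devices, and experimental parameters.
Academic problems addressed
The SOFC-Exp dataset addresses the complexity of information extraction from materials science literature. Its detailed Experiment frame and entity annotations help researchers develop models that automatically identify and classify key information in experiment descriptions. This makes literature analysis more efficient and gives the materials science field new tools and methods for knowledge discovery.
Practical applications
In practice, the SOFC-Exp dataset is used to develop automated literature analysis tools that help researchers quickly extract and organize research results on solid oxide fuel cells. Such tools can be applied in academic research, industrial R&D, and policy making, improving the efficiency and accuracy of information processing.
Recent research on the dataset
Latest research directions
In SOFC materials research, the boschresearch/sofc_materials_articles dataset provides experimental data for information extraction tasks. With the rapid development of deep learning in recent years, research based on this dataset has focused mainly on named entity recognition (NER), slot filling, and topic classification. In particular, with pre-trained language models such as BERT and SciBERT, researchers can identify material, device, and experiment entities more accurately and extract key information from the Experiment frame. This work advances automated information processing in materials science and provides data-driven decision support for optimizing and innovating SOFC technology.
The content above was collected and summarized by 遇见数据集.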



