pborchert/CORE

Name: pborchert/CORE
Creator: pborchert
Published: 2023-10-21 18:12:18
License: cc-by-4.0

Source: Hugging Face. Updated 2023-10-21; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/pborchert/CORE

Resource description:

---
license: cc-by-4.0
task_categories:
- text-classification
- zero-shot-classification
tags:
- relation-classification
- relation-extraction
- few-shot
- domain-adaptation
- business
- finance
language:
- en
size_categories:
- 1K<n<10K
---

# Dataset Card for CORE: A Few-Shot Company Relation Classification Dataset for Robust Domain Adaptation

CORE includes 4,708 instances of 12 relation types with corresponding textual evidence extracted from company Wikipedia pages. It contains an annotated NOTA (none-of-the-above) category.

## Dataset Details

### Dataset Description

We introduce CORE, a dataset for few-shot relation classification (RC) focused on company relations and business entities. CORE includes 4,708 instances of 12 relation types with corresponding textual evidence extracted from company Wikipedia pages. Company names and business entities pose a challenge for few-shot RC models because of the rich and diverse information associated with them. For example, a company name may represent the legal entity, products, people, or business divisions depending on the context. Deriving the relation type between entities therefore depends heavily on textual context.

To evaluate the performance of state-of-the-art RC models on CORE, we conduct experiments in the few-shot domain adaptation setting. Our results reveal substantial performance gaps, confirming that models trained on different domains struggle to adapt to CORE. We also find that models trained on CORE show improved out-of-domain performance, which highlights the importance of high-quality data for robust domain adaptation. Specifically, the information richness embedded in business entities allows models to focus on contextual nuances, reducing their reliance on superficial clues such as relation-specific verbs. In addition to the dataset, we provide relevant code snippets to facilitate reproducibility and encourage further research in the field.

### Dataset Sources

- **Repository:** https://github.com/pnborchert/CORE
- **Paper:** https://arxiv.org/abs/2310.12024

## Dataset Structure

The dataset is split into training and test instances with **overlapping relation types**. Because both splits cover the same relation types, the relation types used for test episodes should be excluded from the training set in the episode sampling procedure; see [sample_configuration.py](https://github.com/pnborchert/CORE/blob/master/benchmark/fs/sample_configuration.py).

- `train`: Contains 4000 training instances and 12 relation types.
- `test`: Contains 708 instances and 12 relation types.
- `relation_description`: Textual descriptions of the relation types.

## Citation

```bibtex
@misc{borchert2023core,
  title={CORE: A Few-Shot Company Relation Classification Dataset for Robust Domain Adaptation},
  author={Philipp Borchert and Jochen De Weerdt and Kristof Coussement and Arno De Caigny and Marie-Francine Moens},
  year={2023},
  eprint={2310.12024},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Provided by:

pborchert

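Summary of original information

Dataset card: CORE, a few-shot company relation classification dataset for robust domain adaptation

Dataset overview

CORE contains 4,708 instances covering 12 relation types, each with textual evidence extracted from company Wikipedia pages. The dataset includes an annotated NOTA (none-of-the-above) category.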

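Dataset details

Dataset description

CORE is a few-shot relation classification (RC) dataset focused on company relations and business entities. Company names and business entities are difficult for few-shot RC models because of the rich and diverse information associated with them. For example, a company name may denote the legal entity, products, people, or business divisions depending on the context, so deriving the relation type between entities depends heavily on textual context. To evaluate state-of-the-art RC models on CORE, the authors run experiments in the few-shot domain adaptation setting. The results show substantial performance gaps, confirming that models trained on other domains struggle to adapt to CORE. Models trained on CORE, in turn, show improved out-of-domain performance, which highlights the importance of high-quality data for robust domain adaptation. Specifically, the information richness of business entities lets models focus on contextual nuances and rely less on superficial clues such as relation-specific verbs. Alongside the dataset, the authors provide code snippets to support reproducibility and encourage further research.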

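Dataset structure

The dataset is split into training and test instances with overlapping relation types. Relation types used in the test set should be excluded from the training set during episode sampling (see sample_configuration.py).

train: Contains 4000 training instances and 12 relation types.
test: Contains 708 instances and 12 relation types.
relation_description: Textual descriptions of the relation types.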
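
The exclusion step described above can be illustrated with a minimal sketch of N-way K-shot episode sampling. This is not the repository's `sample_configuration.py`. The instance schema (a dict with a `relation` key), the helper names `exclude_test_relations` and `sample_episode`, and the toy relation names are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def exclude_test_relations(train, test_relations):
    """Drop training instances whose relation type is reserved for test episodes."""
    held_out = set(test_relations)
    return [ex for ex in train if ex["relation"] not in held_out]


def sample_episode(instances, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode and return (support, query) lists."""
    by_relation = defaultdict(list)
    for ex in instances:
        by_relation[ex["relation"]].append(ex)

    # Only relations with enough instances can fill both support and query sets.
    eligible = sorted(
        rel for rel, exs in by_relation.items() if len(exs) >= k_shot + n_query
    )
    if len(eligible) < n_way:
        raise ValueError("not enough relation types with sufficient instances")

    support, query = [], []
    for relation in rng.sample(eligible, n_way):
        picked = rng.sample(by_relation[relation], k_shot + n_query)
        support.extend(picked[:k_shot])
        query.extend(picked[k_shot:])
    return support, query


# Toy data standing in for the real train split (hypothetical relation names).
toy_train = [{"relation": f"rel_{i % 6}", "text": f"sentence {i}"} for i in range(60)]

# Reserve two relation types for test episodes and remove them from training.
train_pool = exclude_test_relations(toy_train, test_relations=["rel_4", "rel_5"])
support, query = sample_episode(
    train_pool, n_way=3, k_shot=2, n_query=1, rng=random.Random(0)
)
```

Passing an explicit `random.Random` instance keeps episodes reproducible across runs, which matters when comparing few-shot models on the same sampled episodes.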

