
MixRED

Source: arXiv | Updated: 2024-03-23 | Indexed: 2024-08-06
Download link:
http://arxiv.org/abs/2403.15696v1
Description:
MixRED is the first mixed-language relation extraction dataset, created by the State Key Laboratory for Novel Software Technology at Nanjing University. It targets relation extraction in code-switching scenarios. The dataset is built by fusing English and Chinese documents: a systematic framework identifies key linguistic elements in the text and replaces them with semantically equivalent elements in the other language, with the aim of improving models' understanding of mixed-language content. In addition to the mixed-language documents, MixRED provides monolingual English and Chinese versions, which makes it suitable for bilingual research. Construction uses a multi-level mixing strategy (inter-sentence, intra-sentence, and entity-level) and generates mixed samples at varying language concentrations. MixRED is mainly intended for relation extraction in mixed-language settings and serves as a resource for improving model performance in those scenarios.
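To make the mixing levels more concrete, here is a minimal Python sketch of two of them: entity-level replacement and inter-sentence mixing. It is an illustration only and not the MixRED authors' pipeline. The lexicon, the function names `mix_entities` and `inter_sentence_mix`, and the `ratio` parameter (a stand-in for "language concentration") are all hypothetical.

```python
import random

# Hypothetical toy lexicon: English entity mention -> Chinese equivalent.
# A real pipeline would need aligned entity annotations, not a lookup table.
TOY_LEXICON = {
    "Nanjing University": "南京大学",
    "China": "中国",
}


def mix_entities(text, lexicon, ratio=1.0, seed=0):
    """Entity-level mixing: swap each lexicon mention found in `text`
    for its other-language equivalent with probability `ratio`."""
    rng = random.Random(seed)
    # Longest mentions first, so a short mention cannot split a longer one.
    found = sorted((m for m in lexicon if m in text), key=len, reverse=True)
    for mention in found:
        if rng.random() < ratio:
            text = text.replace(mention, lexicon[mention])
    return text


def inter_sentence_mix(en_sentences, zh_sentences, ratio=0.5, seed=0):
    """Inter-sentence mixing: for each aligned sentence pair, keep the
    Chinese sentence with probability `ratio`, otherwise the English one."""
    rng = random.Random(seed)
    return [
        zh if rng.random() < ratio else en
        for en, zh in zip(en_sentences, zh_sentences)
    ]


if __name__ == "__main__":
    sentence = "Nanjing University is in China."
    print(mix_entities(sentence, TOY_LEXICON, ratio=1.0))
    # prints: 南京大学 is in 中国.
```

In this sketch, `ratio=0.0` leaves the English text untouched and `ratio=1.0` replaces every matched mention. Intermediate values give partially mixed samples, which is one plausible reading of "varying language concentrations".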
Provided by:
State Key Laboratory for Novel Software Technology, Nanjing University, China
Created:
2024-03-23