coref-data/mmc_raw

Name: coref-data/mmc_raw
Creator: coref-data
Published: 2024-01-19 00:03:40
License: no description provided

Source: Hugging Face. Updated 2024-01-19; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/coref-data/mmc_raw

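Overview:

MMC (Multilingual Multiparty Coreference) is the dataset for the paper "Multilingual Coreference Resolution in Multiparty Dialogue" (TACL 2023). It is built from TV transcripts and targets entity coreference resolution in multiparty dialogue. Because gold-quality subtitles are available in multiple languages, the English annotations are reused through annotation projection to create coreference data in other languages (Chinese and Farsi).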
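To make the idea of annotation projection concrete, here is a minimal sketch, assuming token-level word alignments between an English utterance and its subtitle in another language are already available. It is not the authors' pipeline: the function names, the alignment format (pairs of source and target token indices), and the rule of taking the smallest covering target span are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]          # inclusive (start, end) token indices
Alignment = List[Tuple[int, int]]  # (source_token_idx, target_token_idx) pairs


def project_span(span: Span, alignment: Alignment) -> Optional[Span]:
    """Map a source-language mention span to the target language.

    Returns the smallest target span covering every target token aligned
    to a token inside the source span, or None if nothing is aligned
    (for example, a pronoun that is dropped in the target language).
    """
    start, end = span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None
    return (min(targets), max(targets))


def project_clusters(clusters: List[List[Span]], alignment: Alignment) -> List[List[Span]]:
    """Project every mention of every cluster, dropping unaligned mentions
    and any cluster left without mentions."""
    projected = []
    for cluster in clusters:
        spans = [p for p in (project_span(m, alignment) for m in cluster) if p is not None]
        if spans:
            projected.append(spans)
    return projected


# Hypothetical example: source tokens ["Ross", "said", "he", "left"],
# where "he" (index 2) has no aligned token in the target sentence.
example_alignment = [(0, 0), (1, 2), (3, 3)]
example_clusters = [[(0, 0), (2, 2)]]  # "Ross" and "he" corefer
print(project_clusters(example_clusters, example_alignment))
```

Projected clusters are only as good as the alignments they come from, which is consistent with the dataset offering both corrected and uncorrected configurations for the projected languages.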

Provider:

coref-data

Summary of original information

MMC (Multilingual Multiparty Coreference)

Dataset configurations

mmc_en:
- Train: mmc_en/train-*
- Dev: mmc_en/dev-*
- Test: mmc_en/test-*
mmc_fa:
- Train: mmc_fa/train-*
- Dev: mmc_fa/dev-*
- Test: mmc_fa/test-*
mmc_fa_corrected:
- Train: mmc_fa_corrected/train-*
- Dev: mmc_fa_corrected/dev-*
- Test: mmc_fa_corrected/test-*
mmc_zh_corrected:
- Train: mmc_zh_corrected/train-*
- Dev: mmc_zh_corrected/dev-*
- Test: mmc_zh_corrected/test-*
mmc_zh_uncorrected:
- Train: mmc_zh_uncorrected/train-*
- Dev: mmc_zh_uncorrected/dev-*
- Test: mmc_zh_uncorrected/test-*
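The configurations listed above can be selected by name when loading the dataset with the Hugging Face `datasets` library. The sketch below is a hedged example: `split_patterns` and `load_config` are hypothetical helper names, the split names are assumed to mirror the file prefixes shown above, and loading requires `datasets` to be installed and network access to the hub.

```python
from typing import Dict

# Configuration names and file prefixes as listed on this page.
CONFIGS = [
    "mmc_en",
    "mmc_fa",
    "mmc_fa_corrected",
    "mmc_zh_corrected",
    "mmc_zh_uncorrected",
]
SPLITS = ["train", "dev", "test"]


def split_patterns(config: str) -> Dict[str, str]:
    """Return the file glob for each split of a configuration."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    return {split: f"{config}/{split}-*" for split in SPLITS}


def load_config(config: str):
    """Load one configuration of coref-data/mmc_raw from the hub."""
    split_patterns(config)  # validates the configuration name
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset("coref-data/mmc_raw", config)


# Usage (needs the `datasets` package and network access):
#   mmc_en = load_config("mmc_en")
#   print(mmc_en)  # shows the available splits and their columns
```

Inspecting the returned object before indexing into it is the safer route, since the exact split and column names are not stated on this page.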

Data source

The dataset accompanies the paper "Multilingual Coreference Resolution in Multiparty Dialogue", published in TACL 2023.

Citation

@article{zheng-etal-2023-multilingual,
    title = "Multilingual Coreference Resolution in Multiparty Dialogue",
    author = "Zheng, Boyuan and Xia, Patrick and Yarmohammadi, Mahsa and Van Durme, Benjamin",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "11",
    year = "2023",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2023.tacl-1.52",
    doi = "10.1162/tacl_a_00581",
    pages = "922--940",
    abstract = "Existing multiparty dialogue datasets for entity coreference resolution are nascent, and many challenges are still unaddressed. We create a large-scale dataset, Multilingual Multiparty Coref (MMC), for this task based on TV transcripts. Due to the availability of gold-quality subtitles in multiple languages, we propose reusing the annotations to create silver coreference resolution data in other languages (Chinese and Farsi) via annotation projection. On the gold (English) data, off-the-shelf models perform relatively poorly on MMC, suggesting that MMC has broader coverage of multiparty coreference than prior datasets. On the silver data, we find success both using it for data augmentation and training from scratch, which effectively simulates the zero-shot cross-lingual setting.",
}