WEC-Eng

Name: WEC-Eng
Creator: maas
Published: 2025-12-05 16:44:04
License: No description available

ModelScope community · Updated 2025-12-05 · Indexed 2025-08-30

Download link:

https://modelscope.cn/datasets/Intel/WEC-Eng

Resource description:

# WEC-Eng

A large-scale dataset for cross-document event coreference extracted from English Wikipedia.

- **Repository (Code for generating WEC):** https://github.com/AlonEirew/extract-wec
- **Paper:** https://aclanthology.org/2021.naacl-main.198/

### Languages

English

## Load Dataset

You can read in the WEC-Eng files as follows (using the **huggingface_hub** library):

```python
import json

from huggingface_hub import hf_hub_download

REPO_ID = "Intel/WEC-Eng"
splits_files = [
    "Dev_Event_gold_mentions_validated.json",
    "Test_Event_gold_mentions_validated.json",
    "Train_Event_gold_mentions.json",
]

wec_eng = []
for split_file in splits_files:
    # Download the split file into the local cache (or reuse it) and get its path.
    local_path = hf_hub_download(repo_id=REPO_ID, filename=split_file, repo_type="dataset")
    with open(local_path, "r", encoding="utf-8") as f:
        wec_eng.append(json.load(f))
```

## Dataset Structure

### Data Splits

- **Final version of the English CD event coreference dataset**
  - Train - Train_Event_gold_mentions.json
  - Dev - Dev_Event_gold_mentions_validated.json
  - Test - Test_Event_gold_mentions_validated.json

| | Train | Dev | Test |
| ----- | ------ | ----- | ---- |
| Clusters | 7,042 | 233 | 322 |
| Event Mentions | 40,529 | 1,250 | 1,893 |

- **Unfiltered version of the dataset, without the within-cluster lexical-diversity control**
  - All (experimental) - All_Event_gold_mentions_unfiltered.json

### Data Instances

```json
{
  "coref_chain": 2293469,
  "coref_link": "Family Values Tour 1998",
  "doc_id": "House of Pain",
  "mention_context": [
    "From",
    "then",
    "on",
    ",",
    "the",
    "members",
    "continued",
    "their"
  ],
  "mention_head": "Tour",
  "mention_head_lemma": "Tour",
  "mention_head_pos": "PROPN",
  "mention_id": "108172",
  "mention_index": 1,
  "mention_ner": "UNK",
  "mention_type": 8,
  "predicted_coref_chain": null,
  "sent_id": 2,
  "tokens_number": [
    50,
    51,
    52,
    53
  ],
  "tokens_str": "Family Values Tour 1998",
  "topic_id": -1
}
```

### Data Fields

|Field|Value Type|Value|
|---|:---:|---|
|coref_chain|Numeric|Coreference chain/cluster ID|
|coref_link|String|Title of the Wikipedia page/article the coreference link points to|
|doc_id|String|Title of the page/article containing the mention|
|mention_context|List[String]|Tokenized mention paragraph (including the mention)|
|mention_head|String|Mention span head token|
|mention_head_lemma|String|Mention span head token lemma|
|mention_head_pos|String|Mention span head token POS|
|mention_id|String|Mention ID|
|mention_index|Numeric|Mention index in the JSON file|
|mention_ner|String|Mention NER|
|tokens_number|List[Numeric]|IDs of the mention tokens within the context|
|tokens_str|String|Mention span text|
|topic_id|Ignore|Ignore|
|mention_type|Ignore|Ignore|
|predicted_coref_chain|Ignore|Ignore|
|sent_id|Ignore|Ignore|

## Citation

```
@inproceedings{eirew-etal-2021-wec,
    title = "{WEC}: Deriving a Large-scale Cross-document Event Coreference dataset from {W}ikipedia",
    author = "Eirew, Alon and Cattan, Arie and Dagan, Ido",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.naacl-main.198",
    doi = "10.18653/v1/2021.naacl-main.198",
    pages = "2498--2510",
    abstract = "Cross-document event coreference resolution is a foundational task for NLP applications involving multi-text processing. However, existing corpora for this task are scarce and relatively small, while annotating only modest-size clusters of documents belonging to the same topic. To complement these resources and enhance future research, we present Wikipedia Event Coreference (WEC), an efficient methodology for gathering a large-scale dataset for cross-document event coreference from Wikipedia, where coreference links are not restricted within predefined topics. We apply this methodology to the English Wikipedia and extract our large-scale WEC-Eng dataset. Notably, our dataset creation method is generic and can be applied with relatively little effort to other Wikipedia languages. To set baseline results, we develop an algorithm that adapts components of state-of-the-art models for within-document coreference resolution to the cross-document setting. Our model is suitably efficient and outperforms previously published state-of-the-art results for the task.",
}
```

## License

This dataset is provided under a <a href="https://creativecommons.org/licenses/by-sa/3.0/deed.en_US">Creative Commons Attribution-ShareAlike 3.0 Unported License</a>. It is based on content extracted from Wikipedia, which is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

## Contact

If you have any questions, please create a GitHub issue at https://github.com/AlonEirew/extract-wec.
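
The fields listed under Data Fields can be combined to rebuild coreference clusters and to recover each mention span from its context. The following is a minimal sketch, not the authors' tooling: the helper names `group_by_cluster` and `span_from_context` and the toy records are hypothetical, and it assumes that `tokens_number` holds zero-based indices into `mention_context`.

```python
from collections import defaultdict


def group_by_cluster(mentions):
    """Map each coref_chain ID to the list of mentions that share it."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[mention["coref_chain"]].append(mention)
    return dict(clusters)


def span_from_context(mention):
    """Rebuild the mention span from its token indices.

    Assumption: tokens_number contains zero-based positions in mention_context.
    """
    context = mention["mention_context"]
    return " ".join(context[i] for i in mention["tokens_number"])


# Hypothetical toy records (not taken from WEC-Eng) using only documented fields.
toy_mentions = [
    {
        "coref_chain": 1,
        "doc_id": "Doc A",
        "mention_context": ["The", "band", "joined", "the", "Summer", "Tour", "."],
        "tokens_number": [4, 5],
        "tokens_str": "Summer Tour",
    },
    {
        "coref_chain": 1,
        "doc_id": "Doc B",
        "mention_context": ["They", "played", "on", "the", "tour", "that", "year", "."],
        "tokens_number": [4],
        "tokens_str": "tour",
    },
    {
        "coref_chain": 2,
        "doc_id": "Doc C",
        "mention_context": ["The", "1990", "election", "was", "close", "."],
        "tokens_number": [1, 2],
        "tokens_str": "1990 election",
    },
]

clusters = group_by_cluster(toy_mentions)
for chain_id, members in clusters.items():
    # Each cluster groups mentions of one event across different documents.
    print(chain_id, [span_from_context(m) for m in members])
```

With a real split loaded as a list of mention dictionaries, the same two helpers can be applied to that list in place of `toy_mentions`; comparing `span_from_context(m)` against `m["tokens_str"]` is a quick way to check whether the zero-based indexing assumption holds for the files you downloaded.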


Provider:

maas

Created:

2025-08-01
