CoreSearch
ModelScope community (updated 2025-12-05, indexed 2025-08-02)
Download link:
https://modelscope.cn/datasets/Intel/CoreSearch
# The CoreSearch Dataset
A large-scale dataset for cross-document event coreference **search**.
- **Paper:** [Cross-document Event Coreference Search: Task, Dataset and Modeling](https://aclanthology.org/2022.emnlp-main.58)
- **<ins>CoreSearchV2:</ins>** A cleaner version of this dataset is now available at [https://huggingface.co/datasets/biu-nlp/CoreSearchV2](https://huggingface.co/datasets/biu-nlp/CoreSearchV2)
### Languages
English
## Load Dataset
You can read or download the dataset files by following the Hugging Face Hub instructions.<br>
For example, the code below loads the files in the CoreSearch DPR folder:
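```python
import json

# cached_download is deprecated and no longer available in recent
# huggingface_hub releases; hf_hub_download is the current equivalent.
from huggingface_hub import hf_hub_download

REPO_ID = "Intel/CoreSearch"
dpr_files = ["dpr/Dev.json", "dpr/Train.json", "dpr/Test.json"]

dpr_jsons = []
for _file in dpr_files:
    # Download the file (or reuse the local cache), then parse it as JSON.
    local_path = hf_hub_download(repo_id=REPO_ID, filename=_file, repo_type="dataset")
    with open(local_path, "r", encoding="utf-8") as f:
        dpr_jsons.append(json.load(f))
```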
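
This card does not describe the schema of the DPR files, so a reasonable first step after loading is to inspect what each file contains. The following is a minimal sketch: `describe_json` is a hypothetical helper (not part of the dataset tooling), and it assumes nothing about field names.

```python
from collections import Counter


def describe_json(obj, max_keys=10):
    """Return a short, schema-agnostic summary of a parsed JSON value."""
    if isinstance(obj, list):
        # Count the Python types of the elements, e.g. {"dict": 3}.
        kinds = Counter(type(item).__name__ for item in obj)
        summary = {"type": "list", "length": len(obj), "item_types": dict(kinds)}
        # If the first element is a dict, report up to max_keys of its keys.
        if obj and isinstance(obj[0], dict):
            summary["first_item_keys"] = sorted(obj[0])[:max_keys]
        return summary
    if isinstance(obj, dict):
        return {"type": "dict", "length": len(obj), "keys": sorted(obj)[:max_keys]}
    return {"type": type(obj).__name__}


# Toy input standing in for one loaded file; the real field names may differ.
example = [{"id": "q1", "text": "..."}, {"id": "q2", "text": "..."}]
print(describe_json(example))
```

Applied to the list built by the loading code, `describe_json(dpr_jsons[0])` would summarize the contents of `dpr/Dev.json`.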
### Data Splits
- **Final version of the cross-document (CD) event coreference search dataset**

| | Train | Valid | Test | Total |
| ----- | ------ | ----- | ---- | ---- |
| WEC-Eng Validated Data | | | | |
| # Clusters | 237 | 49 | 236 | 522 |
| # Passages (with Mentions) | 1,503 | 341 | 1,266 | 3,110 |
| # Added Distractor Passages | 922,736 | 923,376 | 923,746 | 2,769,858 |
| # Total Passages | 924,239 | 923,717 | 925,012 | 2,772,968 |
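
The rows of the table are related by simple arithmetic: each split's total is the number of passages with mentions plus the number of added passages, and the "Total" column sums the three splits. The short check below is only a sketch with the counts copied from the table by hand. The variable names are illustrative.

```python
# Counts copied from the table above, ordered (train, valid, test).
passages_with_mentions = (1_503, 341, 1_266)
added_passages = (922_736, 923_376, 923_746)
reported_totals = (924_239, 923_717, 925_012)

# Each split's total should equal mention passages + added passages.
computed_totals = tuple(m + a for m, a in zip(passages_with_mentions, added_passages))
assert computed_totals == reported_totals

# The "Total" column sums the three per-split values in each row.
assert sum(passages_with_mentions) == 3_110
assert sum(added_passages) == 2_769_858
assert sum(reported_totals) == 2_772_968

# Fraction of each split's passages that contain an annotated mention.
mention_share = [m / t for m, t in zip(passages_with_mentions, reported_totals)]
```

In every split, passages with mentions make up well under 1% of the total. This matches the search framing of the task, where relevant passages must be retrieved from a much larger pool.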
## Citation
```
@inproceedings{eirew-etal-2022-cross,
    title = "Cross-document Event Coreference Search: Task, Dataset and Modeling",
    author = "Eirew, Alon and
      Caciularu, Avi and
      Dagan, Ido",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.58",
    pages = "900--913",
    abstract = "The task of Cross-document Coreference Resolution has been traditionally formulated as requiring to identify all coreference links across a given set of documents. We propose an appealing, and often more applicable, complementary set up for the task {--} Cross-document Coreference Search, focusing in this paper on event coreference. Concretely, given a mention in context of an event of interest, considered as a query, the task is to find all coreferring mentions for the query event in a large document collection. To support research on this task, we create a corresponding dataset, which is derived from Wikipedia while leveraging annotations in the available Wikipedia Event Coreference dataset (WEC-Eng). Observing that the coreference search setup is largely analogous to the setting of Open Domain Question Answering, we adapt the prominent Deep Passage Retrieval (DPR) model to our setting, as an appealing baseline. Finally, we present a novel model that integrates a powerful coreference scoring scheme into the DPR architecture, yielding improved performance.",
}
```
## License
This dataset is provided under a <a href="https://creativecommons.org/licenses/by-sa/3.0/deed.en_US">Creative Commons Attribution-ShareAlike 3.0 Unported License</a>. It is based on content extracted from Wikipedia, which is licensed under the same Creative Commons Attribution-ShareAlike 3.0 Unported License.
## Contact
If you have any questions, please create a GitHub issue at <a href="https://github.com/AlonEirew/CoreSearch">https://github.com/AlonEirew/CoreSearch</a>.
Provider:
maas
Created:
2025-08-01



