allenai/cochrane_sparse_mean
Source: Hugging Face · Updated: 2022-11-24 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/allenai/cochrane_sparse_mean
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- en
license:
- apache-2.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|other-MS^2
- extended|other-Cochrane
task_categories:
- summarization
- text2text-generation
paperswithcode_id: multi-document-summarization
pretty_name: MSLR Shared Task
---
This is a copy of the [Cochrane](https://huggingface.co/datasets/allenai/mslr2022) dataset, except that the input source documents of its `validation` split have been replaced with documents retrieved by a __sparse__ retriever. The retrieval pipeline used:
- __query__: The `target` field of each example
- __corpus__: The union of all documents in the `train`, `validation` and `test` splits. A document is the concatenation of the `title` and `abstract`.
- __retriever__: BM25 via [PyTerrier](https://pyterrier.readthedocs.io/en/latest/) with default settings
- __top-k strategy__: `"mean"`, i.e. the number of documents retrieved, `k`, is set to the mean number of documents per example in this dataset, in this case `k==9`
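The pipeline above can be illustrated with a minimal, standard-library sketch. It is not the PyTerrier implementation used to build this dataset: `make_document`, `mean_top_k` and `bm25_rank` are hypothetical helper names, the whitespace tokenizer, the single-space separator, the rounding rule, and the `k1`/`b` values are all assumptions, and real PyTerrier indexing applies its own tokenization, stemming and stopword handling.

```python
import math
from collections import Counter


def make_document(title, abstract):
    # A corpus document is the title and abstract joined together.
    # The single-space separator is an assumption; the card only says "concatenation".
    return f"{title} {abstract}".strip()


def mean_top_k(doc_counts):
    # "mean" strategy: k is the average number of source documents per example.
    # Rounding to the nearest integer is an assumption.
    return max(1, round(sum(doc_counts) / len(doc_counts)))


def bm25_rank(query, docs, k, k1=1.2, b=0.75):
    # Toy BM25 over whitespace tokens; assumes at least one non-empty document.
    # k1 and b are common textbook values, not necessarily PyTerrier's defaults.
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(query.lower().split())

    scored = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, idx))

    # Highest score first; ties broken by original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

In this sketch the query would be an example's `target` text, `docs` would be the concatenated documents from all three splits, and `k` would come from `mean_top_k` applied to the per-example document counts.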
Retrieval results on the `train` set:
| Recall@100 | Rprec | Precision@k | Recall@k |
| ----------- | ----------- | ----------- | ----------- |
| 0.7014 | 0.3841 | 0.2976 | 0.4157 |
Retrieval results on the `validation` set:
| Recall@100 | Rprec | Precision@k | Recall@k |
| ----------- | ----------- | ----------- | ----------- |
| 0.7226 | 0.4023 | 0.3095 | 0.4443 |
Retrieval results on the `test` set:
N/A. The test set is blind, so no queries are available.
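For readers unfamiliar with the metrics in the tables above, the following sketch shows their standard information-retrieval definitions for a single query. The card does not state which evaluation tool produced the reported numbers, so the function names here are hypothetical and the code is only an illustration; it assumes `relevant` is a non-empty set of gold document IDs and `retrieved` is a ranked list. Recall@100 is `recall_at_k` with `k=100`, and the reported figures would be averages over all queries in a split.

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved documents that are relevant.
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents that appear in the top k.
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def r_precision(retrieved, relevant):
    # Precision at rank R, where R is the number of relevant documents.
    r = len(relevant)
    hits = sum(1 for doc_id in retrieved[:r] if doc_id in relevant)
    return hits / r


# Example: three gold documents, four retrieved.
ranked = ["d1", "d5", "d2", "d9"]
gold = {"d1", "d2", "d3"}
p_at_2 = precision_at_k(ranked, gold, k=2)
r_at_2 = recall_at_k(ranked, gold, k=2)
rprec = r_precision(ranked, gold)
```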
Provider:
allenai
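Summary of original information
Dataset overview
Basic information
- Language: English (`en`)
- License: Apache-2.0
- Multilinguality: monolingual
- Size: 10,000 < n < 100,000
Data sources and creators
- Annotation creators: expert-generated
- Language creators: expert-generated
- Source datasets:
  - extended from `other-MS^2`
  - extended from `other-Cochrane`
Task categories
- Summarization (`summarization`)
- Text-to-text generation (`text2text-generation`)
Dataset name and identifiers
- Pretty name: MSLR Shared Task
- paperswithcode ID: multi-document-summarization
Data processing
- Validation-set input documents: replaced with documents returned by a sparse retriever
- Retrieval pipeline:
  - Query: the `target` field of each example
  - Corpus: the union of all documents in the `train`, `validation` and `test` splits; a document is the concatenation of `title` and `abstract`
  - Retriever: BM25 via PyTerrier with default settings
  - Top-k strategy: `"mean"`, i.e. the number of retrieved documents `k` is set to the mean number of documents per example in this dataset; here `k==9`
Retrieval results
- Training set:
  - Recall@100: 0.7014
  - Rprec: 0.3841
  - Precision@k: 0.2976
  - Recall@k: 0.4157
- Validation set:
  - Recall@100: 0.7226
  - Rprec: 0.4023
  - Precision@k: 0.3095
  - Recall@k: 0.4443
- Test set: no data, because the test set is blind and no queries are available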



