aquamuse

ModelScope Community. Updated 2025-12-05; indexed 2025-07-12.
Download link:
https://modelscope.cn/datasets/google-research-datasets/aquamuse
Resource description:
# Dataset Card for AQuaMuSe

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/google-research-datasets/aquamuse
- **Repository:** https://github.com/google-research-datasets/aquamuse
- **Paper:** https://arxiv.org/pdf/2010.12694.pdf
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

AQuaMuSe is a novel, scalable approach to automatically mine dual query-based multi-document summarization datasets, for extractive and abstractive summaries, using a question answering dataset (Google Natural Questions) and large document corpora (Common Crawl).

This dataset contains versions of the automatically generated datasets for abstractive and extractive query-based multi-document summarization, as described in the [AQuaMuSe paper](https://arxiv.org/pdf/2010.12694.pdf).
### Supported Tasks and Leaderboards

- **Abstractive** and **Extractive** query-based multi-document summarization
- Question Answering

### Languages

en : English

## Dataset Structure

### Data Instances

- `input_urls`: a `list` of `string` features.
- `query`: a `string` feature.
- `target`: a `string` feature.

Example:

```
{
    'input_urls': ['https://boxofficebuz.com/person/19653-charles-michael-davis'],
    'query': 'who is the actor that plays marcel on the originals',
    'target': "In February 2013, it was announced that Davis was cast in a lead role on The CW's new show The Originals, a spinoff of The Vampire Diaries, centered on the Original Family as they move to New Orleans, where Davis' character (a vampire named Marcel) currently rules."
}
```

### Data Fields

- `input_urls`: a `list` of `string` features.
  - List of URLs of the input documents to be summarized, pointing to [Common Crawl](https://commoncrawl.org/2017/07/june-2017-crawl-archive-now-available).
  - Dependencies: document URLs reference the [Common Crawl June 2017 Archive](https://commoncrawl.org/2017/07/june-2017-crawl-archive-now-available).
- `query`: a `string` feature.
  - Input query to be used as summarization context. This is derived from [Natural Questions](https://ai.google.com/research/NaturalQuestions/) user queries.
- `target`: a `string` feature.
  - Summarization target, derived from [Natural Questions](https://ai.google.com/research/NaturalQuestions/) long answers.
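
As an illustration of the record layout described above, the following is a minimal sketch that wraps one record in a typed container and checks the three documented fields. The names `AquamuseExample`, `parse_example` and `source_domains` are hypothetical helpers introduced here, not part of the dataset or of any library, and the type checks are assumptions derived only from the field descriptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List
from urllib.parse import urlparse


@dataclass(frozen=True)
class AquamuseExample:
    """One record holding the three documented fields (hypothetical container)."""
    input_urls: List[str]
    query: str
    target: str


def parse_example(record: Dict[str, Any]) -> AquamuseExample:
    """Check field types on a raw dict and convert it to an AquamuseExample.

    The checks mirror the field descriptions only; they are an assumption,
    not validation rules published with the dataset.
    """
    urls = record["input_urls"]
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise TypeError("input_urls must be a list of strings")
    for field in ("query", "target"):
        if not isinstance(record[field], str):
            raise TypeError(f"{field} must be a string")
    return AquamuseExample(list(urls), record["query"], record["target"])


def source_domains(example: AquamuseExample) -> List[str]:
    """Return the host name of each input URL, in the original order."""
    return [urlparse(u).netloc for u in example.input_urls]


# The example instance from the card, copied verbatim.
record = {
    "input_urls": ["https://boxofficebuz.com/person/19653-charles-michael-davis"],
    "query": "who is the actor that plays marcel on the originals",
    "target": (
        "In February 2013, it was announced that Davis was cast in a lead role "
        "on The CW's new show The Originals, a spinoff of The Vampire Diaries, "
        "centered on the Original Family as they move to New Orleans, where "
        "Davis' character (a vampire named Marcel) currently rules."
    ),
}

example = parse_example(record)
print(example.query)
print(source_domains(example))
```

Note that `input_urls` holds only references: the document text itself has to be retrieved separately from the Common Crawl June 2017 archive before a summarization model can consume it.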
### Data Splits

- This dataset has two high-level configurations: `abstractive` and `extractive`.
- Each configuration has the data splits `train`, `dev` and `test`.
- The data was originally in [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) format and has been parsed into the format specified in [Data Instances](#data-instances).

## Dataset Creation

### Curation Rationale

The dataset consists of automatically generated datasets for abstractive and extractive query-based multi-document summarization, as described in the [AQuaMuSe paper](https://arxiv.org/pdf/2010.12694.pdf).

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

The dataset curator is [sayalikulkarni](https://github.com/google-research-datasets/aquamuse/commits?author=sayalikulkarni), a contributor to the official GitHub repository for this dataset and one of the authors of the dataset's paper. The account handles of the other authors who took part in curating this dataset are not currently available, so the authors of the paper are listed here instead: Sayali Kulkarni, Sheide Chammas, Wan Zhu, Fei Sha, and Eugene Ie.
### Licensing Information

[More Information Needed]

### Citation Information

    @misc{kulkarni2020aquamuse,
        title={AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization},
        author={Sayali Kulkarni and Sheide Chammas and Wan Zhu and Fei Sha and Eugene Ie},
        year={2020},
        eprint={2010.12694},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
    }

### Contributions

Thanks to [@Karthik-Bhaskar](https://github.com/Karthik-Bhaskar) for adding this dataset.

Provider: maas
Created: 2025-07-07