
ms_marco

ModelScope Community. Updated 2026-01-06; indexed 2025-07-26.
Download link:
https://modelscope.cn/datasets/microsoft/ms_marco
Resource description:
# Dataset Card for "ms_marco"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://microsoft.github.io/msmarco/](https://microsoft.github.io/msmarco/)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 1.55 GB
- **Size of the generated dataset:** 4.72 GB
- **Total amount of disk used:** 6.28 GB

### Dataset Summary

Starting with a paper released at NIPS 2016, MS MARCO is a collection of datasets focused on deep learning in search.
The first dataset was a question answering dataset featuring 100,000 real Bing questions, each with a human-generated answer. Since then we have released a 1,000,000-question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset.

There have been 277 submissions: 20 KeyPhrase Extraction submissions, 87 passage ranking submissions, 0 document ranking submissions, 73 QnA V2 submissions, 82 NLGEN submissions, and 15 QnA V1 submissions.

This data comes in three tasks/forms: Original QnA dataset (v1.1), Question Answering (v2.1), and Natural Language Generation (v2.1).

The original question answering dataset featured 100,000 examples and was released in 2016. Its leaderboard is now closed, but the data is available below.

The current competitive tasks are Question Answering and Natural Language Generation. Question Answering features over 1,000,000 queries and is much like the original QnA dataset, but bigger and of higher quality. The Natural Language Generation dataset features 180,000 examples and builds upon the QnA dataset to deliver answers that could be spoken by a smart speaker.

version v1.1

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### v1.1

- **Size of downloaded dataset files:** 168.69 MB
- **Size of the generated dataset:** 434.61 MB
- **Total amount of disk used:** 603.31 MB

An example of 'train' looks as follows.

```
```

#### v2.1

- **Size of downloaded dataset files:** 1.38 GB
- **Size of the generated dataset:** 4.29 GB
- **Total amount of disk used:** 5.67 GB

An example of 'validation' looks as follows.

```
```

### Data Fields

The data fields are the same among all splits.

#### v1.1

- `answers`: a `list` of `string` features.
- `passages`: a dictionary feature containing:
  - `is_selected`: an `int32` feature.
  - `passage_text`: a `string` feature.
  - `url`: a `string` feature.
- `query`: a `string` feature.
- `query_id`: an `int32` feature.
- `query_type`: a `string` feature.
- `wellFormedAnswers`: a `list` of `string` features.

#### v2.1

- `answers`: a `list` of `string` features.
- `passages`: a dictionary feature containing:
  - `is_selected`: an `int32` feature.
  - `passage_text`: a `string` feature.
  - `url`: a `string` feature.
- `query`: a `string` feature.
- `query_id`: an `int32` feature.
- `query_type`: a `string` feature.
- `wellFormedAnswers`: a `list` of `string` features.

### Data Splits

|name|train |validation| test |
|----|-----:|---------:|-----:|
|v1.1| 82326|     10047|  9650|
|v2.1|808731|    101093|101092|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Citation Information

```
@article{DBLP:journals/corr/NguyenRSGTMD16,
  author        = {Tri Nguyen and Mir Rosenberg and Xia Song and Jianfeng Gao and Saurabh Tiwary and Rangan Majumder and Li Deng},
  title         = {{MS} {MARCO:} {A} Human Generated MAchine Reading COmprehension Dataset},
  journal       = {CoRR},
  volume        = {abs/1611.09268},
  year          = {2016},
  url           = {http://arxiv.org/abs/1611.09268},
  archivePrefix = {arXiv},
  eprint        = {1611.09268},
  timestamp     = {Mon, 13 Aug 2018 16:49:03 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/NguyenRSGTMD16.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```

### Contributions

Thanks to [@mariamabarham](https://github.com/mariamabarham), [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun) for adding this dataset.
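
### Example: working with a record

The card leaves its example instances empty, so the following is only a minimal sketch of how the fields listed under Data Fields and the counts under Data Splits might be used. The record is hypothetical (its values, including the `query_type` string and the `example.com` URLs, are invented for illustration), and it assumes that the sub-fields of `passages` are parallel lists with one entry per passage. The helper names `selected_passages` and `split_shares` are not part of any library.

```python
from typing import Dict, List

# Hypothetical record shaped like the fields listed in this card.
# Assumption: the sub-fields of `passages` are parallel lists, one entry per passage.
example_record = {
    "query_id": 1,
    "query": "what is ms marco",
    "query_type": "DESCRIPTION",  # placeholder value, not taken from the data
    "answers": ["A collection of datasets focused on deep learning in search."],
    "wellFormedAnswers": [],
    "passages": {
        "is_selected": [0, 1],
        "passage_text": [
            "An unrelated passage.",
            "MS MARCO is a collection of datasets focused on deep learning in search.",
        ],
        "url": ["https://example.com/a", "https://example.com/b"],
    },
}


def selected_passages(record: dict) -> List[Dict[str, str]]:
    """Return the passages whose `is_selected` flag equals 1."""
    passages = record["passages"]
    rows = zip(passages["is_selected"], passages["passage_text"], passages["url"])
    return [
        {"passage_text": text, "url": url}
        for flag, text, url in rows
        if flag == 1
    ]


# Split sizes copied from the Data Splits table.
SPLIT_SIZES = {
    "v1.1": {"train": 82326, "validation": 10047, "test": 9650},
    "v2.1": {"train": 808731, "validation": 101093, "test": 101092},
}


def split_shares(version: str) -> Dict[str, float]:
    """Return each split's fraction of all examples in the given version."""
    sizes = SPLIT_SIZES[version]
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    for passage in selected_passages(example_record):
        print(passage["url"], "->", passage["passage_text"])
    for version in SPLIT_SIZES:
        shares = split_shares(version)
        print(version, {name: round(share, 3) for name, share in shares.items()})
```

Keeping the passage filter separate from any loading code means the same helper can be applied to records from either version, since the card states that both share the same fields.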
Provider:
maas
Created:
2025-07-22