C3

Name: C3
Creator: maas
Published: 2025-09-22 15:05:35
License: No description available

ModelScope community: updated 2025-09-22, indexed 2024-08-31

Download link:

https://modelscope.cn/datasets/OmniData/C3


Resource description:

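---
displayName: C3
labelTypes:
  - Chinese Corpus
license:
  - C3 Custom
mediaTypes:
  - Text
paperUrl: https://arxiv.org/pdf/1904.09679v3.pdf
publishDate: "2019-01-01"
publishUrl: https://github.com/nlpdata/c3
publisher:
  - Cornell University
  - Tencent AI Lab
tags: []
taskTypes:
  - Machine Reading Comprehension
  - Reading Comprehension
  - Language Modelling
  - Common Sense Reasoning Few-Shot
  - Common Sense Reasoning Zero-Shot
  - Common Sense Reasoning One-Shot
---

## Introduction

C3 is a free-form multiple-choice Chinese machine reading comprehension dataset. We present the first free-form multiple-choice Chinese machine reading comprehension dataset (C^3), containing 13,369 documents (dialogues or more formally written mixed-genre texts) and their associated 19,577 free-form multiple-choice questions collected from Chinese-as-a-second-language examinations.

We present a comprehensive analysis of the prior knowledge (i.e., linguistic, domain-specific, and general world knowledge) needed for these real-world problems. We implement rule-based and popular neural methods and find that there is still a significant performance gap between the best-performing model (68.5%) and human readers (96.0%), especially on problems that require prior knowledge. We further study the effects of distractor plausibility, and of data augmentation based on translated relevant English datasets, on model performance.

We expect C^3 to present great challenges to existing systems, as answering 86.8% of the questions requires knowledge both within and beyond the accompanying document. We hope that C^3 can serve as a platform for studying how to leverage various kinds of prior knowledge to better understand a given written or orally oriented text. C^3 is available at https://dataset.org/c3/.

## Citation

```
@article{sun2020investigating,
  title={Investigating prior knowledge for challenging chinese machine reading comprehension},
  author={Sun, Kai and Yu, Dian and Yu, Dong and Cardie, Claire},
  journal={Transactions of the Association for Computational Linguistics},
  volume={8},
  pages={141--155},
  year={2020},
  publisher={MIT Press}
}
```

## Download dataset

:modelscope-code[]{type="git"}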
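The model and human figures quoted in the introduction (68.5% and 96.0%) are accuracies on multiple-choice questions. The sketch below shows how such an accuracy could be computed. The `MultipleChoiceItem` layout, its field names, and the `first_choice_baseline` helper are assumptions made for illustration only; they are not the dataset's documented schema or any method from the paper, and the toy items in the usage note are placeholders rather than C3 data.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MultipleChoiceItem:
    """Hypothetical in-memory layout for one question (field names are assumptions)."""
    document: str       # the dialogue or mixed-genre passage
    question: str       # the free-form question about the document
    choices: List[str]  # candidate answers
    answer: str         # gold answer, expected to be one of `choices`


def accuracy(items: Sequence[MultipleChoiceItem], predictions: Sequence[str]) -> float:
    """Return the fraction of items whose predicted choice equals the gold answer."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for item, pred in zip(items, predictions) if pred == item.answer)
    return correct / len(items)


def first_choice_baseline(items: Sequence[MultipleChoiceItem]) -> List[str]:
    """Trivial illustrative baseline: always pick the first listed choice."""
    return [item.choices[0] for item in items]
```

For example, given two placeholder items whose gold answers are the first and the last choice respectively, `accuracy(items, first_choice_baseline(items))` counts one hit out of two.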


Provider:

maas

Created:

2024-07-01
