KETI-AIR/kor_duorc
Source: Hugging Face. Updated: 2023-11-15. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/KETI-AIR/kor_duorc
Summary:
DuoRC is a dataset for question answering and text generation. KETI-AIR/kor_duorc is a monolingual Korean dataset with two subsets, ParaphraseRC and SelfRC, each divided into training, validation, and test splits. Every record has the fields plot_id, plot, title, question_id, question, answers, data_index_by_user, and no_answer. The dataset is released under the MIT license and falls in the 10K to 1M size category.
Provider:
KETI-AIR
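A minimal sketch of how one subset might be loaded with the Hugging Face `datasets` library. It assumes the two subsets are exposed as configurations named `ParaphraseRC` and `SelfRC` with splits `train`, `validation`, and `test`; the listing does not confirm the exact configuration names, and the helper `load_kor_duorc` is hypothetical.

```python
REPO_ID = "KETI-AIR/kor_duorc"

# Assumed configuration and split names, inferred from the subset names in the listing.
SUBSETS = ("ParaphraseRC", "SelfRC")
SPLITS = ("train", "validation", "test")


def load_kor_duorc(subset: str = "SelfRC", split: str = "train"):
    """Load one split of one subset (hypothetical helper).

    Needs `pip install datasets` and network access when actually called.
    """
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}, expected one of {SUBSETS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}, expected one of {SPLITS}")
    # Imported lazily so the argument checks work without the dependency installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, subset, split=split)
```

Calling `load_kor_duorc("ParaphraseRC", "validation")` would then return the validation split of the ParaphraseRC subset, provided the assumed names match the repository.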
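Original information summary
Dataset overview
Basic information
- Language: Korean (ko)
- License: MIT
- Multilinguality: monolingual
- Size category: 10K<n<1M
- Source datasets: original
- Task categories:
  - Question answering (question-answering)
  - Text generation (text2text-generation)
- Task IDs:
  - Abstractive question answering (abstractive-qa)
  - Extractive question answering (extractive-qa)
- Papers with Code ID: duorc
- Pretty name: DuoRC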
Dataset structure
Features
- plot_id: string
- plot: string
- title: string
- question_id: string
- question: string
- answers: sequence of strings (sequence: string)
- data_index_by_user: integer (int32)
- no_answer: boolean (bool)
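The feature list above can be mirrored as a typed record for local checks. This is a sketch only: `DuoRCRecord`, `validate_record`, and the sample values are hypothetical and not part of the dataset or any library.

```python
from typing import List, TypedDict


class DuoRCRecord(TypedDict):
    plot_id: str
    plot: str
    title: str
    question_id: str
    question: str
    answers: List[str]
    data_index_by_user: int  # stored as int32 in the dataset
    no_answer: bool


INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def validate_record(record: dict) -> List[str]:
    """Return a list of type problems; an empty list means the record fits the schema."""
    problems: List[str] = []
    for name in ("plot_id", "plot", "title", "question_id", "question"):
        if not isinstance(record.get(name), str):
            problems.append(f"{name}: expected string")
    answers = record.get("answers")
    if not isinstance(answers, list) or not all(isinstance(a, str) for a in answers):
        problems.append("answers: expected sequence of strings")
    index = record.get("data_index_by_user")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(index, bool) or not isinstance(index, int) or not INT32_MIN <= index <= INT32_MAX:
        problems.append("data_index_by_user: expected int32")
    if not isinstance(record.get("no_answer"), bool):
        problems.append("no_answer: expected bool")
    return problems


# Placeholder values for illustration only, not taken from the dataset.
example: DuoRCRecord = {
    "plot_id": "plot-0",
    "plot": "placeholder plot text",
    "title": "placeholder title",
    "question_id": "question-0",
    "question": "placeholder question",
    "answers": ["placeholder answer"],
    "data_index_by_user": 0,
    "no_answer": False,
}
```

A check like this is mainly useful after custom preprocessing, where a field such as `answers` can silently turn from a list into a single string.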
Data splits
- ParaphraseRC_train:
  - Bytes: 602587402
  - Examples: 69524
- ParaphraseRC_validation:
  - Bytes: 128698408
  - Examples: 15591
- ParaphraseRC_test:
  - Bytes: 140358453
  - Examples: 15857
- SelfRC_train:
  - Bytes: 288548501
  - Examples: 60721
- SelfRC_validation:
  - Bytes: 62171721
  - Examples: 12961
- SelfRC_test:
  - Bytes: 59196683
  - Examples: 12559
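The example counts listed above can be summarized per subset with a few lines of Python. The numbers are copied from the listing; the helper names `subset_total` and `split_fractions` are made up for this sketch.

```python
# Example counts per split, copied from the listing above.
EXAMPLES = {
    "ParaphraseRC": {"train": 69524, "validation": 15591, "test": 15857},
    "SelfRC": {"train": 60721, "validation": 12961, "test": 12559},
}


def subset_total(subset: str) -> int:
    """Total number of examples across all splits of one subset."""
    return sum(EXAMPLES[subset].values())


def split_fractions(subset: str) -> dict:
    """Share of each split within its subset, as a fraction between 0 and 1."""
    total = subset_total(subset)
    return {split: count / total for split, count in EXAMPLES[subset].items()}


for name in EXAMPLES:
    shares = ", ".join(f"{split} {share:.1%}" for split, share in split_fractions(name).items())
    print(f"{name}: {subset_total(name)} examples ({shares})")

print("all subsets:", sum(subset_total(name) for name in EXAMPLES))
```

By these counts ParaphraseRC has 100972 examples and SelfRC has 86241, for 187213 in total, with the training split making up roughly 69% and 70% of each subset respectively.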
Download and dataset size
- Download size: 67341464 bytes
- Dataset size: 1281561168 bytes
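As a quick consistency check, the per-split byte counts can be summed and compared with the reported dataset size, and the download size can be expressed as a fraction of it. The figures are copied from the listing; the variable names are illustrative only.

```python
# Byte counts per split, copied from the listing.
SPLIT_BYTES = {
    "ParaphraseRC_train": 602587402,
    "ParaphraseRC_validation": 128698408,
    "ParaphraseRC_test": 140358453,
    "SelfRC_train": 288548501,
    "SelfRC_validation": 62171721,
    "SelfRC_test": 59196683,
}
DOWNLOAD_BYTES = 67341464
DATASET_BYTES = 1281561168

total_split_bytes = sum(SPLIT_BYTES.values())
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print("sum of split sizes matches dataset size:", total_split_bytes == DATASET_BYTES)
print(f"dataset size: {DATASET_BYTES / 2**30:.2f} GiB")
print(f"download size: {DOWNLOAD_BYTES / 2**20:.1f} MiB ({download_ratio:.1%} of dataset size)")
```

The split sizes add up exactly to the reported dataset size, and the download is a little over 5% of that figure.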



