Webtext2019zh
Source: OpenDataLab. Updated: 2026-05-17. Indexed: 2024-05-09.
Download link:
https://opendatalab.org.cn/OpenDataLab/Webtext2019zh
Resource description:
The community Q&A dataset in JSON format (webtext2019zh) contains about 4.1 million pre-filtered, high-quality questions and answers. Each question belongs to one [topic], and there are 28,000 topics in total, covering a wide range of subjects.
The data was drawn from 14 million original questions and answers by keeping only answers that received at least 3 upvotes, on the assumption that such replies are relatively good or interesting.
Each question has one topic, a question description, and one or more replies. Each reply also carries its upvote count, a reply ID, and tags.
Dataset partitioning: the data was deduplicated and split into three subsets: a training set of 4.12 million samples, a validation set of 68,000 samples, and test set a with 68,000 samples. Test set b is not available for download.
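The record layout and the upvote filter described above can be sketched in Python. This is a minimal illustration, not the dataset's official loader: it assumes the files store one JSON object per line, and the field names `qid`, `topic`, `title`, `star`, and `content` are assumptions that should be checked against the downloaded files. The sample records are invented for the example.

```python
import json
from collections import defaultdict

MIN_UPVOTES = 3  # threshold stated in the dataset description


def parse_records(lines):
    """Yield one dict per non-empty line, assuming one JSON object per line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def group_answers_by_question(records, min_upvotes=MIN_UPVOTES):
    """Collect reply texts per question ID, keeping replies at or above the threshold.

    The keys "qid", "star", and "content" are assumed field names.
    """
    grouped = defaultdict(list)
    for rec in records:
        if rec["star"] >= min_upvotes:
            grouped[rec["qid"]].append(rec["content"])
    return dict(grouped)


# Invented sample lines standing in for a downloaded file.
sample_lines = [
    '{"qid": 1, "topic": "health", "title": "How much water should I drink?", "star": 5, "content": "About two liters a day."}',
    '{"qid": 1, "topic": "health", "title": "How much water should I drink?", "star": 2, "content": "It depends."}',
    '{"qid": 2, "topic": "travel", "title": "Best season to visit Beijing?", "star": 3, "content": "Autumn."}',
    '',
]

grouped = group_answers_by_question(parse_records(sample_lines))
```

Because the published data is already filtered to answers with at least 3 upvotes, the threshold check here mainly illustrates the selection rule; raising `min_upvotes` would give a stricter subset.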
Provider:
OpenDataLab
Created:
2023-03-30
Collected summary
Dataset introduction

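Background and challenges
Background overview
Webtext2019zh is a high-quality Chinese community Q&A dataset containing about 4.1 million pre-filtered questions and answers across 28,000 diverse topics. The data was selected from 14 million original question-answer pairs by keeping answers with at least 3 upvotes as a quality filter, and it is split into training, validation, and test sets. It is suited to natural language processing tasks such as dialogue generation and question answering.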
The content above was collected and summarized by 遇见数据集.



