Webtext2019zh

Name: Webtext2019zh
Creator: OpenDataLab
Published: 2026-05-17 13:30:39
License: No description available

OpenDataLab, updated 2026-05-17, indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/Webtext2019zh

Resource description:

The community Q&A dataset in JSON format (webtext2019zh) contains 4.1 million pre-filtered, high-quality questions and answers. Each question belongs to a [topic]; there are 28,000 topics in total, covering a wide range of subjects. From 14 million original questions and answers, only answers with at least 3 upvotes were selected, on the premise that such replies are relatively good or interesting, which yields a higher-quality dataset. Each question comes with a topic, a question description, and one or more replies; each reply additionally carries its upvote count, a reply ID, and tags. Dataset partitioning: the data is deduplicated and split into three parts. Training set: 4.12 million; validation set: 68,000; test set a: 68,000; test set b is not available for download.
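The upvote filter described above can be illustrated with a minimal sketch. It assumes the data is stored as one JSON object per line and that the upvote count lives in a field named `star`; the field names (`qid`, `title`, `topic`, `star`, `content`, `answer_id`) and the sample records are assumptions for illustration and are not confirmed by this page.

```python
import json
from typing import Iterable, Iterator

# Threshold taken from the description: answers need at least 3 upvotes.
MIN_UPVOTES = 3


def parse_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON-lines text, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.loads(line)


def keep_well_rated(
    records: Iterable[dict],
    min_upvotes: int = MIN_UPVOTES,
    star_field: str = "star",  # assumed field name for the upvote count
) -> Iterator[dict]:
    """Yield only records whose upvote count meets the threshold."""
    for record in records:
        if int(record.get(star_field, 0)) >= min_upvotes:
            yield record


# Hypothetical sample records; real files would be read line by line instead.
sample_lines = [
    '{"qid": 1, "title": "q1", "topic": "t1", "star": 5, "content": "a1", "answer_id": 11}',
    '{"qid": 2, "title": "q2", "topic": "t2", "star": 2, "content": "a2", "answer_id": 12}',
    "",
    '{"qid": 3, "title": "q3", "topic": "t1", "star": 3, "content": "a3", "answer_id": 13}',
]

kept = list(keep_well_rated(parse_records(sample_lines)))
kept_ids = [record["qid"] for record in kept]
print(kept_ids)
```

Because the published data is already pre-filtered, a check like this would mainly serve as a sanity test after download, or as a way to apply a stricter threshold by passing a larger `min_upvotes`.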

Provider:

OpenDataLab

Created:

2023-03-30

Collection summary

Dataset introduction