mteb/reddit-clustering-p2p

Name: mteb/reddit-clustering-p2p
Creator: mteb
Published: 2025-05-04 16:28:35
License: No description available

Source: Hugging Face. Updated: 2025-05-04. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/mteb/reddit-clustering-p2p
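A minimal sketch of fetching the dataset programmatically, assuming the Hugging Face `datasets` library is installed and the repository ID matches the name above. The helper name `load_reddit_clustering` is hypothetical, and no split or column names are assumed, since this page does not list them.

```python
# Hypothetical helper; only the repository ID comes from this page.
REPO_ID = "mteb/reddit-clustering-p2p"


def load_reddit_clustering(split=None):
    """Load the dataset from the Hugging Face Hub.

    With split=None, load_dataset returns a DatasetDict holding every
    available split, so no split name has to be guessed in advance.
    """
    # Imported lazily so that defining this helper does not require the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

Calling `load_reddit_clustering()` and printing the result shows the available splits and columns. To download through the mirror host in the link above rather than the default Hub, the `HF_ENDPOINT` environment variable can be set to the mirror's base URL before the library is imported.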

Resource overview:

RedditClusteringP2P.v2 is a monolingual English dataset created by MTEB for text classification tasks. It consists of titles and posts collected from Reddit, organized for clustering: 10 sets of 50k paragraphs and 40 sets of 10k paragraphs. According to the dataset statistics, it contains 459389 samples totaling 334286895 characters, with text lengths ranging from 79 to 4359. The number of labels per text ranges from 2 to 77908, and there are 440 unique labels in total.

Provider:

mteb

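Summary of original information

Dataset overview

This dataset contains 10 subsets, with the following statistics for each:

Labels: 91, samples: 15592
Labels: 64, samples: 79172
Labels: 38, samples: 1942
Labels: 11, samples: 13224
Labels: 64, samples: 92303
Labels: 87, samples: 28607
Labels: 10, samples: 69146
Labels: 48, samples: 67469
Labels: 64, samples: 29683
Labels: 31, samples: 62261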

The subsets were randomly selected using a script provided in the mteb GitHub repository.
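As a quick sanity check on the per-subset figures, the sketch below tallies the ten sample counts listed above and compares the sum with the 459389 samples quoted in the resource overview. The names `SUBSETS`, `REPORTED_TOTAL_SAMPLES`, and `summarize` are illustrative and not part of any mteb tooling.

```python
# (labels, samples) pairs copied from the subset list on this page.
SUBSETS = [
    (91, 15592),
    (64, 79172),
    (38, 1942),
    (11, 13224),
    (64, 92303),
    (87, 28607),
    (10, 69146),
    (48, 67469),
    (64, 29683),
    (31, 62261),
]

# Total sample count quoted in the resource overview.
REPORTED_TOTAL_SAMPLES = 459389


def summarize(subsets):
    """Return the subset count, summed samples, and largest/smallest subsets."""
    return {
        "subsets": len(subsets),
        "total_samples": sum(samples for _, samples in subsets),
        "largest": max(subsets, key=lambda pair: pair[1]),
        "smallest": min(subsets, key=lambda pair: pair[1]),
    }


summary = summarize(SUBSETS)
print(summary)
print("difference from reported total:",
      summary["total_samples"] - REPORTED_TOTAL_SAMPLES)
```

Any nonzero difference suggests the listing and the overview were generated from slightly different snapshots or counting rules, so the figures are best confirmed against the dataset itself.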
