Elise-hf/reddit_categories_clean

Name: Elise-hf/reddit_categories_clean
Creator: Elise-hf
Published: 2023-07-29 16:29:41
License: No description provided

Source: Hugging Face. Updated 2023-07-29; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Elise-hf/reddit_categories_clean

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: id
    dtype: string
  - name: subreddit
    dtype: string
  - name: title
    dtype: string
  - name: raw_text
    dtype: string
  - name: category
    dtype: string
  - name: subcategory
    dtype: string
  - name: text_word_count
    dtype: int64
  - name: title_word_count
    dtype: int64
  - name: text
    dtype: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 1367599761
    num_examples: 810400
  - name: validation
    num_bytes: 171110350
    num_examples: 101300
  - name: test
    num_bytes: 171511889
    num_examples: 101300
  download_size: 1116585391
  dataset_size: 1710222000
---
# Dataset Card for "reddit_categories_clean"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

Provided by:

Elise-hf

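Summary of original information

Dataset overview

Dataset name

Name: reddit_categories_clean

Dataset configuration

The default configuration (config_name: default) includes the following data files:
- Training set (split: train), path: data/train-*
- Validation set (split: validation), path: data/validation-*
- Test set (split: test), path: data/test-*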
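The configuration above maps three split names to file globs inside the repository. A minimal sketch of loading one split follows, assuming the Hugging Face `datasets` library is installed and the repository is still reachable on the Hub; the helper name `load_split` and the `DATA_FILES` constant are illustrative and not part of the dataset card.

```python
# Split names and file globs copied from the default config above.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}

# Repository id as given in the listing.
REPO_ID = "Elise-hf/reddit_categories_clean"


def load_split(split: str, streaming: bool = True):
    """Load one split from the Hub (hypothetical helper).

    Needs network access and `pip install datasets`.
    """
    if split not in DATA_FILES:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(DATA_FILES)}"
        )
    # Imported lazily so the constants above can be used without the dependency.
    from datasets import load_dataset

    # streaming=True iterates over records without downloading every file first.
    return load_dataset(REPO_ID, split=split, streaming=streaming)
```

With streaming enabled, `next(iter(load_split("train")))` would yield a single record as a dictionary, which is a cheap way to inspect the columns before committing to the full download.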

Dataset features

Feature list:
- id: string
- subreddit: string
- title: string
- raw_text: string
- category: string
- subcategory: string
- text_word_count: integer (int64)
- title_word_count: integer (int64)
- text: string
- __index_level_0__: integer (int64)
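The feature list can be turned into a lightweight schema check for records read from the dataset. This is a sketch only: `EXPECTED_COLUMNS` and `validate_record` are hypothetical names, the mapping from `string`/`int64` to Python `str`/`int` is an assumption about how records are materialized, and the card does not say whether any column may contain nulls.

```python
# Column names and dtypes from the feature list, mapped to Python types.
EXPECTED_COLUMNS = {
    "id": str,
    "subreddit": str,
    "title": str,
    "raw_text": str,
    "category": str,
    "subcategory": str,
    "text_word_count": int,
    "title_word_count": int,
    "text": str,
    "__index_level_0__": int,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for name, expected_type in EXPECTED_COLUMNS.items():
        if name not in record:
            problems.append(f"missing column: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {actual}"
            )
    return problems
```

Returning a list of problems rather than raising on the first mismatch makes it easier to log every issue in a record at once.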

Dataset split information

Training set (train):
- Examples: 810,400
- Size: 1,367,599,761 bytes
Validation set (validation):
- Examples: 101,300
- Size: 171,110,350 bytes
Test set (test):
- Examples: 101,300
- Size: 171,511,889 bytes
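The example counts imply an 80/10/10 split, which the card does not state explicitly. The short calculation below derives it from the listed numbers; the variable names are illustrative.

```python
# Example counts per split, as listed above.
SPLIT_EXAMPLES = {"train": 810_400, "validation": 101_300, "test": 101_300}

total_examples = sum(SPLIT_EXAMPLES.values())

# Fraction of all examples that falls into each split.
split_fractions = {
    name: count / total_examples for name, count in SPLIT_EXAMPLES.items()
}

print(f"total: {total_examples:,}")
for name, fraction in split_fractions.items():
    print(f"{name}: {fraction:.1%}")
```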

Dataset size

Download size: 1,116,585,391 bytes
Total dataset size: 1,710,222,000 bytes
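As a sanity check, the per-split byte counts should add up to the total dataset size, and the download is roughly 65% of that total. The sketch below verifies both from the listed figures. Reading the download size as the size of the stored files and the dataset size as the size after loading is an assumption based on how Hugging Face usually reports these fields, not something this card states.

```python
# Byte counts copied from the listing.
SPLIT_BYTES = {
    "train": 1_367_599_761,
    "validation": 171_110_350,
    "test": 171_511_889,
}
DATASET_SIZE = 1_710_222_000
DOWNLOAD_SIZE = 1_116_585_391
TOTAL_EXAMPLES = 810_400 + 101_300 + 101_300

# The split sizes should sum exactly to the reported total.
splits_match_total = sum(SPLIT_BYTES.values()) == DATASET_SIZE

# Download size relative to the total dataset size.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

# Average size of one example across all splits.
bytes_per_example = DATASET_SIZE / TOTAL_EXAMPLES

print(f"split sizes sum to total: {splits_match_total}")
print(f"download / dataset size: {download_ratio:.1%}")
print(f"average bytes per example: {bytes_per_example:.0f}")
```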
