BEE-spoke-data/stackoverflow-questions-long

Name: BEE-spoke-data/stackoverflow-questions-long
Creator: BEE-spoke-data
Published: 2023-12-29 16:39:02
License: Apache-2.0

Source: Hugging Face. Updated 2023-12-29; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/BEE-spoke-data/stackoverflow-questions-long

Description:

---
license: apache-2.0
size_categories:
- 100K<n<1M
source_datasets: pacovaldez/stackoverflow-questions
task_categories:
- text-classification
- text-generation
dataset_info:
- config_name: default
  features:
  - name: title
    dtype: string
  - name: body
    dtype: string
  - name: label
    dtype: int64
  - name: token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 1082904744
    num_examples: 212663
  - name: validation
    num_bytes: 25509099.6585352
    num_examples: 5000
  - name: test
    num_bytes: 25510304.23774933
    num_examples: 5000
  download_size: 461549130
  dataset_size: 1133924147.8962846
- config_name: original
  features:
  - name: title
    dtype: string
  - name: body
    dtype: string
  - name: label
    dtype: int64
  - name: token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 1082904744
    num_examples: 212663
  - name: validation
    num_bytes: 539369505
    num_examples: 105721
  - name: test
    num_bytes: 1078141988
    num_examples: 211315
  download_size: 1099545678
  dataset_size: 2700416237
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
- config_name: original
  data_files:
  - split: train
    path: original/train-*
  - split: validation
    path: original/validation-*
  - split: test
    path: original/test-*
---

# stackoverflow questions for text classification: 'long'

This is `pacovaldez/stackoverflow-questions` filtered for 1024 GPT2 tokens or more in `title` + `body`

https://huggingface.co/datasets/pacovaldez/stackoverflow-questions

---
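The length filter described in the card can be sketched as a predicate over a single record. This is a minimal sketch under stated assumptions, not the dataset authors' code: the names `is_long`, `count_tokens`, and `whitespace_count` are hypothetical, joining `title` and `body` with a blank line is an assumption (the card only says the count covers `title` + `body`), and the whitespace counter is a stand-in for a real GPT-2 tokenizer.

```python
from typing import Callable, Mapping

MIN_TOKENS = 1024  # threshold stated in the dataset card


def is_long(
    example: Mapping[str, str],
    count_tokens: Callable[[str], int],
    min_tokens: int = MIN_TOKENS,
) -> bool:
    """Return True when title + body reach the token threshold.

    Joining the two fields with a blank line is an assumption; the card
    does not say how the fields were combined before counting.
    """
    text = f"{example['title']}\n\n{example['body']}"
    return count_tokens(text) >= min_tokens


def whitespace_count(text: str) -> int:
    """Stand-in counter for illustration only; this is not GPT-2 tokenization."""
    return len(text.split())


short = {"title": "Why does this fail?", "body": "It throws an error."}
long_ = {"title": "Long question", "body": "word " * 2000}
print(is_long(short, whitespace_count))  # False
print(is_long(long_, whitespace_count))  # True
```

To approximate the card's criterion more closely, `count_tokens` could be backed by a GPT-2 tokenizer, for example `lambda t: len(tokenizer(t)["input_ids"])` with `tokenizer = AutoTokenizer.from_pretrained("gpt2")` from the `transformers` library. Whether that reproduces the published `token_count` values exactly is not confirmed by the card.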

Provider:

BEE-spoke-data

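Summary of original information

Dataset overview

Basic information

License: Apache-2.0
Size category: 100K<n<1M
Source: pacovaldez/stackoverflow-questions
Task categories:
- Text classification
- Text generation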

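Dataset configurations

Default configuration

Config name: default
Features:
- title: string
- body: string
- label: 64-bit integer (int64)
- token_count: 64-bit integer (int64)
Splits:
- train:
  - Bytes: 1082904744
  - Examples: 212663
- validation:
  - Bytes: 25509099.6585352
  - Examples: 5000
- test:
  - Bytes: 25510304.23774933
  - Examples: 5000
Download size (bytes): 461549130
Dataset size (bytes): 1133924147.8962846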
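The feature list above can be turned into a small record check, alongside a loader for one split. This is a sketch, assuming the Hugging Face `datasets` library is installed for the loader; `matches_schema`, `load_split`, and the sample record values are hypothetical and only illustrate the four listed columns.

```python
from typing import Any, Mapping

# Column names from the feature list, mapped to the matching Python types.
EXPECTED_FEATURES = {"title": str, "body": str, "label": int, "token_count": int}


def matches_schema(record: Mapping[str, Any]) -> bool:
    """Check that a record has exactly the listed columns with matching types."""
    if set(record) != set(EXPECTED_FEATURES):
        return False
    for name, expected in EXPECTED_FEATURES.items():
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


def load_split(config_name: str = "default", split: str = "train"):
    """Load one split from the Hub (requires `datasets` and network access)."""
    # Imported lazily so the schema check above has no third-party dependency.
    from datasets import load_dataset

    return load_dataset(
        "BEE-spoke-data/stackoverflow-questions-long", config_name, split=split
    )


# Made-up record used only to exercise the check.
sample = {"title": "Example", "body": "Some text", "label": 0, "token_count": 1500}
print(matches_schema(sample))  # True
```

Calling `load_split("default", "validation")` would be expected to return the 5000-example validation split listed above; pass `"original"` as the config name to get the larger splits.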

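Original configuration

Config name: original
Features:
- title: string
- body: string
- label: 64-bit integer (int64)
- token_count: 64-bit integer (int64)
Splits:
- train:
  - Bytes: 1082904744
  - Examples: 212663
- validation:
  - Bytes: 539369505
  - Examples: 105721
- test:
  - Bytes: 1078141988
  - Examples: 211315
Download size (bytes): 1099545678
Dataset size (bytes): 2700416237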
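The figures in the two configurations can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed above; `total_bytes` and `scaled_bytes` are hypothetical helper names. It checks that each dataset size is the sum of its split sizes, and that the fractional byte counts in the default configuration equal the original split's bytes scaled by 5000 / (original example count). Reading that match as "the default validation and test splits are 5000-example subsamples of the original ones" is an inference from the numbers, not something the card states.

```python
import math

# (num_bytes, num_examples) per split, copied from the listings above.
DEFAULT = {
    "train": (1082904744, 212663),
    "validation": (25509099.6585352, 5000),
    "test": (25510304.23774933, 5000),
}
ORIGINAL = {
    "train": (1082904744, 212663),
    "validation": (539369505, 105721),
    "test": (1078141988, 211315),
}


def total_bytes(splits):
    """Sum the byte counts of all splits in one configuration."""
    return sum(num_bytes for num_bytes, _ in splits.values())


def scaled_bytes(split, sample_size):
    """Scale an original split's byte count down to `sample_size` examples."""
    num_bytes, num_examples = ORIGINAL[split]
    return num_bytes * sample_size / num_examples


# Dataset sizes equal the sum of their split sizes.
print(math.isclose(total_bytes(DEFAULT), 1133924147.8962846))  # True
print(total_bytes(ORIGINAL) == 2700416237)  # True

# Fractional byte counts match a proportional 5000-example estimate.
print(math.isclose(scaled_bytes("validation", 5000), DEFAULT["validation"][0]))  # True
print(math.isclose(scaled_bytes("test", 5000), DEFAULT["test"][0]))  # True
```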

Data files

Default configuration

Data files:
- train: data/train-*
- validation: data/validation-*
- test: data/test-*

Original configuration

Data files:
- train: original/train-*
- validation: original/validation-*
- test: original/test-*
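The glob patterns above decide which configuration and split a repository file belongs to. As a rough illustration, the sketch below matches file paths against those patterns with Python's standard `fnmatch` module. The function name `find_split` and the shard file names are hypothetical; the listing gives only the patterns, not the actual file names, and this is not how the Hub itself resolves them.

```python
from fnmatch import fnmatchcase
from typing import Optional, Tuple

# Patterns copied from the data-file listings above.
DATA_FILES = {
    "default": {
        "train": "data/train-*",
        "validation": "data/validation-*",
        "test": "data/test-*",
    },
    "original": {
        "train": "original/train-*",
        "validation": "original/validation-*",
        "test": "original/test-*",
    },
}


def find_split(path: str) -> Optional[Tuple[str, str]]:
    """Return (config_name, split) for the first pattern that matches `path`."""
    for config_name, splits in DATA_FILES.items():
        for split, pattern in splits.items():
            if fnmatchcase(path, pattern):
                return config_name, split
    return None


# Hypothetical shard names, used only to exercise the patterns.
print(find_split("data/train-00000-of-00003.parquet"))     # ('default', 'train')
print(find_split("original/test-00001-of-00005.parquet"))  # ('original', 'test')
print(find_split("README.md"))                             # None
```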
