BEE-spoke-data/stackoverflow-questions-long
Source: Hugging Face | Updated: 2023-12-29 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/BEE-spoke-data/stackoverflow-questions-long
Resource description:
---
license: apache-2.0
size_categories:
- 100K<n<1M
source_datasets: pacovaldez/stackoverflow-questions
task_categories:
- text-classification
- text-generation
dataset_info:
- config_name: default
  features:
  - name: title
    dtype: string
  - name: body
    dtype: string
  - name: label
    dtype: int64
  - name: token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 1082904744
    num_examples: 212663
  - name: validation
    num_bytes: 25509099.6585352
    num_examples: 5000
  - name: test
    num_bytes: 25510304.23774933
    num_examples: 5000
  download_size: 461549130
  dataset_size: 1133924147.8962846
- config_name: original
  features:
  - name: title
    dtype: string
  - name: body
    dtype: string
  - name: label
    dtype: int64
  - name: token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 1082904744
    num_examples: 212663
  - name: validation
    num_bytes: 539369505
    num_examples: 105721
  - name: test
    num_bytes: 1078141988
    num_examples: 211315
  download_size: 1099545678
  dataset_size: 2700416237
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
- config_name: original
  data_files:
  - split: train
    path: original/train-*
  - split: validation
    path: original/validation-*
  - split: test
    path: original/test-*
---
# stackoverflow questions for text classification: 'long'
This is `pacovaldez/stackoverflow-questions`, filtered to keep only examples with 1024 or more GPT-2 tokens in `title` + `body`.
Source dataset: https://huggingface.co/datasets/pacovaldez/stackoverflow-questions
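
The length filter described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the helper names (`keep_long`, `count_whitespace_tokens`), the newline used to join `title` and `body`, and the idea that `token_count` stores the resulting count are all assumptions. The default counter splits on whitespace only so the sketch runs offline; it is not the GPT-2 tokenizer.

```python
from typing import Callable, Dict, Iterable, List

MIN_TOKENS = 1024  # threshold stated in the dataset card


def count_whitespace_tokens(text: str) -> int:
    """Stand-in counter (NOT GPT-2): counts whitespace-separated pieces."""
    return len(text.split())


def keep_long(
    rows: Iterable[Dict],
    count_tokens: Callable[[str], int] = count_whitespace_tokens,
    min_tokens: int = MIN_TOKENS,
) -> List[Dict]:
    """Keep rows whose title + body reach `min_tokens` tokens."""
    kept = []
    for row in rows:
        # Assumption: title and body are joined with a newline before counting.
        n = count_tokens(row["title"] + "\n" + row["body"])
        if n >= min_tokens:
            # Assumption: the count is stored in a `token_count` column.
            kept.append({**row, "token_count": n})
    return kept
```

To count real GPT-2 tokens, one option is to pass a counter built on the `transformers` library, for example `tok = AutoTokenizer.from_pretrained("gpt2")` and `count_tokens=lambda t: len(tok(t)["input_ids"])`.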
---
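Provider:
BEE-spoke-data
Summary of original information
Dataset overview
Basic information
- License: Apache-2.0
- Size category: 100K<n<1M
- Source: pacovaldez/stackoverflow-questions
- Task categories:
  - Text classification
  - Text generation
Dataset configurations
Default configuration
- Config name: default
- Features:
  - title: string
  - body: string
  - label: 64-bit integer
  - token_count: 64-bit integer
- Splits:
  - train:
    - Bytes: 1082904744
    - Examples: 212663
  - validation:
    - Bytes: 25509099.6585352
    - Examples: 5000
  - test:
    - Bytes: 25510304.23774933
    - Examples: 5000
- Download size (bytes): 461549130
- Dataset size (bytes): 1133924147.8962846
Original configuration
- Config name: original
- Features:
  - title: string
  - body: string
  - label: 64-bit integer
  - token_count: 64-bit integer
- Splits:
  - train:
    - Bytes: 1082904744
    - Examples: 212663
  - validation:
    - Bytes: 539369505
    - Examples: 105721
  - test:
    - Bytes: 1078141988
    - Examples: 211315
- Download size (bytes): 1099545678
- Dataset size (bytes): 2700416237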
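
As a sanity check on the figures above, the sketch below recomputes each configuration's dataset size from its split sizes. It also tests one inference that the card itself does not state: the fractional byte counts in the `default` validation and test splits look like the `original` split sizes scaled down to 5000 examples. The helper names are illustrative only.

```python
import math

# Figures copied from the dataset card metadata.
SPLIT_BYTES = {
    "default": {"train": 1082904744, "validation": 25509099.6585352, "test": 25510304.23774933},
    "original": {"train": 1082904744, "validation": 539369505, "test": 1078141988},
}
NUM_EXAMPLES = {
    "default": {"train": 212663, "validation": 5000, "test": 5000},
    "original": {"train": 212663, "validation": 105721, "test": 211315},
}
DATASET_SIZE = {"default": 1133924147.8962846, "original": 2700416237}


def total_bytes(config: str) -> float:
    """Sum of the per-split byte counts for one configuration."""
    return sum(SPLIT_BYTES[config].values())


def scaled_bytes(split: str) -> float:
    """`original` split bytes scaled by the fraction of examples kept in `default`."""
    kept = NUM_EXAMPLES["default"][split] / NUM_EXAMPLES["original"][split]
    return SPLIT_BYTES["original"][split] * kept


if __name__ == "__main__":
    for config in SPLIT_BYTES:
        print(config, math.isclose(total_bytes(config), DATASET_SIZE[config]))
    for split in ("validation", "test"):
        print(split, math.isclose(scaled_bytes(split), SPLIT_BYTES["default"][split]))
```

If the scaling check holds, the `default` validation and test byte counts are proportional estimates rather than measured sizes, which would explain why they are not whole numbers.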
Data files
Default configuration
- Data files:
  - train: data/train-*
  - validation: data/validation-*
  - test: data/test-*
Original configuration
- Data files:
  - train: original/train-*
  - validation: original/validation-*
  - test: original/test-*
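
The mapping from configuration and split to file pattern can be illustrated with a small sketch. `match_shards` and `load_split` are hypothetical helper names, and the shard file names in the usage note are made up for illustration. `load_split` relies on `datasets.load_dataset`, so it needs the `datasets` package and network access.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

REPO_ID = "BEE-spoke-data/stackoverflow-questions-long"

# Glob patterns copied from the `configs` section of the dataset card.
DATA_FILES: Dict[str, Dict[str, str]] = {
    "default": {
        "train": "data/train-*",
        "validation": "data/validation-*",
        "test": "data/test-*",
    },
    "original": {
        "train": "original/train-*",
        "validation": "original/validation-*",
        "test": "original/test-*",
    },
}


def match_shards(config: str, split: str, repo_files: List[str]) -> List[str]:
    """Return the files in `repo_files` that belong to one config/split."""
    pattern = DATA_FILES[config][split]
    return sorted(f for f in repo_files if fnmatchcase(f, pattern))


def load_split(config: str = "default", split: str = "validation"):
    """Load one split from the Hub (requires `pip install datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset(REPO_ID, config, split=split)
```

For example, given a hypothetical listing such as `["data/train-00000-of-00002.parquet", "original/train-00000-of-00003.parquet"]`, `match_shards("default", "train", files)` returns only the first entry. Calling `load_split("original", "test")` would fetch the full, unsampled test split.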



