cestwc/text_classification
Source: Hugging Face. Updated 2024-02-06; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/cestwc/text_classification
Resource description:
---
dataset_info:
- config_name: ag_news
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': World
          '1': Sports
          '2': Business
          '3': Sci/Tech
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  - name: fasttext
    sequence: float64
  splits:
  - name: train
    num_bytes: 747787977
    num_examples: 127600
  download_size: 717630530
  dataset_size: 747787977
- config_name: amazon_reviews
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  - name: fasttext
    sequence: float64
  splits:
  - name: train
    num_bytes: 1218704865
    num_examples: 210000
  download_size: 1147756545
  dataset_size: 1218704865
- config_name: emotion
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': sadness
          '1': joy
          '2': love
          '3': anger
          '4': fear
          '5': surprise
  - name: fasttext
    sequence: float64
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  splits:
  - name: train
    num_bytes: 114413401
    num_examples: 20000
  download_size: 104458522
  dataset_size: 114413401
- config_name: imdb
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': neg
          '1': pos
  - name: fasttext
    sequence: float64
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  splits:
  - name: train
    num_bytes: 346683508
    num_examples: 50000
  download_size: 344879514
  dataset_size: 346683508
- config_name: multi_nli
  features:
  - name: label
    dtype:
      class_label:
        names:
          '0': entailment
          '1': neutral
          '2': contradiction
  - name: text
    dtype: string
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  - name: fasttext
    sequence: float64
  splits:
  - name: train
    num_bytes: 2389531917
    num_examples: 412349
  download_size: 2243248541
  dataset_size: 2389531917
- config_name: tweet_eval
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': negative
          '1': neutral
          '2': positive
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  - name: fasttext
    sequence: float64
  splits:
  - name: train
    num_bytes: 343075422
    num_examples: 59899
  download_size: 315331899
  dataset_size: 343075422
- config_name: yelp_review_full
  features:
  - name: label
    dtype:
      class_label:
        names:
          '0': 1 star
          '1': 2 star
          '2': 3 stars
          '3': 4 stars
          '4': 5 stars
  - name: text
    dtype: string
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  - name: fasttext
    sequence: float64
  splits:
  - name: train
    num_bytes: 4449129014
    num_examples: 700000
  download_size: 4414593456
  dataset_size: 4449129014
configs:
- config_name: ag_news
  data_files:
  - split: train
    path: ag_news/train-*
- config_name: amazon_reviews
  data_files:
  - split: train
    path: amazon_reviews/train-*
- config_name: emotion
  data_files:
  - split: train
    path: emotion/train-*
- config_name: imdb
  data_files:
  - split: train
    path: imdb/train-*
- config_name: multi_nli
  data_files:
  - split: train
    path: multi_nli/train-*
- config_name: tweet_eval
  data_files:
  - split: train
    path: tweet_eval/train-*
- config_name: yelp_review_full
  data_files:
  - split: train
    path: yelp_review_full/train-*
---
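The metadata above declares seven configs, each with a single `train` split. A minimal sketch of loading one of them with the Hugging Face `datasets` library follows. It assumes `datasets` is installed and the Hub is reachable. The helper name `load_config` and the choice of streaming mode are illustrative and not part of the dataset card.

```python
# Config names copied from the `configs:` list in the metadata above.
CONFIGS = (
    "ag_news",
    "amazon_reviews",
    "emotion",
    "imdb",
    "multi_nli",
    "tweet_eval",
    "yelp_review_full",
)

REPO_ID = "cestwc/text_classification"


def load_config(name, streaming=True):
    """Load the `train` split of one config (hypothetical helper).

    Streaming is the default here only because several configs are
    multiple gigabytes; pass streaming=False to download in full.
    """
    if name not in CONFIGS:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check works even without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split="train", streaming=streaming)


# Example usage (requires network access, so it is left commented out):
# ds = load_config("emotion")
# row = next(iter(ds))
# print(row["text"], row["label"], len(row["glove"]))
```

Each row should expose the `text`, `label`, `glove`, `word2vec` and `fasttext` columns listed in the metadata. To fetch from the mirror given in the download link rather than the default endpoint, setting the `HF_ENDPOINT` environment variable before importing `datasets` is one commonly used approach, though this sketch does not depend on it.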
Provider:
cestwc
Summary of original information
Dataset overview
Dataset configurations
1. AG News
- Features:
  - text: string
  - label: class_label with names World, Sports, Business, Sci/Tech
  - glove: sequence of float64
  - word2vec: sequence of float64
  - fasttext: sequence of float64
- Splits:
  - train: 747787977 bytes, 127600 examples
- Download size: 717630530 bytes
- Dataset size: 747787977 bytes
2. Amazon Reviews
- Features:
  - text: string
  - label: int64
  - glove: sequence of float64
  - word2vec: sequence of float64
  - fasttext: sequence of float64
- Splits:
  - train: 1218704865 bytes, 210000 examples
- Download size: 1147756545 bytes
- Dataset size: 1218704865 bytes
3. Emotion
- Features:
  - text: string
  - label: class_label with names sadness, joy, love, anger, fear, surprise
  - fasttext: sequence of float64
  - glove: sequence of float64
  - word2vec: sequence of float64
- Splits:
  - train: 114413401 bytes, 20000 examples
- Download size: 104458522 bytes
- Dataset size: 114413401 bytes
4. IMDB
- Features:
  - text: string
  - label: class_label with names neg, pos
  - fasttext: sequence of float64
  - glove: sequence of float64
  - word2vec: sequence of float64
- Splits:
  - train: 346683508 bytes, 50000 examples
- Download size: 344879514 bytes
- Dataset size: 346683508 bytes
5. Multi NLI
- Features:
  - label: class_label with names entailment, neutral, contradiction
  - text: string
  - glove: sequence of float64
  - word2vec: sequence of float64
  - fasttext: sequence of float64
- Splits:
  - train: 2389531917 bytes, 412349 examples
- Download size: 2243248541 bytes
- Dataset size: 2389531917 bytes
6. Tweet Eval
- Features:
  - text: string
  - label: class_label with names negative, neutral, positive
  - glove: sequence of float64
  - word2vec: sequence of float64
  - fasttext: sequence of float64
- Splits:
  - train: 343075422 bytes, 59899 examples
- Download size: 315331899 bytes
- Dataset size: 343075422 bytes
7. Yelp Review Full
- Features:
  - label: class_label with names 1 star, 2 star, 3 stars, 4 stars, 5 stars
  - text: string
  - glove: sequence of float64
  - word2vec: sequence of float64
  - fasttext: sequence of float64
- Splits:
  - train: 4449129014 bytes, 700000 examples
- Download size: 4414593456 bytes
- Dataset size: 4449129014 bytes
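The per-config figures above can be totalled to estimate how much disk space and bandwidth a full download would need. The following is a small sketch using only the numbers listed on this page. The names `SIZES`, `summarize` and `bytes_per_example` are illustrative.

```python
# (num_examples, dataset_size in bytes, download_size in bytes) per config,
# copied from the figures listed above.
SIZES = {
    "ag_news": (127600, 747787977, 717630530),
    "amazon_reviews": (210000, 1218704865, 1147756545),
    "emotion": (20000, 114413401, 104458522),
    "imdb": (50000, 346683508, 344879514),
    "multi_nli": (412349, 2389531917, 2243248541),
    "tweet_eval": (59899, 343075422, 315331899),
    "yelp_review_full": (700000, 4449129014, 4414593456),
}


def summarize(sizes):
    """Return (total examples, total dataset bytes, total download bytes)."""
    total_examples = sum(n for n, _, _ in sizes.values())
    total_bytes = sum(b for _, b, _ in sizes.values())
    total_download = sum(d for _, _, d in sizes.values())
    return total_examples, total_bytes, total_download


def bytes_per_example(sizes):
    """Average stored bytes per row for each config."""
    return {name: b / n for name, (n, b, _) in sizes.items()}


for name, avg in sorted(bytes_per_example(SIZES).items()):
    print(f"{name:18s} {avg:10.1f} bytes/example")

examples, stored, download = summarize(SIZES)
print(f"total: {examples} examples, {stored / 1e9:.2f} GB stored, "
      f"{download / 1e9:.2f} GB to download")
```

Across all seven configs this comes to 1,579,848 examples, roughly 9.6 GB stored and 9.3 GB to download. Yelp Review Full alone accounts for nearly half of that.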



