
cestwc/text_classification

Source: Hugging Face. Updated 2024-02-06; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/cestwc/text_classification
Resource description:
---
dataset_info:
- config_name: ag_news
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': World
          '1': Sports
          '2': Business
          '3': Sci/Tech
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  - name: fasttext
    sequence: float64
  splits:
  - name: train
    num_bytes: 747787977
    num_examples: 127600
  download_size: 717630530
  dataset_size: 747787977
- config_name: amazon_reviews
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  - name: fasttext
    sequence: float64
  splits:
  - name: train
    num_bytes: 1218704865
    num_examples: 210000
  download_size: 1147756545
  dataset_size: 1218704865
- config_name: emotion
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': sadness
          '1': joy
          '2': love
          '3': anger
          '4': fear
          '5': surprise
  - name: fasttext
    sequence: float64
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  splits:
  - name: train
    num_bytes: 114413401
    num_examples: 20000
  download_size: 104458522
  dataset_size: 114413401
- config_name: imdb
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': neg
          '1': pos
  - name: fasttext
    sequence: float64
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  splits:
  - name: train
    num_bytes: 346683508
    num_examples: 50000
  download_size: 344879514
  dataset_size: 346683508
- config_name: multi_nli
  features:
  - name: label
    dtype:
      class_label:
        names:
          '0': entailment
          '1': neutral
          '2': contradiction
  - name: text
    dtype: string
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  - name: fasttext
    sequence: float64
  splits:
  - name: train
    num_bytes: 2389531917
    num_examples: 412349
  download_size: 2243248541
  dataset_size: 2389531917
- config_name: tweet_eval
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': negative
          '1': neutral
          '2': positive
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  - name: fasttext
    sequence: float64
  splits:
  - name: train
    num_bytes: 343075422
    num_examples: 59899
  download_size: 315331899
  dataset_size: 343075422
- config_name: yelp_review_full
  features:
  - name: label
    dtype:
      class_label:
        names:
          '0': 1 star
          '1': 2 star
          '2': 3 stars
          '3': 4 stars
          '4': 5 stars
  - name: text
    dtype: string
  - name: glove
    sequence: float64
  - name: word2vec
    sequence: float64
  - name: fasttext
    sequence: float64
  splits:
  - name: train
    num_bytes: 4449129014
    num_examples: 700000
  download_size: 4414593456
  dataset_size: 4449129014
configs:
- config_name: ag_news
  data_files:
  - split: train
    path: ag_news/train-*
- config_name: amazon_reviews
  data_files:
  - split: train
    path: amazon_reviews/train-*
- config_name: emotion
  data_files:
  - split: train
    path: emotion/train-*
- config_name: imdb
  data_files:
  - split: train
    path: imdb/train-*
- config_name: multi_nli
  data_files:
  - split: train
    path: multi_nli/train-*
- config_name: tweet_eval
  data_files:
  - split: train
    path: tweet_eval/train-*
- config_name: yelp_review_full
  data_files:
  - split: train
    path: yelp_review_full/train-*
---
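Each `config_name` in the metadata above would normally be passed as the second argument to `load_dataset` in the Hugging Face `datasets` library. The following is a minimal sketch, assuming the `datasets` package is installed and the repository is reachable. `load_config`, `CONFIG_NAMES` and `REPO_ID` are hypothetical names introduced for illustration, and streaming is used here only to avoid downloading a multi-gigabyte configuration up front.

```python
# Config names copied from the dataset card metadata.
CONFIG_NAMES = (
    "ag_news",
    "amazon_reviews",
    "emotion",
    "imdb",
    "multi_nli",
    "tweet_eval",
    "yelp_review_full",
)

REPO_ID = "cestwc/text_classification"


def load_config(name, streaming=True):
    """Load the train split of one config (hypothetical helper).

    Every config in the card declares only a `train` split.
    """
    if name not in CONFIG_NAMES:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIG_NAMES}")
    # Imported lazily so the name check works even without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split="train", streaming=streaming)
```

Under these assumptions, `next(iter(load_config("emotion")))` would be expected to yield one record with the `text`, `label`, `fasttext`, `glove` and `word2vec` fields declared in the card.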
Provider:
cestwc
Summary of original information

Dataset overview

Dataset configurations

1. AG News

  • Features:
    • text: string
    • label: class_label with names World, Sports, Business, Sci/Tech
    • glove: sequence of float64
    • word2vec: sequence of float64
    • fasttext: sequence of float64
  • Splits:
    • train: 747787977 bytes, 127600 examples
  • Download size: 717630530 bytes
  • Dataset size: 747787977 bytes

2. Amazon Reviews

  • Features:
    • text: string
    • label: int64
    • glove: sequence of float64
    • word2vec: sequence of float64
    • fasttext: sequence of float64
  • Splits:
    • train: 1218704865 bytes, 210000 examples
  • Download size: 1147756545 bytes
  • Dataset size: 1218704865 bytes

3. Emotion

  • Features:
    • text: string
    • label: class_label with names sadness, joy, love, anger, fear, surprise
    • fasttext: sequence of float64
    • glove: sequence of float64
    • word2vec: sequence of float64
  • Splits:
    • train: 114413401 bytes, 20000 examples
  • Download size: 104458522 bytes
  • Dataset size: 114413401 bytes

4. IMDB

  • Features:
    • text: string
    • label: class_label with names neg, pos
    • fasttext: sequence of float64
    • glove: sequence of float64
    • word2vec: sequence of float64
  • Splits:
    • train: 346683508 bytes, 50000 examples
  • Download size: 344879514 bytes
  • Dataset size: 346683508 bytes

5. Multi NLI

  • Features:
    • label: class_label with names entailment, neutral, contradiction
    • text: string
    • glove: sequence of float64
    • word2vec: sequence of float64
    • fasttext: sequence of float64
  • Splits:
    • train: 2389531917 bytes, 412349 examples
  • Download size: 2243248541 bytes
  • Dataset size: 2389531917 bytes

6. Tweet Eval

  • Features:
    • text: string
    • label: class_label with names negative, neutral, positive
    • glove: sequence of float64
    • word2vec: sequence of float64
    • fasttext: sequence of float64
  • Splits:
    • train: 343075422 bytes, 59899 examples
  • Download size: 315331899 bytes
  • Dataset size: 343075422 bytes

7. Yelp Review Full

  • Features:
    • label: class_label with names 1 star, 2 star, 3 stars, 4 stars, 5 stars
    • text: string
    • glove: sequence of float64
    • word2vec: sequence of float64
    • fasttext: sequence of float64
  • Splits:
    • train: 4449129014 bytes, 700000 examples
  • Download size: 4414593456 bytes
  • Dataset size: 4449129014 bytes
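The per-configuration figures above can be aggregated to estimate how much disk space and bandwidth the whole collection needs. The sketch below only does arithmetic on the numbers listed in this page; the names `CONFIGS` and `to_gib` are hypothetical, and the GiB conversion (division by 2**30) is a presentation choice, not something the dataset card specifies.

```python
# (num_examples, download_size, dataset_size) per config, in bytes,
# copied from the listing above.
CONFIGS = {
    "ag_news": (127600, 717630530, 747787977),
    "amazon_reviews": (210000, 1147756545, 1218704865),
    "emotion": (20000, 104458522, 114413401),
    "imdb": (50000, 344879514, 346683508),
    "multi_nli": (412349, 2243248541, 2389531917),
    "tweet_eval": (59899, 315331899, 343075422),
    "yelp_review_full": (700000, 4414593456, 4449129014),
}


def to_gib(num_bytes):
    """Convert a byte count to gibibytes (hypothetical helper)."""
    return num_bytes / 2**30


total_examples = sum(n for n, _, _ in CONFIGS.values())
total_download = sum(d for _, d, _ in CONFIGS.values())
total_dataset = sum(s for _, _, s in CONFIGS.values())

# Largest config by on-disk dataset size.
largest = max(CONFIGS, key=lambda name: CONFIGS[name][2])

print(f"examples: {total_examples}")
print(f"download: {to_gib(total_download):.2f} GiB")
print(f"on disk:  {to_gib(total_dataset):.2f} GiB")
print(f"largest config: {largest}")
```

Because most of the volume sits in `yelp_review_full` and `multi_nli`, loading a single smaller configuration is likely the cheaper way to inspect the schema.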