tanquangduong/emotion-balanced
Hugging Face · Updated 2024-06-03 · Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/tanquangduong/emotion-balanced
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': sadness
          '1': joy
          '2': anger
          '3': fear
  splits:
  - name: train
    num_bytes: 719295
    num_examples: 6644
  - name: validation
    num_bytes: 149899
    num_examples: 1424
  - name: test
    num_bytes: 150803
    num_examples: 1424
  download_size: 602661
  dataset_size: 1019997
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
## Emotion Dataset
This dataset is intended for emotion classification, a sentiment analysis task. It is derived from [dair-ai/emotion](https://huggingface.co/datasets/dair-ai/emotion) with the adaptations described below.
The label distribution of the original train dataset is imbalanced:
```
label_name
surprise 719
love 1641
fear 2373
anger 2709
sadness 5797
joy 6761
```
I created a balanced subset from the original dataset with the following processing steps:
- Selected the four most frequent labels: fear (2373), anger (2709), sadness (5797), joy (6761)
- Down-sampled the three larger classes to the size of the smallest one (fear, 2373 examples) and reassigned the label ids
```
label label_name
0 sadness 2373
1 joy 2373
2 anger 2373
3 fear 2373
```
- Split the down-sampled dataframe into train, validation, and test subsets with a 70:15:15 ratio, which gives:
```
train_ds: 6644
validation_ds: 1424
test_ds: 1424
```
- Added a ClassLabel feature for the label column to the 'train', 'validation', and 'test' Dataset objects
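
The processing steps above can be sketched in plain Python. This is a minimal illustration, not the author's actual code: the card does not say which tools, random seed, or splitting strategy were used, so the function name `balance_and_split`, the seed, and the non-stratified shuffle are all assumptions. The input rows are synthetic placeholders that only reproduce the label counts listed above.

```python
import random
from collections import Counter, defaultdict

# Label names in the id order given by the dataset card (0..3).
LABELS = ["sadness", "joy", "anger", "fear"]


def balance_and_split(rows, seed=42):
    """Down-sample each kept label to the smallest class, then split 70:15:15.

    `rows` is a list of (text, label_name) pairs. Labels not in LABELS are dropped.
    Hypothetical helper: the seed and the plain (non-stratified) shuffle are assumptions.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, name in rows:
        if name in LABELS:
            by_label[name].append(text)

    # Size of the smallest kept class (2373, "fear", for the counts in the card).
    n_min = min(len(by_label[name]) for name in LABELS)

    balanced = []
    for label_id, name in enumerate(LABELS):
        for text in rng.sample(by_label[name], n_min):
            balanced.append({"text": text, "label": label_id})
    rng.shuffle(balanced)

    # 70% train; the remainder is halved into validation and test.
    n_train = int(0.7 * len(balanced))
    n_val = (len(balanced) - n_train) // 2
    train = balanced[:n_train]
    validation = balanced[n_train:n_train + n_val]
    test = balanced[n_train + n_val:]
    return train, validation, test


# Synthetic placeholder rows matching only the label counts from the card.
counts = {"surprise": 719, "love": 1641, "fear": 2373,
          "anger": 2709, "sadness": 5797, "joy": 6761}
rows = [(f"{name} example {i}", name) for name, n in counts.items() for i in range(n)]

train, validation, test = balance_and_split(rows)
print(len(train), len(validation), len(test))  # 6644 1424 1424
print(sorted(Counter(r["label"] for r in train + validation + test).items()))
```

With 4 × 2373 = 9492 balanced examples, 70% rounds down to 6644 and the remaining 2848 split evenly into 1424 and 1424, which matches the split sizes reported above. Because this sketch shuffles without stratifying, the per-split label counts are only approximately equal.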
Provided by:
tanquangduong
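Summary of original information
Dataset overview
Dataset features
- text: text data, of type string.
- label: label data, a class label with the following classes:
  - 0: sadness
  - 1: joy
  - 2: anger
  - 3: fear
Dataset splits
- train: training set, 6644 examples, 719295 bytes in total.
- validation: validation set, 1424 examples, 149899 bytes in total.
- test: test set, 1424 examples, 150803 bytes in total.
Dataset size
- download_size: the download size is 602661 bytes.
- dataset_size: the total dataset size is 1019997 bytes.
Dataset configuration
- config_name: default
- data_files:
  - train: path data/train-*
  - validation: path data/validation-*
  - test: path data/test-*
Dataset processing
- Selected the four most frequent labels from the original dataset: fear (2373), anger (2709), sadness (5797), joy (6761).
- Down-sampled the majority classes to match the minority class and reordered the label ids.
- Split the down-sampled dataset into training, validation, and test sets with a 70:15:15 ratio.
- Added ClassLabel to all Dataset objects for the train, validation, and test datasets.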



