md-nishat-008/Code-Mixed-Sentiment-Analysis-Dataset

Name: md-nishat-008/Code-Mixed-Sentiment-Analysis-Dataset
Creator: md-nishat-008
Published: 2023-10-02 21:27:24
License: no description available

Hugging Face, updated 2023-10-02, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/md-nishat-008/Code-Mixed-Sentiment-Analysis-Dataset

Resource overview:

This dataset is derived from the Amazon Review Dataset, from which 100,000 instances are randomly sampled. The original 1 to 5 star ratings are reclassified into three categories: Positive (rating > 3), Neutral (rating = 3), and Negative (rating < 3), with the sample balanced so that each category has a near-equal number of instances. The code-mixed data is generated with two distinct methods: the random code-mixing algorithm of Krishnan et al. (2021) and the r-CM method of Santy et al. (2021). The category distribution is reported separately for train.csv, dev.csv, and test.csv.

Provider:

md-nishat-008

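Summary of original information

Dataset generation

Base data source: the Amazon Review Dataset, cited as Ni et al. (2019), from which 100,000 instances are randomly sampled.
Label conversion: the original ratings (1 to 5) are reclassified as Positive (rating > 3), Neutral (rating = 3), and Negative (rating < 3), keeping the number of instances per label balanced.
Synthesis methods: two distinct methods are used to generate the synthetic code-mixed dataset, namely the Random Code-mixing Algorithm of Krishnan et al. (2021) and r-CM of Santy et al. (2021).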
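The label conversion and balanced sampling described above can be sketched as follows. This is a minimal illustration, not the dataset authors' code: the helper names `rating_to_label` and `balanced_sample`, the seed, and the toy input are assumptions made for the example, and the code-mixing step itself is not covered.

```python
import random
from collections import Counter


def rating_to_label(rating: int) -> str:
    """Map a 1-5 star rating to a sentiment label using the stated thresholds."""
    if rating > 3:
        return "Positive"
    if rating == 3:
        return "Neutral"
    return "Negative"


def balanced_sample(reviews, per_class, seed=0):
    """Randomly draw `per_class` reviews for each label (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = {"Positive": [], "Neutral": [], "Negative": []}
    for text, rating in reviews:
        label = rating_to_label(rating)
        by_label[label].append((text, label))
    sample = []
    for items in by_label.values():
        # rng.sample raises ValueError if a class has fewer than per_class items.
        sample.extend(rng.sample(items, per_class))
    rng.shuffle(sample)
    return sample


# Toy (review_text, star_rating) pairs standing in for the real reviews.
toy_reviews = [(f"review {i}", (i % 5) + 1) for i in range(50)]
sample = balanced_sample(toy_reviews, per_class=5)
label_counts = Counter(label for _, label in sample)
print(label_counts)
```

Sampling per label after the conversion, rather than sampling first and converting afterwards, is one simple way to obtain equal class sizes, since star ratings in review data are typically skewed.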

Class distribution

For train.csv:

Label	Count	Percentage
Negative	20000	33.33%
Neutral	20000	33.33%
Positive	19999	33.33%

For dev.csv:

Label	Count	Percentage
Neutral	6667	33.34%
Positive	6667	33.34%
Negative	6666	33.33%

For test.csv:

Label	Count	Percentage
Negative	6667	33.34%
Positive	6667	33.34%
Neutral	6666	33.33%
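Per-split counts and percentages like those in the tables above could be recomputed with a short script. The sketch below uses only the Python standard library. The column name `label` is an assumption, since the listing does not state the CSV headers, and the in-memory CSV is a toy stand-in for a real split.

```python
import csv
import io
from collections import Counter


def class_distribution(csv_file, label_column="label"):
    """Return {label: (count, percentage)} for one split.

    `label_column` is a hypothetical column name; adjust it to the real header.
    """
    counts = Counter(row[label_column] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 2)) for label, n in counts.items()}


# Tiny in-memory stand-in for a split such as train.csv.
demo = io.StringIO(
    "text,label\n"
    "good,Positive\n"
    "ok,Neutral\n"
    "bad,Negative\n"
    "fine,Neutral\n"
)
dist = class_distribution(demo)
print(dist)
```

To run it on a downloaded split, open the file with `open("train.csv", newline="", encoding="utf-8")` and pass the file object in place of `demo`.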

Citation

If you use this dataset, please cite the following paper:

@article{raihan2023mixed,
  title={Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi},
  author={Raihan, Md Nishat and Goswami, Dhiman and Mahmud, Antara},
  journal={arXiv preprint arXiv:2309.10272},
  year={2023}
}
