md-nishat-008/Code-Mixed-Sentiment-Analysis-Dataset
Source: Hugging Face. Updated 2023-10-02; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/md-nishat-008/Code-Mixed-Sentiment-Analysis-Dataset
Description:
This dataset is derived from the Amazon Review Dataset. 100,000 instances are randomly sampled, and the original star ratings (1 to 5) are mapped to three labels: Positive (rating > 3), Neutral (rating = 3), and Negative (rating < 3), with the number of instances balanced across the labels. To generate the code-mixed data, two methods are applied: the random code-mixing algorithm of Krishnan et al. (2021) and the r-CM method of Santy et al. (2021). The class distribution is reported separately for train.csv, dev.csv, and test.csv.
Provided by:
md-nishat-008
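Summary of Original Information
Dataset Generation
- Base data source: the Amazon Review Dataset (Ni et al., 2019), from which 100,000 records are randomly sampled.
- Label conversion: the original ratings (1 to 5) are reclassified as Positive (rating > 3), Neutral (rating = 3), and Negative (rating < 3), keeping the number of instances per label balanced.
- Synthesis methods: two different methods are used to generate the synthetic code-mixed dataset, namely the Random Code-mixing Algorithm of Krishnan et al. (2021) and r-CM of Santy et al. (2021).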
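The label conversion and balancing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names `rating_to_label` and `balanced_sample`, the `(text, rating)` input shape, and the fixed seed are all hypothetical, and the code-mixing step itself is not shown.

```python
import random
from collections import defaultdict


def rating_to_label(rating: int) -> str:
    """Map a 1-5 star rating to a label using the thresholds listed above."""
    if rating > 3:
        return "Positive"
    if rating == 3:
        return "Neutral"
    return "Negative"


def balanced_sample(reviews, per_class, seed=0):
    """Draw the same number of reviews for each label.

    `reviews` is an iterable of (text, rating) pairs. This input shape is
    an assumption made for the sketch, not the format of the source data.
    """
    rng = random.Random(seed)

    # Group review texts by their derived label.
    by_label = defaultdict(list)
    for text, rating in reviews:
        by_label[rating_to_label(rating)].append(text)

    sample = []
    for label in ("Negative", "Neutral", "Positive"):
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(f"not enough {label} reviews: {len(pool)} < {per_class}")
        # Sample without replacement so no review appears twice.
        sample.extend((text, label) for text in rng.sample(pool, per_class))

    rng.shuffle(sample)
    return sample
```

Calling `balanced_sample(reviews, per_class=5)` returns 15 `(text, label)` pairs, five per label, provided each label has at least five candidates.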
Class Distribution
For train.csv:
| Label | Count | Percentage |
|---|---|---|
| Negative | 20000 | 33.33% |
| Neutral | 20000 | 33.33% |
| Positive | 19999 | 33.33% |
For dev.csv:
| Label | Count | Percentage |
|---|---|---|
| Neutral | 6667 | 33.34% |
| Positive | 6667 | 33.34% |
| Negative | 6666 | 33.33% |
For test.csv:
| Label | Count | Percentage |
|---|---|---|
| Negative | 6667 | 33.34% |
| Positive | 6667 | 33.34% |
| Neutral | 6666 | 33.33% |
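As a quick sanity check, the percentages in the tables above can be recomputed from the listed counts. The sketch below uses only the Python standard library. The counts are copied from the tables, while the helper names and the `label` column name in `count_labels` are assumptions, since the CSV schema is not described here.

```python
import csv
import io
from collections import Counter

# Counts copied from the three tables above.
SPLITS = {
    "train.csv": {"Negative": 20000, "Neutral": 20000, "Positive": 19999},
    "dev.csv": {"Neutral": 6667, "Positive": 6667, "Negative": 6666},
    "test.csv": {"Negative": 6667, "Positive": 6667, "Neutral": 6666},
}


def label_percentages(counts):
    """Return each label's share of the split as an unrounded percentage."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


def count_labels(rows, label_column="label"):
    """Count labels in an iterable of dict rows, e.g. from csv.DictReader.

    The column name "label" is a hypothetical default; adjust it to the
    actual header of the CSV files.
    """
    return Counter(row[label_column] for row in rows)


if __name__ == "__main__":
    for name, counts in SPLITS.items():
        shares = label_percentages(counts)
        summary = ", ".join(f"{label}: {pct:.2f}%" for label, pct in shares.items())
        print(f"{name} ({sum(counts.values())} rows): {summary}")

    # Tiny in-memory CSV to show how count_labels would be used.
    demo = io.StringIO("text,label\ngood,Positive\nok,Neutral\nbad,Negative\n")
    print(count_labels(csv.DictReader(demo)))
```

By these counts the three splits hold 59,999, 20,000, and 20,000 rows, which sums to 99,999, and every label lies within a fraction of a percentage point of one third in each split.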
Citation
If you use this dataset, please cite the following paper:
    @article{raihan2023mixed,
      title={Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi},
      author={Raihan, Md Nishat and Goswami, Dhiman and Mahmud, Antara},
      journal={arXiv preprint arXiv:2309.10272},
      year={2023}
    }



