NLU-Sentiment-Analysis
ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link: https://modelscope.cn/datasets/aisingapore/NLU-Sentiment-Analysis
Description:
# SEA Sentiment Analysis
SEA Sentiment Analysis evaluates a model's ability to identify the sentiment polarity of a text. It is sampled from [NusaX](https://aclanthology.org/2023.eacl-main.57/) for Indonesian, Javanese, and Sundanese, [IndicSentiment](https://aclanthology.org/2023.acl-long.693) for Tamil, [Wisesight Sentiment](https://doi.org/10.5281/zenodo.3457446) for Thai, and [UIT-VSFC](https://www.researchgate.net/publication/329645066_UIT-VSFC_Vietnamese_Students%27_Feedback_Corpus_for_Sentiment_Analysis) for Vietnamese.
### Supported Tasks and Leaderboards
SEA Sentiment Analysis is designed for evaluating chat or instruction-tuned large language models (LLMs). It is part of the [SEA-HELM](https://leaderboard.sea-lion.ai/) leaderboard from [AI Singapore](https://aisingapore.org/).
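Scoring a chat model on a polarity task usually means mapping a free-text response to a label and comparing it with the gold label. The sketch below illustrates that idea only; it is not the SEA-HELM scoring code. The label set, the `extract_label` and `accuracy` helpers, and the sample responses are all hypothetical, since this card does not list the labels or the metric used.

```python
# Hypothetical label set: the card does not state the labels used per language.
LABELS = ("positive", "negative", "neutral")


def extract_label(response, labels=LABELS):
    """Return the single label named in a model response, or None.

    A response that names no label, or more than one, is treated as
    unparseable rather than guessed at.
    """
    text = response.lower()
    found = [label for label in labels if label in text]
    return found[0] if len(found) == 1 else None


def accuracy(responses, gold_labels):
    """Fraction of responses whose extracted label equals the gold label."""
    if len(responses) != len(gold_labels):
        raise ValueError("responses and gold_labels must have the same length")
    if not responses:
        return 0.0
    correct = sum(
        extract_label(r) == g for r, g in zip(responses, gold_labels)
    )
    return correct / len(responses)


# Made-up responses for illustration only.
responses = [
    "Sentiment: Positive",
    "I think this is negative.",
    "It could be positive or negative.",  # ambiguous, so counted as wrong
]
gold = ["positive", "negative", "neutral"]
print(accuracy(responses, gold))
```

Treating ambiguous responses as wrong keeps the score conservative; a real harness may instead constrain the output format in the prompt.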
### Languages
- Indonesian (id)
- Javanese (jv)
- Sundanese (su)
- Tamil (ta)
- Thai (th)
- Vietnamese (vi)
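The language codes above double as split names, and each language also has a companion `_fewshot` split (see Dataset Details). A small helper, sketched here with the hypothetical names `LANGUAGES` and `split_names`, can enumerate them for use with whichever dataset loader you prefer.

```python
# Language codes and names as listed in this card.
LANGUAGES = {
    "id": "Indonesian",
    "jv": "Javanese",
    "su": "Sundanese",
    "ta": "Tamil",
    "th": "Thai",
    "vi": "Vietnamese",
}


def split_names(lang_code):
    """Return the (evaluation split, few-shot split) names for a language."""
    if lang_code not in LANGUAGES:
        raise KeyError(f"unsupported language code: {lang_code!r}")
    return lang_code, f"{lang_code}_fewshot"


# Flat list of all twelve split names, in the order of the language list.
all_splits = [name for code in LANGUAGES for name in split_names(code)]
print(all_splits)
```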
### Dataset Details
SEA Sentiment Analysis is split by language, with additional splits containing few-shot examples. Below are the statistics for this dataset. The token counts refer only to the strings of text found within the `prompts` column.
| Split | # of examples | # of GPT-4o tokens | # of Gemma 2 tokens | # of Llama 3 tokens |
|-|:-|:-|:-|:-|
| id | 400 | 15131 | 13918 | 19274 |
| jv | 394 | 16731 | 17453 | 20638 |
| su | 394 | 17123 | 18632 | 22056 |
| ta | 1000 | 54038 | 71449 | 211075 |
| th | 1000 | 38252 | 38111 | 4444 |
| vi | 1000 | 16732 | 16307 | 16755 |
| id_fewshot | 5 | 145 | 137 | 175 |
| jv_fewshot | 5 | 201 | 219 | 255 |
| su_fewshot | 5 | 201 | 217 | 253 |
| ta_fewshot | 5 | 192 | 264 | 792 |
| th_fewshot | 5 | 50 | 54 | 63 |
| vi_fewshot | 5 | 87 | 87 | 90 |
| **total** | 4218 | 158883 | 176848 | 335873 |
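The per-split figures can be checked against the total row by summing each column. The sketch below copies the values exactly as printed in the table; `ROWS`, `TOTAL`, and `column_gaps` are names introduced here for illustration. With these values, the example, GPT-4o, and Gemma 2 columns add up exactly, while the Llama 3 column falls 40,003 short of the stated total, which suggests that one of the Llama 3 figures (or the total) may be mis-transcribed.

```python
# split -> (examples, GPT-4o tokens, Gemma 2 tokens, Llama 3 tokens),
# copied as printed in the statistics table.
ROWS = {
    "id": (400, 15131, 13918, 19274),
    "jv": (394, 16731, 17453, 20638),
    "su": (394, 17123, 18632, 22056),
    "ta": (1000, 54038, 71449, 211075),
    "th": (1000, 38252, 38111, 4444),
    "vi": (1000, 16732, 16307, 16755),
    "id_fewshot": (5, 145, 137, 175),
    "jv_fewshot": (5, 201, 219, 255),
    "su_fewshot": (5, 201, 217, 253),
    "ta_fewshot": (5, 192, 264, 792),
    "th_fewshot": (5, 50, 54, 63),
    "vi_fewshot": (5, 87, 87, 90),
}
TOTAL = (4218, 158883, 176848, 335873)
COLUMNS = ("examples", "GPT-4o", "Gemma 2", "Llama 3")


def column_gaps(rows, total):
    """Return stated total minus the column sum, for each column."""
    sums = [sum(column) for column in zip(*rows.values())]
    return {name: t - s for name, s, t in zip(COLUMNS, sums, total)}


gaps = column_gaps(ROWS, TOTAL)
print(gaps)
```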
### Data Sources
| Data Source | License | Language/s | Split/s
|-|:-|:-| :-|
| [NusaX-Senti](https://huggingface.co/datasets/indonlp/NusaX-senti) | [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) | Indonesian, Javanese, Sundanese | id, id_fewshot, jv, jv_fewshot, su, su_fewshot
| [IndicSentiment](https://huggingface.co/datasets/ai4bharat/IndicSentiment) | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | Tamil | ta, ta_fewshot
| [Wisesight Sentiment](https://github.com/PyThaiNLP/wisesight-sentiment) | [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/) | Thai | th, th_fewshot
| [UIT-VSFC](https://huggingface.co/datasets/uitnlp/vietnamese_students_feedback) | - | Vietnamese | vi, vi_fewshot
### License
For the licenses of the source datasets, please refer to the data sources table above.
We endeavor to ensure that the data used is permissible, and we have chosen datasets from creators who have processes to exclude copyrighted or disputed data.
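Because licensing differs by source, it can help to resolve a split name to its source and license before redistributing anything derived from it. The following is a minimal sketch that transcribes the data sources table; `DATA_SOURCES` and `source_for_split` are hypothetical names, and `None` stands for the `-` shown in the table for UIT-VSFC.

```python
# Transcribed from the data sources table in this card.
DATA_SOURCES = {
    "NusaX-Senti": {
        "license": "CC BY-SA 4.0",
        "splits": ("id", "id_fewshot", "jv", "jv_fewshot", "su", "su_fewshot"),
    },
    "IndicSentiment": {
        "license": "CC BY 4.0",
        "splits": ("ta", "ta_fewshot"),
    },
    "Wisesight Sentiment": {
        "license": "CC0 1.0",
        "splits": ("th", "th_fewshot"),
    },
    "UIT-VSFC": {
        "license": None,  # the table lists "-" for this source
        "splits": ("vi", "vi_fewshot"),
    },
}


def source_for_split(split):
    """Return (source name, license) for a split name."""
    for name, info in DATA_SOURCES.items():
        if split in info["splits"]:
            return name, info["license"]
    raise KeyError(f"unknown split: {split!r}")


print(source_for_split("jv_fewshot"))
```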
## Acknowledgement
This project is supported by the National Research Foundation Singapore and Infocomm Media Development Authority (IMDA), Singapore under its National Large Language Model Funding Initiative.
### References
```bibtex
@inproceedings{winata-etal-2023-nusax,
title = "{N}usa{X}: Multilingual Parallel Sentiment Dataset for 10 {I}ndonesian Local Languages",
author = "Winata, Genta Indra and
Aji, Alham Fikri and
Cahyawijaya, Samuel and
Mahendra, Rahmad and
Koto, Fajri and
Romadhony, Ade and
Kurniawan, Kemal and
Moeljadi, David and
Prasojo, Radityo Eko and
Fung, Pascale and
Baldwin, Timothy and
Lau, Jey Han and
Sennrich, Rico and
Ruder, Sebastian",
editor = "Vlachos, Andreas and
Augenstein, Isabelle",
booktitle = "Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.eacl-main.57",
doi = "10.18653/v1/2023.eacl-main.57",
pages = "815--834",
}
@inproceedings{doddapaneni-etal-2023-towards,
title = "Towards Leaving No {I}ndic Language Behind: Building Monolingual Corpora, Benchmark and Models for {I}ndic Languages",
author = "Doddapaneni, Sumanth and
Aralikatte, Rahul and
Ramesh, Gowtham and
Goyal, Shreya and
Khapra, Mitesh M. and
Kunchukuttan, Anoop and
Kumar, Pratyush",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.693",
doi = "10.18653/v1/2023.acl-long.693",
pages = "12402--12426",
}
@misc{Suriyawongkul_PyThaiNLP_Wisesight_Sentiment_Corpus_2020,
author = {Suriyawongkul, Arthit and
Chuangsuwanich, Ekapol and
Chormai, Pattarawat and
Chantarapratin, Nitchakarn and
Prasertsom, Ponrawee and
Sawatphol, Jitkapat and
Yamada, Nozomi and
Rutherford, Attapol and
Polpanumas, Charin and
Udomcharoenchaikit, Can},
doi = {10.5281/zenodo.3457446},
license = {CC0-1.0},
month = nov,
publisher = {Zenodo},
title = {{PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label}},
url = {https://doi.org/10.5281/zenodo.3457446},
version = {v1.1},
year = 2024
}
@InProceedings{8573337,
author={Nguyen, Kiet Van and Nguyen, Vu Duc and Nguyen, Phu X. V. and Truong, Tham T. H. and Nguyen, Ngan Luu-Thuy},
booktitle={2018 10th International Conference on Knowledge and Systems Engineering (KSE)},
title={UIT-VSFC: Vietnamese Students’ Feedback Corpus for Sentiment Analysis},
year={2018},
volume={},
number={},
pages={19-24},
doi={10.1109/KSE.2018.8573337}
}
@misc{leong2023bhasaholisticsoutheastasian,
title={BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models},
author={Wei Qi Leong and Jian Gang Ngui and Yosephine Susanto and Hamsawardhini Rengarajan and Kengatharaiyer Sarveswaran and William Chandra Tjhi},
year={2023},
eprint={2309.06085},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2309.06085},
}
```
Provider: maas
Created: 2025-11-25



