NLR-NLI

ModelScope community, updated 2025-12-05, indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/aisingapore/NLR-NLI
Description:
# SEA Abstractive Summarization

SEA Abstractive Summarization evaluates a model's ability to read a document, identify the key points within, and summarize them into a coherent and fluent text while paraphrasing the document. It is sampled from [IndoNLI](https://aclanthology.org/2021.emnlp-main.821) for Indonesian, [IndicXNLI](https://aclanthology.org/2022.emnlp-main.755/) for Tamil, and [XNLI](https://aclanthology.org/D18-1269/) for Thai and Vietnamese.

### Supported Tasks and Leaderboards

SEA Abstractive Summarization is designed for evaluating chat or instruction-tuned large language models (LLMs). It is part of the [SEA-HELM](https://leaderboard.sea-lion.ai/) leaderboard from [AI Singapore](https://aisingapore.org/).

### Languages

- Indonesian (id)
- Tamil (ta)
- Thai (th)
- Vietnamese (vi)

### Dataset Details

SEA Abstractive Summarization is split by language, with additional splits containing fewshot examples. Below are the statistics for this dataset. The token counts refer only to the strings of text found within the `prompts` column.

| Split | # of examples | # of GPT-4o tokens | # of Gemma 2 tokens | # of Llama 3 tokens |
|-|:-|:-|:-|:-|
| id | 1000 | 48864 | 46813 | 61750 |
| ta | 1000 | 61925 | 83420 | 245601 |
| th | 1000 | 61000 | 57695 | 71124 |
| vi | 1000 | 49181 | 47982 | 48960 |
| id_fewshot | 5 | 209 | 191 | 261 |
| ta_fewshot | 5 | 365 | 507 | 1495 |
| th_fewshot | 5 | 325 | 321 | 362 |
| vi_fewshot | 5 | 260 | 257 | 258 |
| **total** | 4020 | 222129 | 237186 | 429811 |

### Data Sources

| Data Source | License | Language/s | Split/s |
|-|:-|:-|:-|
| [IndoNLI](https://huggingface.co/datasets/afaji/indonli) | [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) | Indonesian | id, id_fewshot |
| [IndicXNLI](https://huggingface.co/datasets/Divyanshu/indicxnli) | [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) | Tamil | ta, ta_fewshot |
| [XNLI](https://huggingface.co/datasets/facebook/xnli) | [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) | Thai, Vietnamese | th, th_fewshot, vi, vi_fewshot |

### License

For the license/s of the dataset/s, please refer to the data sources table above.

We endeavor to ensure data used is permissible and have chosen datasets from creators who have processes to exclude copyrighted or disputed data.

## Acknowledgement

This project is supported by the National Research Foundation Singapore and Infocomm Media Development Authority (IMDA), Singapore under its National Large Language Model Funding Initiative.

### References

```bibtex
@inproceedings{mahendra-etal-2021-indonli,
    title = "{I}ndo{NLI}: A Natural Language Inference Dataset for {I}ndonesian",
    author = "Mahendra, Rahmad and Aji, Alham Fikri and Louvan, Samuel and Rahman, Fahrurrozi and Vania, Clara",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.821",
    pages = "10511--10527",
}

@misc{aggarwal2022indicxnlievaluatingmultilingualinference,
    title={IndicXNLI: Evaluating Multilingual Inference for Indian Languages},
    author={Divyanshu Aggarwal and Vivek Gupta and Anoop Kunchukuttan},
    year={2022},
    eprint={2204.08776},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2204.08776},
}

@InProceedings{conneau2018xnli,
    author = {Conneau, Alexis and Rinott, Ruty and Lample, Guillaume and Williams, Adina and Bowman, Samuel R. and Schwenk, Holger and Stoyanov, Veselin},
    title = {XNLI: Evaluating Cross-lingual Sentence Representations},
    booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
    year = {2018},
    publisher = {Association for Computational Linguistics},
    location = {Brussels, Belgium},
}

@misc{leong2023bhasaholisticsoutheastasian,
    title={BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models},
    author={Wei Qi Leong and Jian Gang Ngui and Yosephine Susanto and Hamsawardhini Rengarajan and Kengatharaiyer Sarveswaran and William Chandra Tjhi},
    year={2023},
    eprint={2309.06085},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2309.06085},
}
```
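
The per-split figures in the Dataset Details table can be cross-checked against its **total** row with a few lines of Python. The sketch below is only an illustration: the numbers are copied from that table, while the names `SPLIT_STATS`, `REPORTED_TOTAL`, `column_totals`, and `mean_tokens_per_example` are hypothetical helpers introduced here and are not part of the dataset or of any library.

```python
# Per-split statistics copied from the Dataset Details table.
# Layout (an assumption of this sketch):
#   split -> (examples, GPT-4o tokens, Gemma 2 tokens, Llama 3 tokens)
SPLIT_STATS = {
    "id": (1000, 48864, 46813, 61750),
    "ta": (1000, 61925, 83420, 245601),
    "th": (1000, 61000, 57695, 71124),
    "vi": (1000, 49181, 47982, 48960),
    "id_fewshot": (5, 209, 191, 261),
    "ta_fewshot": (5, 365, 507, 1495),
    "th_fewshot": (5, 325, 321, 362),
    "vi_fewshot": (5, 260, 257, 258),
}

# The "total" row as reported in the table.
REPORTED_TOTAL = (4020, 222129, 237186, 429811)

TOKENIZERS = ("GPT-4o", "Gemma 2", "Llama 3")


def column_totals(stats):
    """Sum each column (examples and the three token counts) over all splits."""
    return tuple(sum(column) for column in zip(*stats.values()))


def mean_tokens_per_example(stats, split):
    """Return the average prompt length per example for each tokenizer."""
    examples, *token_counts = stats[split]
    return {name: count / examples for name, count in zip(TOKENIZERS, token_counts)}


if __name__ == "__main__":
    # The per-split rows should add up to the reported total row.
    assert column_totals(SPLIT_STATS) == REPORTED_TOTAL
    for split in ("id", "ta", "th", "vi"):
        print(split, mean_tokens_per_example(SPLIT_STATS, split))
```

The per-example averages make the tokenizer differences easier to see than the raw totals: for the `ta` split, the Llama 3 count is roughly four times the GPT-4o count, which matters when budgeting context length for Tamil prompts.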

Provider:
maas
Created:
2025-11-25