
NLG-Abstractive-Summarization

Source: ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/aisingapore/NLG-Abstractive-Summarization
Resource description:
# SEA Abstractive Summarization

SEA Abstractive Summarization evaluates a model's ability to read a document, identify its key points, and summarize them into a coherent and fluent text while paraphrasing the document. It is sampled from [XL-Sum](https://aclanthology.org/2021.findings-acl.413/) for Indonesian, Tamil, Thai, and Vietnamese.

### Supported Tasks and Leaderboards

SEA Abstractive Summarization is designed for evaluating chat or instruction-tuned large language models (LLMs). It is part of the [SEA-HELM](https://leaderboard.sea-lion.ai/) leaderboard from [AI Singapore](https://aisingapore.org/).

### Languages

- Indonesian (id)
- Tamil (ta)
- Thai (th)
- Vietnamese (vi)

### Dataset Details

SEA Abstractive Summarization is split by language, with additional splits containing fewshot examples. The statistics for this dataset are given below. The token counts refer only to the strings of text found within the `prompts` column.

| Split | # of examples | # of GPT-4o tokens | # of Gemma 2 tokens | # of Llama 3 tokens |
|-|:-|:-|:-|:-|
| id | 100 | 61628 | 55485 | 77016 |
| ta | 100 | 114275 | 156476 | 457559 |
| th | 100 | 155203 | 151988 | 176985 |
| vi | 100 | 86305 | 78285 | 82269 |
| id_fewshot | 5 | 1124 | 1050 | 1430 |
| ta_fewshot | 5 | 964 | 1339 | 3905 |
| th_fewshot | 5 | 925 | 869 | 1062 |
| vi_fewshot | 5 | 2396 | 2170 | 2282 |
| **total** | 420 | 422820 | 447662 | 802508 |

### Data Sources

| Data Source | License | Language/s | Split/s |
|-|:-|:-|:-|
| [XL-Sum](https://huggingface.co/datasets/csebuetnlp/xlsum) | [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) | Indonesian, Tamil, Thai, Vietnamese | id, id_fewshot, ta, ta_fewshot, th, th_fewshot, vi, vi_fewshot |

### License

For the license/s of the dataset/s, please refer to the data sources table above.

We endeavor to ensure that the data used is permissible and have chosen datasets from creators who have processes to exclude copyrighted or disputed data.

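### Example: checking the token statistics

The figures in the Dataset Details table can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers are copied from that table, while the names `STATS`, `TOKENIZERS`, `column_totals`, and `mean_tokens_per_example` are hypothetical helpers introduced here and are not part of the dataset or of any library.

```python
# Token statistics copied from the Dataset Details table.
# Each row: (examples, GPT-4o tokens, Gemma 2 tokens, Llama 3 tokens).
# STATS and the helper names below are illustrative, not part of the dataset.
STATS = {
    "id": (100, 61628, 55485, 77016),
    "ta": (100, 114275, 156476, 457559),
    "th": (100, 155203, 151988, 176985),
    "vi": (100, 86305, 78285, 82269),
    "id_fewshot": (5, 1124, 1050, 1430),
    "ta_fewshot": (5, 964, 1339, 3905),
    "th_fewshot": (5, 925, 869, 1062),
    "vi_fewshot": (5, 2396, 2170, 2282),
}
TOKENIZERS = ("GPT-4o", "Gemma 2", "Llama 3")


def column_totals(stats):
    """Sum each column over all splits; should match the table's total row."""
    return tuple(sum(row[i] for row in stats.values()) for i in range(4))


def mean_tokens_per_example(stats, split):
    """Average prompt length of one split under each tokenizer."""
    examples, *tokens = stats[split]
    return {name: count / examples for name, count in zip(TOKENIZERS, tokens)}


if __name__ == "__main__":
    print("totals:", column_totals(STATS))
    for split in ("id", "ta", "th", "vi"):
        means = mean_tokens_per_example(STATS, split)
        ratio = means["Llama 3"] / means["GPT-4o"]
        print(f"{split}: Llama 3 / GPT-4o token ratio = {ratio:.2f}")
```

The column sums reproduce the table's total row. The per-split ratios also show how much the prompt budget depends on the tokenizer: for the `ta` split, the Llama 3 count is roughly four times the GPT-4o count.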
## Acknowledgement

This project is supported by the National Research Foundation Singapore and Infocomm Media Development Authority (IMDA), Singapore under its National Large Language Model Funding Initiative.

### References

```bibtex
@inproceedings{hasan-etal-2021-xl,
    title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
    author = "Hasan, Tahmid and Bhattacharjee, Abhik and Islam, Md. Saiful and Mubasshir, Kazi and Li, Yuan-Fang and Kang, Yong-Bin and Rahman, M. Sohel and Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.413",
    pages = "4693--4703",
}

@misc{leong2023bhasaholisticsoutheastasian,
    title={BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models},
    author={Wei Qi Leong and Jian Gang Ngui and Yosephine Susanto and Hamsawardhini Rengarajan and Kengatharaiyer Sarveswaran and William Chandra Tjhi},
    year={2023},
    eprint={2309.06085},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2309.06085},
}
```

Provider: maas
Created: 2025-11-25