NLU-Question-Answering

ModelScope Community. Updated: 2025-12-05. Indexed: 2025-12-06.
Download link:
https://modelscope.cn/datasets/aisingapore/NLU-Question-Answering
Resource description:
# SEA Question Answering

SEA Question Answering evaluates a model's ability to predict a contiguous span of characters that answers a question about a given passage. It is sampled from [TyDi QA-GoldP](https://aclanthology.org/2020.tacl-1.30/) for Indonesian, [IndicQA](https://aclanthology.org/2023.acl-long.693) for Tamil, and [XQuAD](https://aclanthology.org/2020.acl-main.421) for Thai and Vietnamese.

### Supported Tasks and Leaderboards

SEA Question Answering is designed for evaluating chat or instruction-tuned large language models (LLMs). It is part of the [SEA-HELM](https://leaderboard.sea-lion.ai/) leaderboard from [AI Singapore](https://aisingapore.org/).

### Languages

- Indonesian (id)
- Tamil (ta)
- Thai (th)
- Vietnamese (vi)

### Dataset Details

SEA Question Answering is split by language, with additional splits containing few-shot examples. The statistics for this dataset are given below. The token counts refer only to the strings of text found in the `prompts` column.

| Split | # of examples | # of GPT-4o tokens | # of Gemma 2 tokens | # of Llama 3 tokens |
|-|:-|:-|:-|:-|
| id | 100 | 16000 | 15099 | 19380 |
| ta | 100 | 83356 | 110080 | 314181 |
| th | 100 | 33266 | 33052 | 37164 |
| vi | 100 | 25064 | 24086 | 23722 |
| id_fewshot | 5 | 372 | 375 | 466 |
| ta_fewshot | 5 | 2459 | 3260 | 9165 |
| th_fewshot | 5 | 781 | 885 | 926 |
| vi_fewshot | 5 | 574 | 550 | 548 |
| **total** | 420 | 161872 | 187387 | 405552 |

### Data Sources

| Data Source | License | Language/s | Split/s |
|-|:-|:-|:-|
| [TyDi QA-GoldP](https://huggingface.co/datasets/google-research-datasets/tydiqa) | [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) | Indonesian | id, id_fewshot |
| [IndicQA](https://huggingface.co/datasets/ai4bharat/IndicQA) | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | Tamil | ta, ta_fewshot |
| [XQuAD](https://github.com/google-deepmind/xquad) | [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) | Thai, Vietnamese | th, th_fewshot, vi, vi_fewshot |

### License

For the license/s of the dataset/s, please refer to the data sources table above.

We endeavor to ensure that the data used is permissible and have chosen datasets from creators who have processes to exclude copyrighted or disputed data.

## Acknowledgement

This project is supported by the National Research Foundation Singapore and Infocomm Media Development Authority (IMDA), Singapore under its National Large Language Model Funding Initiative.

### References

```bibtex
@article{tydiqa,
  title   = {TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages},
  author  = {Jonathan H. Clark and Eunsol Choi and Michael Collins and Dan Garrette and Tom Kwiatkowski and Vitaly Nikolaev and Jennimaria Palomaki},
  year    = {2020},
  journal = {Transactions of the Association for Computational Linguistics}
}

@inproceedings{doddapaneni-etal-2023-towards,
  title     = "Towards Leaving No {I}ndic Language Behind: Building Monolingual Corpora, Benchmark and Models for {I}ndic Languages",
  author    = "Doddapaneni, Sumanth and Aralikatte, Rahul and Ramesh, Gowtham and Goyal, Shreya and Khapra, Mitesh M. and Kunchukuttan, Anoop and Kumar, Pratyush",
  editor    = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki",
  booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month     = jul,
  year      = "2023",
  address   = "Toronto, Canada",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2023.acl-long.693",
  doi       = "10.18653/v1/2023.acl-long.693",
  pages     = "12402--12426",
}

@inproceedings{artetxe-etal-2020-cross,
  title     = "On the Cross-lingual Transferability of Monolingual Representations",
  author    = "Artetxe, Mikel and Ruder, Sebastian and Yogatama, Dani",
  editor    = "Jurafsky, Dan and Chai, Joyce and Schluter, Natalie and Tetreault, Joel",
  booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
  month     = jul,
  year      = "2020",
  address   = "Online",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2020.acl-main.421",
  doi       = "10.18653/v1/2020.acl-main.421",
  pages     = "4623--4637",
}

@misc{leong2023bhasaholisticsoutheastasian,
  title         = {BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models},
  author        = {Wei Qi Leong and Jian Gang Ngui and Yosephine Susanto and Hamsawardhini Rengarajan and Kengatharaiyer Sarveswaran and William Chandra Tjhi},
  year          = {2023},
  eprint        = {2309.06085},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2309.06085},
}
```
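
### Example: Scoring a Predicted Span

The task described in this card asks a model to return a contiguous span of the passage. The sketch below shows one way such a prediction could be compared against a reference answer, using SQuAD-style exact match and overlap F1. This is an assumption for illustration only: the card does not state which metric SEA-HELM applies, the function names `exact_match` and `overlap_f1` are hypothetical, and the passage and answers are invented rather than taken from the dataset. Whitespace tokenization does not suit Thai, so the tokenizer is a parameter and a character-level one (`list`) can be passed instead.

```python
from collections import Counter
from typing import Callable, List


def exact_match(prediction: str, reference: str) -> bool:
    """Return True when the two spans are identical after trimming whitespace."""
    return prediction.strip() == reference.strip()


def overlap_f1(
    prediction: str,
    reference: str,
    tokenize: Callable[[str], List[str]] = str.split,
) -> float:
    """Harmonic mean of precision and recall over shared tokens (multiset overlap)."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Two empty spans count as a match; exactly one empty span is a miss.
        return float(pred_tokens == ref_tokens)
    shared = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(pred_tokens)
    recall = shared / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented example, not a row from the dataset.
passage = "Jakarta is the capital of Indonesia."
reference = "capital of Indonesia"
prediction = "the capital of Indonesia"

# A valid answer is a contiguous substring of the passage.
print(prediction in passage)
print(exact_match(prediction, reference))
print(overlap_f1(prediction, reference))                  # whitespace tokens
print(overlap_f1(prediction, reference, tokenize=list))   # character-level
```

Exact match is strict about span boundaries, while overlap F1 gives partial credit when the predicted span contains or overlaps the reference, which is why the two are often reported together for extractive question answering.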

Provider:
maas
Created:
2025-11-25