NusaWrites
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/IndoNLP/nusa-writes
Description:
NusaWrites is a dataset covering 12 underrepresented and extremely low-resource languages spoken in Indonesia. Its paragraphs were written by native speakers to ensure lexical diversity and high-quality cultural content. The dataset aims to improve access to NLP technology for low-resource languages and serves as a benchmark for evaluating language models on natural language understanding (NLU) and natural language generation (NLG) tasks.
Provider:
IndoNLP



