VSMRC
Source: arXiv. Indexed 2025-09-30.
Download link:
https://huggingface.co/datasets/vietgpt/wikipedia_vi
Description:
VSMRC is the Vietnamese Text Segmentation and Multiple-choice Reading Comprehension Dataset, built from Vietnamese Wikipedia. It covers two tasks: text segmentation, with 15,942 documents, and reading comprehension, with 16,347 synthetic multiple-choice question-answer pairs. The question-answer pairs were generated using multiple large language models and passed through human quality assurance. Quality control measures include expert validation.
Provider:
Hugging Face



