anhndbk/ViWikiBench
Source: Hugging Face. Updated: 2026-04-24. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/anhndbk/ViWikiBench
Description:
ViWiki-Bench is a Vietnamese benchmark dataset for measuring how much quantization degrades the quality of large language models (LLMs) on Vietnamese text. It is intended as the Vietnamese counterpart to WikiText-2: it follows the same continuous-stream methodology, so it can serve as a drop-in replacement in existing evaluation pipelines. The data comes from the full Vietnamese Wikipedia dump of November 2023 and is cleaned by removing wiki markup, resolving links, applying Unicode NFC normalization, removing section headers, and normalizing whitespace. It is divided into non-overlapping train, validation, and test splits, shuffled with a fixed seed for reproducibility. Existing quantization benchmarks are English-only; ViWiki-Bench provides a Vietnamese-native ground truth, so results better reflect how quantized models actually perform on Vietnamese text.
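As a rough illustration of the description above, the following Python sketch shows what some of the listed cleaning steps and a WikiText-2-style continuous-stream evaluation might look like. It is not the dataset author's pipeline. The function names, the regular expressions, the choice to collapse all whitespace into single spaces, and the choice to drop a trailing partial window are assumptions made for the example; general wiki markup removal is not covered.

```python
import math
import re
import unicodedata

# Assumed patterns, for illustration only (not taken from the dataset's code).
# A line that is only a MediaWiki-style heading, e.g. "== History ==".
_HEADING = re.compile(r"^\s*=+[^=\n]+=+\s*$", re.MULTILINE)
# [[target|label]] -> label, [[target]] -> target.
_WIKILINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def clean_text(raw: str) -> str:
    """Apply a simplified version of the cleaning steps named in the description."""
    text = _WIKILINK.sub(r"\1", raw)            # resolve links to their visible text
    text = unicodedata.normalize("NFC", text)   # compose base letter + combining marks
    text = _HEADING.sub("", text)               # drop section-heading lines
    text = re.sub(r"\s+", " ", text).strip()    # assumption: collapse all whitespace
    return text


def stream_windows(token_ids, window):
    """Cut one long token stream into consecutive, non-overlapping windows.

    Assumption: a trailing partial window is dropped, as is common in
    WikiText-2 perplexity scripts.
    """
    n_full = len(token_ids) // window
    return [token_ids[i * window:(i + 1) * window] for i in range(n_full)]


def perplexity(total_nll: float, n_tokens: int) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(total_nll / n_tokens)
```

In a real evaluation, the token ids would come from the model's own tokenizer applied to the concatenated test split, and the negative log-likelihoods from a forward pass over each window. Comparing the perplexity of a quantized model with that of its full-precision baseline on the same windows gives the degradation figure.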
Provider:
anhndbk



