mesolitica/Malaysian-Normalizer
Source: Hugging Face | Updated: 2025-07-08 | Indexed: 2025-07-05
Download link:
https://hf-mirror.com/datasets/mesolitica/Malaysian-Normalizer
Description:
This dataset is a diverse collection of Malay and English text, comprising text, result, and source information. It is divided into multiple parts, such as act_en_news and no_index_en_news, each containing a different number of text examples. The dataset also includes the Malaysian Normalizer tool for text normalization and the Pseudolabel tool, which generates pseudo-labels using the Qwen3-8B model.
Provider:
mesolitica



