rinnieyoung/sea-javanese-cleaned-parquet-v1
Source: Hugging Face | Updated: 2026-04-28 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/rinnieyoung/sea-javanese-cleaned-parquet-v1
Description:
This dataset is a cleaned Javanese pretraining corpus exported in Hugging Face parquet format. The public sources used in this release include HuggingFaceFW/fineweb-2, allenai/c4, and afrizalha/Centhini-1-Javanese. Processing consisted of basic text cleaning, short-text filtering, repetition filtering, rule-based noise filtering, and document-level exact deduplication. The release contains the cleaned, merged training corpus, a source-level filtering summary, and a source inventory. The run summary reports 734,489 records and a keep rate of 56.69%. Evaluation metrics include data-level metrics, deduplication metrics, and manual QC metrics.
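The processing steps named above can be illustrated with a minimal sketch. This is not the dataset author's pipeline: the thresholds (`MIN_CHARS`, `MAX_REPEAT_RATIO`, `MAX_NOISE_RATIO`), the noise patterns, and all function names are hypothetical and chosen only to show how cleaning, filtering, exact deduplication, and a keep rate fit together.

```python
import hashlib
import re
import unicodedata

# All thresholds and patterns below are illustrative assumptions,
# not values taken from the dataset card.
MIN_CHARS = 20
MAX_REPEAT_RATIO = 0.5
MAX_NOISE_RATIO = 0.3
NOISE_PATTERNS = [re.compile(r"https?://\S+"), re.compile(r"<[^>]+>")]


def clean_text(text):
    """Basic cleaning: Unicode NFC normalization and whitespace collapsing."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def is_too_short(text):
    """Short-text filter: drop documents below a character threshold."""
    return len(text) < MIN_CHARS


def is_repetitive(text):
    """Repetition filter: drop documents dominated by one repeated token."""
    words = text.split()
    if not words:
        return True
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > MAX_REPEAT_RATIO


def is_noisy(text):
    """Rule-based noise filter: drop documents that are mostly URLs or markup."""
    if not text:
        return True
    noise = sum(len(m.group(0)) for p in NOISE_PATTERNS for m in p.finditer(text))
    return noise / len(text) > MAX_NOISE_RATIO


def run_pipeline(docs):
    """Clean, filter, and exactly deduplicate; return (kept_docs, keep_rate)."""
    seen = set()
    kept = []
    for raw in docs:
        text = clean_text(raw)
        if is_too_short(text) or is_repetitive(text) or is_noisy(text):
            continue
        # Document-level exact dedup: hash the full cleaned text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    keep_rate = len(kept) / len(docs) if docs else 0.0
    return kept, keep_rate


if __name__ == "__main__":
    sample = [
        "Aku arep lunga menyang pasar esuk iki.",
        "Aku  arep lunga   menyang pasar esuk iki.",  # duplicate after cleaning
        "Sugeng",                                      # too short
        "iya iya iya iya iya iya iya iya",             # repetitive
        "Bocah-bocah padha dolanan ing latar omah.",
    ]
    kept, rate = run_pipeline(sample)
    print(f"kept {len(kept)} of {len(sample)} documents, keep rate {rate:.2%}")
```

Under the same definition of keep rate (kept documents divided by input documents), and assuming the 734,489 figure is the kept count, the reported 56.69% would imply roughly 1.3 million input documents; the listing itself does not state the input size.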
Provider:
rinnieyoung



