khalidrizki/indonesian-wiki-chunked-180tok
Source: Hugging Face. Updated: 2025-09-19. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/khalidrizki/indonesian-wiki-chunked-180tok
Description:
The mr-tydi-indonesian-chunked-180tok dataset is a preprocessed version of the Indonesian portion of mr-tydi-corpus. Preprocessing consists of three steps: text normalization (removing Parsoid markup, HTML tags, and similar residue); word-aware chunking with the google/flan-t5-base tokenizer, where each chunk holds at most 180 tokens, no word is cut in the middle, and a final chunk shorter than 50 tokens is merged into the preceding chunk; and generation of a new docid format. The original dataset contains approximately 1.47 million documents; after processing it contains about 1.52 million, with each chunk between 50 and 180 tokens long.
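The word-aware chunking step described above can be illustrated with a minimal sketch. This is not the dataset author's code: the function name `chunk_words`, its parameters, and the `count_tokens` callable are hypothetical, and token counts are summed per word, which only approximates tokenizing the joined text. The description does not say how a merged final chunk stays within 180 tokens; in this sketch a merged chunk can exceed that limit.

```python
from typing import Callable, List


def chunk_words(
    text: str,
    count_tokens: Callable[[str], int],  # hypothetical: returns token count of one word
    max_tokens: int = 180,
    min_tokens: int = 50,
) -> List[str]:
    """Split text into chunks of whole words, closing a chunk at max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    for word in text.split():
        n = count_tokens(word)
        # Close the current chunk before this word would push it over budget.
        # A single word longer than max_tokens still gets a chunk of its own.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(word)
        current_len += n

    if current:
        tail = " ".join(current)
        # A short trailing chunk is merged into the previous chunk
        # (assumption: the merged chunk is not re-split in this sketch).
        if chunks and current_len < min_tokens:
            chunks[-1] = chunks[-1] + " " + tail
        else:
            chunks.append(tail)

    return chunks
```

To approximate the setup in the description, `count_tokens` could wrap the google/flan-t5-base tokenizer from the `transformers` library, for example `lambda w: len(tokenizer.encode(w, add_special_tokens=False))` with `tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")`. Whether the original pipeline counts tokens per word or per chunk is not stated.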
Provider:
khalidrizki



