textcleanlm/textclean-1M
Source: Hugging Face. Updated 2025-07-18; indexed 2025-09-13.
Download link:
https://hf-mirror.com/datasets/textcleanlm/textclean-1M
Description:
TextClean-Corpus-1M is a preprocessed web-text dataset of 1 million tokens, designed to reduce computational cost in downstream applications. It was created by using OpenAI's o4-mini model to remove irrelevant elements such as navigation links and advertisements while preserving the core information. The data comes from a random sample of EssentialWeb 1.0 and is converted to Markdown format.
Provider:
textcleanlm



