Web3Survivor/Survivor
Hugging Face | Updated 2025-12-11 | Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Web3Survivor/Survivor
Resource description:
The FinePDFs-Edu dataset contains more than 350B tokens of educational content from PDFs, filtered from the FinePDFs dataset and covering 69 languages. The filtering used an educational quality classifier built on annotations generated by Qwen3-235B-A22B-Instruct-2507. The dataset is globally deduplicated and outperforms FinePDFs on popular benchmarks. It is released under the ODC-By license.
Provided by:
Web3Survivor



