finepdfs-edu

ModelScope community · updated 2026-01-09 · indexed 2025-11-15
Download link:
https://modelscope.cn/datasets/HuggingFaceFW/finepdfs-edu
Resource description:
# 📚 FinePDFs-Edu

![FinePDFs](https://cdn-uploads.huggingface.co/production/uploads/626ede24d2fa9e7d598c8709/dgGeCo6yfZvThn-Fc6Q8k.png)

> 350B+ highly educational tokens from PDFs 📄

## What is it?

The 📚 FinePDFs-Edu dataset consists of **350B+ tokens** of educational PDFs filtered from the 📄 [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) dataset, covering 69 languages.

Following a recipe inspired by [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), we developed an [educational quality classifier](https://huggingface.co/HuggingFaceFW/finepdfs_edu_classifier_eng_Latn) for each of the 69 languages present in this dataset, using annotations generated by Qwen3-235B-A22B-Instruct-2507. We then used these classifiers to retain only the most educational documents. FinePDFs-Edu outperforms FinePDFs on popular benchmarks and shows the power of classifiers trained on synthetic data.

The [Dataset Curation](https://huggingface.co/datasets/HuggingFaceFW/finepdfs_edu#dataset-curation) section details the process for creating the dataset. While the dataset might seem an order of magnitude smaller than FineWeb-Edu, unlike its web ancestor it is globally deduplicated!

![datasets_comparison_edu](https://cdn-uploads.huggingface.co/production/uploads/626ede24d2fa9e7d598c8709/ivVKeFDP2J2MAyQL9s4xy.png)

## What is being released?

Along with the dataset, which includes all filtered CommonCrawl dumps from `CC-MAIN-2013-20` to `CC-MAIN-2025-08`, we also release:

- The [educational classifier](https://huggingface.co/HuggingFaceFW/finepdfs_edu_classifier_eng_Latn) used for the filtering (one for each language).
- The [dataset](https://huggingface.co/datasets/HuggingFaceFW/finepdfs_eng_Latn_labeled) with educational (and 3 other) labels by Qwen3-235B-A22B-Instruct-2507 for English.
- The [dataset](HuggingFaceFW/finepdfs_fw_edu_labeled) with educational labels by Qwen3-235B-A22B-Instruct-2507 for 69 languages beyond English.
- The [code](https://github.com/huggingface/finepdfs) for training the classifiers and running inference.

## How to download and use 📄 FinePDFs-Edu

See the tables above for the `subset` of the language you want to download.

We currently do not provide smaller `sample` versions, but by setting `limit` or using `streaming=True` you can easily fetch a sample of the data. If there is interest from the community, we might upload smaller sampled versions later on.

### Using 🏭 [`datatrove`](https://github.com/huggingface/datatrove/)

```python
from datatrove.pipeline.readers import ParquetReader

# limit determines how many documents will be streamed (remove for all)
# this will fetch the Portuguese filtered data
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/finepdfs-edu/data/por_Latn/train", limit=1000)
for document in data_reader():
    # do something with document
    print(document)

###############################
# OR for a processing pipeline:
###############################

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/finepdfs-edu/data/por_Latn/train", limit=1000),
        # keep only documents whose text contains "hugging"
        LambdaFilter(lambda doc: "hugging" in doc.text),
        JsonlWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()
```

### Using `huggingface_hub`

```python
from huggingface_hub import snapshot_download

folder = snapshot_download(
    "HuggingFaceFW/finepdfs-edu",
    repo_type="dataset",
    local_dir="./finepdfs-edu/",
    # download the Czech filtered data
    allow_patterns=["data/ces_Latn/train/*"])
```

For faster downloads, install the extra with `pip install huggingface_hub[hf_transfer]` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER=1`.

### Using `datasets`

```python
from datasets import load_dataset

# get Croatian data
fw = load_dataset("HuggingFaceFW/finepdfs-edu", name="hrv_Latn", split="train", streaming=True)
```

Similar to the original FinePDFs, this dataset contains a high proportion of language-switching samples. If this is not desired, we recommend using the [filtering function](https://huggingface.co/datasets/HuggingFaceFW/finepdfs#code-switching).

## Dataset curation

We used the same approach as for FineWeb-Edu, with minimal adjustments to the prompt. To scale to languages beyond English, we decided to train a separate classifier for each language.

### Educational Scoring

We used [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) to score approximately 300,000 FinePDFs samples for educational quality on a 0–5 scale. The final prompt used for scoring is available [here](https://huggingface.co/HuggingFaceFW/finepdfs_edu_classifier_eng_Latn/blob/main/prompt.txt).

After experimenting with several prompt variants, we found that the **FineWeb-Edu** prompt yielded the most consistent and reliable results. As in FineWeb-Edu, we observed that highly technical or graduate-level content did not correlate well with the benchmarks we track. However, unlike in FineWeb-Edu, the overall average score was noticeably lower: with a fixed threshold of `score = 3`, only about 2% of samples would have been retained. To address this, we instead selected the **top 10%** of samples based on their educational score.

| Threshold | Drop Rate |
| :-------: | :-------: |
| 1 | 0.3028 |
| 2 | 0.9451 |
| 3 | 0.9802 |
| 4 | 0.9906 |
| 5 | 0.9987 |

We also replaced the teacher model to improve multilingual coverage and take advantage of the better inference efficiency offered by Mixture-of-Experts (MoE) architectures. To identify a suitable model, we aimed for one that was most *“Claude-like”*, i.e., whose scoring behavior most closely matched **Claude Sonnet-4**.
We compared models using mean squared error (MSE) on a 10k-sample development set and found that **Qwen3-235B-A22B-Instruct-2507** was both the most Claude-like and highly efficient, processing up to **14 chunks/sec on a single H100 GPU**.

| Model | MSE (vs. Sonnet-4) |
| :-------------------------------------------- | -----------------: |
| Qwen_Qwen3-235B-A22B-Instruct-2507 | **0.398** |
| Qwen_Qwen3-235B-A22B-Thinking-2507 | 0.812 |
| Qwen_Qwen3-30B-A3B-Instruct-2507 | 0.364 |
| Qwen_Qwen3-30B-A3B-Thinking-2507 | 0.925 |
| google_gemma-3-27b-it | 2.727 |
| meta-llama_Llama-3.3-70B-Instruct | 0.553 |
| meta-llama_Llama-4-Maverick-17B-128E-Instruct | 0.707 |
| meta-llama_Llama-4-Scout-17B-16E-Instruct | 1.177 |
| mistralai_Magistral-Small-2507 | 0.717 |
| zai-org_GLM-4.5-Air-FP8 | 0.510 |

For long documents, we take the first 2,048 tokens from the top of the document. If the document exceeds 10,000 characters, we also take the last 2,048 tokens and compute the final score as `max(top_score, bottom_score)`.

### Classifier Training

We fine-tuned a BERT-like regression model using these annotations, based on [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) for English and [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) for other languages. Both models achieved the best F1 performance among the options we evaluated, while supporting FlashAttention-2 (FA2), which allowed us to label over 220 samples per second on an H100 GPU.

For each model, we unfroze both the classifier head and the last four transformer layers. To address severe class imbalance, we rebalanced the training data.

The resulting classifiers are available at: `https://huggingface.co/HuggingFaceFW/finepdfs_edu_classifier_{lang}`

### Filtering and results

We then built 📚 FinePDFs-Edu by filtering out the 90% of samples with the lowest educational score for each language.
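
As a rough illustration of the two rules described above (head/tail scoring of long documents and the per-language top-10% cut), here is a minimal sketch. It is not the released pipeline: the whitespace tokenizer, the `score_fn` callable, the `"language"` and `"score"` field names, and the tie-breaking and rounding behavior are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

WINDOW_TOKENS = 2048     # window size stated in the text
LONG_DOC_CHARS = 10_000  # character length above which the tail is also scored


def score_document(text: str, score_fn: Callable[[Sequence[str]], float]) -> float:
    """Score the head of a document and, for long documents, the tail as well."""
    # Assumption: whitespace splitting stands in for the real tokenizer.
    tokens = text.split()
    top_score = score_fn(tokens[:WINDOW_TOKENS])
    if len(text) <= LONG_DOC_CHARS:
        return top_score
    bottom_score = score_fn(tokens[-WINDOW_TOKENS:])
    return max(top_score, bottom_score)


def keep_top_percent(docs: List[Dict], keep_percent: int = 10) -> List[Dict]:
    """Keep the highest-scoring `keep_percent` of documents within each language."""
    # Hypothetical record layout: {"language": str, "score": float, ...}
    by_language: Dict[str, List[Dict]] = defaultdict(list)
    for doc in docs:
        by_language[doc["language"]].append(doc)

    kept: List[Dict] = []
    for lang_docs in by_language.values():
        # Integer ceiling, so a small language still keeps at least one document.
        n_keep = (len(lang_docs) * keep_percent + 99) // 100
        ranked = sorted(lang_docs, key=lambda d: d["score"], reverse=True)
        kept.extend(ranked[:n_keep])
    return kept
```

Ranking within each language, rather than applying one global cutoff, keeps the retained fraction the same for every language even when score distributions differ between classifiers.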
Our ablations showed that this refined dataset surpasses 📄 FinePDFs and all other open web datasets, with notable improvements on educational benchmarks such as MMLU and ARC.

You will find all the ablation models and datasets in [this collection](https://huggingface.co/collections/HuggingFaceFW/finepdfs).

## Considerations for Using the Data

See: [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs).

## Additional Information

### Licensing Information

The dataset is released under the **Open Data Commons Attribution License (ODC-By) v1.0** [license](https://opendatacommons.org/licenses/by/1-0/). The use of this dataset is also subject to [CommonCrawl's Terms of Use](https://commoncrawl.org/terms-of-use).

## Citation Information

```
@misc{kydlicek2025finepdfs,
      title={FinePDFs},
      author={Hynek Kydl{\'\i}{\v{c}}ek and Guilherme Penedo and Leandro von Werra},
      year={2025},
      publisher = {Hugging Face},
      journal = {Hugging Face repository},
      howpublished = {\url{https://huggingface.co/datasets/HuggingFaceFW/finepdfs_edu}}
}
```

Provider:
maas
Created:
2025-11-12