cea-list-ia/Manu-FineWeb

Source: Hugging Face. Updated 2026-02-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/cea-list-ia/Manu-FineWeb
Resource description:
---
license: odc-by
task_categories:
- fill-mask
- text-generation
language:
- en
tags:
- manufacturing
- engineering
size_categories:
- 1M<n<10M
---

# Manu-FineWeb

**Manu-FineWeb** is a high-quality, large-scale corpus curated specifically for the **manufacturing domain**. It was extracted from the 15-trillion-token FineWeb dataset and refined to support efficient domain-specific pretraining of models such as **ManufactuBERT**.

## Dataset Summary

- **Developed by:** Robin Armingaud and Romaric Besançon (Université Paris-Saclay, CEA, List)
- **Statistics:** 2B tokens / 4.5 million documents

## Construction & Curation

The dataset was built with a two-stage pipeline designed to ensure high relevance and low redundancy:

### 1. Domain-Specific Filtering

A **fastText classifier** was trained on a positive set of manufacturing-specific sources and used to filter the general FineWeb corpus. The training sources included:

* **Elsevier:** Abstracts from industrial and manufacturing engineering journals.
* **ArXiv:** Abstracts from categories such as physics, computer science, and engineering that relate to industrial processes.
* **Wikipedia:** Articles from manufacturing and engineering categories.
* **BigPatent:** Patent descriptions containing "manufacturing" keywords.

### 2. Multi-Stage Deduplication

To improve training efficiency, the 10B-token corpus was reduced by ~80%, to about 2B tokens, through:

* **Lexical Deduplication (MinHash):** Eliminating near-exact text duplicates.
* **Semantic Deduplication (SemDeDup):** Identifying and removing semantically redundant documents using sentence embeddings (all-MiniLM-L6-v2), leaving only the most representative data points.
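
The lexical deduplication step can be illustrated with a minimal MinHash sketch. This is not the authors' pipeline: the shingle size, number of hash functions, similarity threshold, and all function names below are assumptions chosen for illustration, and the pairwise comparison loop stands in for the locality-sensitive hashing that large-scale pipelines typically use to avoid quadratic cost.

```python
import hashlib
import re

# All three values are illustrative assumptions, not taken from the dataset card.
NUM_HASHES = 64      # number of hash functions in each signature
SHINGLE_SIZE = 3     # word n-gram length
THRESHOLD = 0.7      # estimated Jaccard similarity above which a document is dropped


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of lowercase word k-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each seed, keep the minimum hash value over all shingles."""
    doc_shingles = shingles(text)
    return [min(seeded_hash(s, seed) for s in doc_shingles)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=THRESHOLD):
    """Keep a document only if it is not too similar to any document kept so far."""
    kept, kept_signatures = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_signatures):
            kept.append(doc)
            kept_signatures.append(sig)
    return kept


doc_a = "CNC milling machines remove material from a metal workpiece."
doc_a_variant = "cnc milling machines remove material from a metal workpiece!"
doc_b = "Injection molding forces molten polymer into a steel mold cavity."

unique_docs = deduplicate([doc_a, doc_a_variant, doc_b])
```

In this toy input, `doc_a_variant` differs from `doc_a` only in case and punctuation, so both produce the same shingle set and the second copy is dropped, while `doc_b` shares no shingles with either and is kept. Semantic deduplication would follow the same keep-or-drop pattern but compare sentence embeddings instead of MinHash signatures.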
## Citation

If you use ManufactuBERT in your research, please cite:

```bibtex
@misc{armingaud2025manufactubertefficientcontinualpretraining,
      title={ManufactuBERT: Efficient Continual Pretraining for Manufacturing},
      author={Robin Armingaud and Romaric Besançon},
      year={2025},
      eprint={2511.05135},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.05135},
}
```
Provider:
cea-list-ia