
openbmb/Ultra-FineWeb-L3

Source: Hugging Face · Updated 2026-02-09 · Indexed 2026-03-21
Download link:
https://hf-mirror.com/datasets/openbmb/Ultra-FineWeb-L3
Resource description:
---
language:
- en
- zh
license: apache-2.0
task_categories:
- text-generation
pretty_name: Ultra-FineWeb-L3
tags:
- llm
- pretraining
- web-data
- data-synthesis
- high-quality
configs:
- config_name: ultrafineweb_en_l3
  data_files: "data/ultrafineweb_en_l3/*.jsonl"
- config_name: ultrafineweb_zh_l3
  data_files: "data/ultrafineweb_zh_l3/*.jsonl"
default_config_name: ultrafineweb_en_l3
---

# Ultra-FineWeb-L3

Ultra-FineWeb-L3 is a high-quality refined web pre-training dataset, produced through multi-format synthesis and rewriting based on the [UltraData](https://ultradata.openbmb.cn/blog/position-paper) L0-L4 Tiered Data Management Framework.

## 📚 Overview

Starting from quality-selected web data ([Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)), we apply LLM-driven synthesis and refinement to produce structured, high-quality content across multiple formats.

## 🏗️ Data Processing Pipeline

The L3 refinement process transforms raw web text into structured content with clear reasoning and diverse pedagogical formats through the following steps:

- **Q&A Pair Generation**: Rewrite declarative web content into question-answer pairs with explicit reasoning steps, categorized by difficulty level.
- **Multi-turn Conversation Synthesis**: Convert web content into multi-turn dialogues simulating various interaction scenarios (e.g., teacher-student, interview, debate).
- **Multi-style Rewriting**: Rewrite source content into multiple styles (e.g., textbook, Wikipedia, blog, popular science, academic paper) to improve diversity and model generalization.
- **Knowledge Extraction & Textbook Generation**: Extract key knowledge points from web content and generate systematic textbook-style learning materials.
- **Format Repair & Enhancement**: Fix formatting issues and enhance content coherence to achieve high-quality standards.

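
The card's front matter declares two configs, `ultrafineweb_en_l3` (the default) and `ultrafineweb_zh_l3`, each pointing at a glob of JSONL files. As a minimal sketch of how those configs map onto files, the following reads records from a local copy of the repository using only the Python standard library. The function name `iter_records` and the `repo_root` argument are hypothetical, and the card does not document the per-record fields, so the sketch treats each line as an opaque JSON object.

```python
import json
from pathlib import Path
from typing import Iterator

# Config names and glob patterns copied from the card's front matter.
CONFIG_GLOBS = {
    "ultrafineweb_en_l3": "data/ultrafineweb_en_l3/*.jsonl",
    "ultrafineweb_zh_l3": "data/ultrafineweb_zh_l3/*.jsonl",
}
DEFAULT_CONFIG = "ultrafineweb_en_l3"


def iter_records(repo_root: str, config: str = DEFAULT_CONFIG) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the config's files.

    `repo_root` is assumed to be a local directory that mirrors the
    dataset repository layout (hypothetical; adjust to your own copy).
    """
    if config not in CONFIG_GLOBS:
        raise ValueError(f"unknown config: {config!r}")
    # Sort for a deterministic file order across platforms.
    for path in sorted(Path(repo_root).glob(CONFIG_GLOBS[config])):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines between records
                    yield json.loads(line)
```

Because the record schema is not stated here, a reasonable first step is to inspect the keys of the first record, for example `next(iter_records("/path/to/Ultra-FineWeb-L3")).keys()`. With the Hugging Face `datasets` library installed, `load_dataset("openbmb/Ultra-FineWeb-L3", "ultrafineweb_zh_l3", streaming=True)` should select the same config without a local copy, though that call is not shown in the card itself.
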
## ❤️ Acknowledgements

- **Data Framework**: [UltraData](https://ultradata.openbmb.cn/blog/position-paper)
- **Synthesis Models**: [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct), [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B), [GLM-4.5](https://huggingface.co/zai-org/GLM-4.5)

## 📖 Citation

If you find **Ultra-FineWeb-L3** useful in your research, please consider citing:

```bibtex
@misc{ultra-fineweb-l3,
  title={Ultra-FineWeb-L3},
  author={UltraData Team},
  year={2026},
  url={https://huggingface.co/datasets/openbmb/Ultra-FineWeb-L3},
  publisher={Hugging Face}
}
```

## 📜 License

This project is licensed under the [Apache 2.0](./LICENSE) license.

Provided by:
openbmb