oak

Name: oak
Creator: maas
Published: 2025-11-27 16:49:43
License: No description available

ModelScope community: updated 2025-11-27, indexed 2025-09-27

Download link:

https://modelscope.cn/datasets/tabularisai/oak


Resource description:

### NEWS:

- A new version of the dataset with 120,000,000 more tokens has been uploaded: **OAK v1.1**

# Open Artificial Knowledge (OAK) Dataset

<p align="center">
  <img src="oak_logo.png" alt="OAK LOGO" width="320">
</p>

## Overview

The Open Artificial Knowledge (OAK) dataset is a large-scale resource of over 650 million tokens, designed to address the challenges of acquiring high-quality, diverse, and ethically sourced training data for Large Language Models (LLMs). OAK leverages an ensemble of state-of-the-art LLMs to generate high-quality text across diverse domains, guided by Wikipedia's main categories.

## Key Features

- **653,552,076** tokens of high-quality synthetic data
- Generated using **GPT-4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B**
- Broad knowledge coverage
- Designed to foster the development of more capable and aligned language models

## Download

```python
from datasets import load_dataset

ds = load_dataset("tabularisai/oak", split="train")
ds[0]
```

## Dataset Creation Process

1. **Subject Extraction**: High-level topics are extracted from Wikipedia.
2. **Subtopic Expansion**: Topics are expanded into detailed subtopics using advanced language models like GPT-4o.
3. **Prompt Generation**: Prompts are created using programmatic prompt engineering and meta-prompt techniques.
4. **Text Generation**: Content is generated using various open-source LLMs.

## Future Work

- Increase dataset volume
- Add more languages
- Incorporate more advanced and diverse models
- Refine the dataset's application in code-related tasks
- Foster community contributions

## Citation

```bib
@misc{borisov2024open,
  title={Open Artificial Knowledge},
  author={Vadim Borisov and Richard H. Schreiber},
  year={2024},
  eprint={2407.14371},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.14371},
}
```

## Links

- Website: www.oakdataset.org
- Paper: https://arxiv.org/abs/2407.14371
- GitHub: https://github.com/tabularis-ai/oak-dataset

## Disclaimer

Users must adhere to ethical guidelines, respect privacy considerations, and be mindful of potential biases in the synthetic data. The OAK dataset is intended for research purposes only.

## Contact

For questions or more data, please contact: `info@tabularis.ai`

www.tabularis.ai
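
## Example: previewing records without a full download

With over 650 million tokens, the full dataset is large, so it can help to look at a few records before downloading everything. The sketch below adapts the loading call from the Download section to the streaming mode of the `datasets` library. The helper names (`describe_record`, `peek`, `peek_oak`) are illustrative and not part of OAK, and no column names are assumed, since the card does not list them: the preview prints whatever fields each record has.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def describe_record(record: Dict[str, Any], max_chars: int = 80) -> Dict[str, str]:
    """Return a short, printable preview of every field in one record."""
    preview = {}
    for key, value in record.items():
        text = str(value)
        # Truncate long fields so the preview stays readable.
        preview[key] = text if len(text) <= max_chars else text[:max_chars] + "..."
    return preview


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, str]]:
    """Preview the first n records of any iterable of dict-like rows."""
    return [describe_record(r) for r in islice(records, n)]


def peek_oak(n: int = 3) -> List[Dict[str, str]]:
    """Stream the first n OAK records instead of downloading the whole split."""
    # Imported here so the helpers above also work without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset("tabularisai/oak", split="train", streaming=True)
    return peek(stream, n)
```

Calling `peek_oak(3)` needs network access and the `datasets` package. `peek` also works on any local list of dictionaries, which makes the helpers easy to try offline.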

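
## Example: sketch of the four-step creation process

The four steps listed under Dataset Creation Process can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the meta-prompt wording, the function names, and the record fields are all hypothetical, and `stub_llm` is an offline stand-in for a real model call.

```python
from typing import Callable, Dict, List

# Hypothetical type: any function that maps a prompt string to generated text.
Generator = Callable[[str], str]

# Hypothetical meta-prompt; not the wording used by the OAK authors.
META_PROMPT = (
    "Write a prompt asking for an informative text about "
    "'{subtopic}' within the topic '{topic}'."
)


def expand_subtopics(topic: str, llm: Generator, k: int) -> List[str]:
    """Step 2: ask a model for up to k subtopics, one per line."""
    reply = llm(f"List {k} subtopics of '{topic}', one per line.")
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    return lines[:k]


def build_prompt(topic: str, subtopic: str, llm: Generator) -> str:
    """Step 3: fill the meta-prompt and let a model write the final prompt."""
    return llm(META_PROMPT.format(topic=topic, subtopic=subtopic))


def run_pipeline(
    topics: List[str],
    expander: Generator,
    writers: Dict[str, Generator],
    k: int = 2,
) -> List[Dict[str, str]]:
    """Run steps 2-4 for each high-level topic (the output of step 1)."""
    records = []
    for topic in topics:
        for subtopic in expand_subtopics(topic, expander, k):
            prompt = build_prompt(topic, subtopic, expander)
            # Step 4: every writer model produces one text per prompt.
            for name, writer in writers.items():
                records.append({
                    "topic": topic,
                    "subtopic": subtopic,
                    "prompt": prompt,
                    "model": name,
                    "text": writer(prompt),
                })
    return records


def stub_llm(prompt: str) -> str:
    """Offline stand-in for a model call, used only for this demo."""
    if prompt.startswith("List"):
        return "Subtopic A\nSubtopic B"
    return f"[generated for: {prompt[:40]}]"
```

For example, `run_pipeline(["Mathematics"], stub_llm, {"stub-model": stub_llm})` yields one record per subtopic and writer model. Swapping `stub_llm` for real model clients would be the main change needed to experiment with this structure.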

Provider:

maas

Created:

2025-09-26
