oak
Source: ModelScope community (魔搭社区). Updated 2025-11-27; indexed 2025-09-27.
Download link:
https://modelscope.cn/datasets/tabularisai/oak
### NEWS:
- A new version of the dataset with 120,000,000 more tokens has been uploaded: **OAK v1.1**
# Open Artificial Knowledge (OAK) Dataset
<p align="center">
<img src="oak_logo.png" alt="OAK LOGO" width="320">
</p>
## Overview
The Open Artificial Knowledge (OAK) dataset is a large-scale resource of over 650 million tokens designed to address the challenges of acquiring high-quality, diverse, and ethically sourced training data for Large Language Models (LLMs). OAK leverages an ensemble of state-of-the-art LLMs to generate high-quality text across diverse domains, guided by Wikipedia's main categories.
## Key Features
- **653,552,076** tokens of high-quality synthetic data
- Generated using **GPT-4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B**
- Broad knowledge coverage
- Designed to foster the development of more capable and aligned language models
## Download
```python
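from datasets import load_dataset

# Load the training split (downloaded on first use, then cached locally).
ds = load_dataset("tabularisai/oak", split="train")
print(ds[0])  # inspect the first record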
```
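With over 650 million tokens, a full download may be more than a quick inspection needs. The following is a minimal sketch, not part of the official instructions, that uses the `streaming=True` option of `load_dataset` to read records lazily. The helper names `peek` and `load_oak_stream` are hypothetical, and no column names are assumed because this README does not list them.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records without consuming the rest of the iterable."""
    return list(islice(records, n))


def load_oak_stream():
    """Open the OAK training split in streaming mode.

    Requires network access and the `datasets` package; the import is kept
    inside the function so that peek() can be used without either.
    """
    from datasets import load_dataset

    return load_dataset("tabularisai/oak", split="train", streaming=True)
```

Calling `peek(load_oak_stream(), n=3)` would then fetch only the first three records, which can be printed to discover the dataset's fields.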
## Dataset Creation Process
1. **Subject Extraction**: High-level topics are extracted from Wikipedia.
2. **Subtopic Expansion**: Topics are expanded into detailed subtopics using advanced language models like GPT-4o.
3. **Prompt Generation**: Prompts are created using programmatic prompt engineering and meta-prompt techniques.
4. **Text Generation**: Content is generated using various open-source LLMs.
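
The four steps above can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not the authors' implementation: the templates, the `Sample` record, and the `run_pipeline` function are hypothetical, each "model" is represented by a plain Python callable, and the choice to send every prompt to every generator is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical aliases: a model is any function from input text to output.
Expander = Callable[[str], List[str]]   # topic -> subtopics (step 2)
Generator = Callable[[str], str]        # prompt -> generated text (step 4)

# Illustrative templates only; the actual OAK prompts are not given here.
TEMPLATES = [
    "Write an explanatory article about {subtopic}.",
    "Explain {subtopic} to a beginner.",
]


@dataclass
class Sample:
    topic: str
    subtopic: str
    prompt: str
    model: str
    text: str


def build_prompts(subtopic: str, templates: List[str]) -> List[str]:
    """Step 3: fill each template with the subtopic."""
    return [template.format(subtopic=subtopic) for template in templates]


def run_pipeline(
    topics: List[str],
    expand: Expander,
    generators: Dict[str, Generator],
    templates: List[str] = TEMPLATES,
) -> List[Sample]:
    """Chain steps 2 to 4 over the topics produced by step 1."""
    samples = []
    for topic in topics:
        for subtopic in expand(topic):
            for prompt in build_prompts(subtopic, templates):
                # Assumption: every generator answers every prompt.
                for name, generate in generators.items():
                    samples.append(
                        Sample(topic, subtopic, prompt, name, generate(prompt))
                    )
    return samples
```

In practice `expand` and each entry of `generators` would wrap calls to real LLM endpoints; keeping them as plain callables lets the control flow be tested with stubs.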
## Future Work
- Increase dataset volume
- Add more languages
- Incorporate more advanced and diverse models
- Refine the dataset's application in code-related tasks
- Foster community contributions
## Citation
```bib
@misc{borisov2024open,
title={Open Artificial Knowledge},
author={Vadim Borisov and Richard H. Schreiber},
year={2024},
eprint={2407.14371},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.14371},
}
```
## Links
- Website: www.oakdataset.org
- Paper: https://arxiv.org/abs/2407.14371
- GitHub: https://github.com/tabularis-ai/oak-dataset
## Disclaimer
Users must adhere to ethical guidelines, respect privacy considerations, and be mindful of potential biases in the synthetic data.
The OAK dataset is intended for research purposes only.
## Contact
For questions or more data, please contact: `info@tabularis.ai`
www.tabularis.ai
Provider: maas
Created: 2025-09-26



