oak

ModelScope community, updated 2025-11-27, indexed 2025-09-27
Download link:
https://modelscope.cn/datasets/tabularisai/oak
Resource description:
### NEWS:

- A new version of the dataset with 120,000,000 more tokens has been uploaded: **OAK v1.1**

# Open Artificial Knowledge (OAK) Dataset

<p align="center">
  <img src="oak_logo.png" alt="OAK LOGO" width="320">
</p>

## Overview

The Open Artificial Knowledge (OAK) dataset is a large-scale resource of over 650 million tokens designed to address the challenges of acquiring high-quality, diverse, and ethically sourced training data for Large Language Models (LLMs). OAK leverages an ensemble of state-of-the-art LLMs to generate high-quality text across diverse domains, guided by Wikipedia's main categories.

## Key Features

- **653,552,076** tokens of high-quality synthetic data
- Generated using **GPT-4o, LLaMa3-70B, LLaMa3-8B, Mixtral-8x7B, Gemma-7B, and Gemma-2-9B**
- Broad knowledge coverage
- Designed to foster the development of more capable and aligned language models

# Download

```python
from datasets import load_dataset

# Load the train split from the Hugging Face Hub and inspect the first record
ds = load_dataset("tabularisai/oak", split="train")
ds[0]
```

## Dataset Creation Process

1. **Subject Extraction**: High-level topics are extracted from Wikipedia.
2. **Subtopic Expansion**: Topics are expanded into detailed subtopics using advanced language models like GPT-4o.
3. **Prompt Generation**: Prompts are created using programmatic prompt engineering and meta-prompt techniques.
4. **Text Generation**: Content is generated using various open-source LLMs.

## Future Work

- Increase dataset volume
- Add more languages
- Incorporate more advanced and diverse models
- Refine the dataset's application in code-related tasks
- Foster community contributions

## Citation

```bib
@misc{borisov2024open,
  title={Open Artificial Knowledge},
  author={Vadim Borisov and Richard H. Schreiber},
  year={2024},
  eprint={2407.14371},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.14371},
}
```

## Links

- Website: www.oakdataset.org
- Paper: https://arxiv.org/abs/2407.14371
- GitHub: https://github.com/tabularis-ai/oak-dataset

## Disclaimer

Users must adhere to ethical guidelines, respect privacy considerations, and be mindful of potential biases in the synthetic data. The OAK dataset is intended for research purposes only.

## Contact

For questions or more data, please contact: `info@tabularis.ai`

www.tabularis.ai
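
As a rough illustration of the four-step Dataset Creation Process described above, the following is a minimal sketch and not the authors' actual pipeline. All names here (`expand_subtopics`, `build_prompts`, `run_pipeline`, the record keys, and the stub models) are hypothetical and chosen only for this example; the stubs stand in for real LLM calls so the flow can be run offline.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-in for an LLM call: takes a prompt string, returns text.
Generator = Callable[[str], str]


def expand_subtopics(subject: str, llm: Generator, n: int = 3) -> List[str]:
    """Step 2 (sketch): ask a model for subtopics, expecting one per line."""
    reply = llm(f"List {n} subtopics of {subject}, one per line.")
    lines = [line.strip() for line in reply.splitlines()]
    return [line for line in lines if line][:n]


def build_prompts(subject: str, subtopic: str, templates: List[str]) -> List[str]:
    """Step 3 (sketch): fill prompt templates programmatically."""
    return [t.format(subject=subject, subtopic=subtopic) for t in templates]


def run_pipeline(
    subjects: Iterable[str],
    expander: Generator,
    writers: Dict[str, Generator],
    templates: List[str],
) -> Iterator[dict]:
    """Steps 1-4 (sketch): subjects -> subtopics -> prompts -> generated text."""
    for subject in subjects:  # Step 1: subjects are assumed to be given
        for subtopic in expand_subtopics(subject, expander):
            for prompt in build_prompts(subject, subtopic, templates):
                for name, writer in writers.items():  # Step 4: one text per model
                    yield {
                        "subject": subject,
                        "subtopic": subtopic,
                        "prompt": prompt,
                        "model": name,
                        "text": writer(prompt),
                    }


# Stub models so the sketch runs without any API access.
def fake_expander(prompt: str) -> str:
    return "History\nApplications"


def fake_writer(prompt: str) -> str:
    return f"[synthetic text for: {prompt}]"


TEMPLATES = ["Write a short article about {subtopic} in the context of {subject}."]

records = list(
    run_pipeline(["Mathematics"], fake_expander, {"stub-model": fake_writer}, TEMPLATES)
)
print(len(records), records[0]["prompt"])
```

In a real setting, `fake_expander` and `fake_writer` would be replaced by calls to actual models; the record keys used here are illustrative and are not claimed to match the schema of the published dataset.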
Provider:
maas
Created:
2025-09-26