
RWKV/EagleX-WorldContinued

Source: Hugging Face. Updated: 2024-06-22. Indexed: 2024-07-22.
Download link:
https://hf-mirror.com/datasets/RWKV/EagleX-WorldContinued
Overview:

EagleX-WorldContinued is a pretraining dataset built from many Recursal AI datasets plus a few others. It was used to continue pretraining RWKV Eagle 7B on approximately 1.1T additional tokens (bringing the total to 2.25T), and the resulting model was released as RWKV EagleX v2. The dataset covers English, Chinese, Russian, and 100 other languages. It is stored as JSONL, with each line representing one conversation. The data is divided into chunks 0 to 9, which may vary slightly in size. The dataset was curated by M8than, KaraKaraWitch, and Darok, funded by Recursal.ai, and is licensed under CC-BY-SA 4.0.
Provider:
RWKV
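Summary of original information

Dataset overview

Dataset description

  • Name: EagleX-v2-WorldContinued
  • Languages: English, Chinese, Russian, and over 100 other languages
  • License: cc-by-sa-4.0
  • Curators: M8than, KaraKaraWitch, Darok
  • Funded by: Recursal.ai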

Data format

  • Format: JSONL
  • Content: each line represents one conversation, and each entry contains the full text
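
The line-per-conversation layout described above can be read with nothing more than the standard library. The following is a minimal sketch, not the curators' tooling: the field name `text` is an assumption (the listing only says each entry carries the full text), and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_conversations(lines: Iterable[str], text_key: str = "text") -> Iterator[str]:
    """Yield the text of each conversation from JSONL lines.

    `text_key` is an assumed field name; adjust it to match the real files.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)  # one JSON object per line
        yield record[text_key]


# Hypothetical sample data standing in for a real chunk file.
sample = [
    '{"text": "User: Hello\\nAssistant: Hi there."}',
    "",
    '{"text": "User: 你好\\nAssistant: 你好!"}',
]
conversations = list(iter_conversations(sample))
print(len(conversations))
```

For a real file, an open handle can be passed directly (for example `open(path, encoding="utf-8")`), so records are decoded one line at a time instead of loading a whole chunk into memory.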

Data splits

  • Split: final
    • Description: contains the complete conversations
  • Configurations:
    • default: path data/*/*
    • chunk0 to chunk9: paths data/dataset_chunk_0/* to data/dataset_chunk_9/*, respectively
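
As an illustration of the configuration layout above, the sketch below rebuilds the config-to-path mapping and shows one way a single chunk might be streamed with the Hugging Face `datasets` library. The repository ID is taken from the download link, and the config names `chunk0` through `chunk9` are read from the listing. The helper names are hypothetical. Streaming a real chunk needs network access and the `datasets` package, so that call is wrapped in a function and is not executed here.

```python
def config_globs(num_chunks: int = 10) -> dict[str, str]:
    """Map each config name to its data path pattern (chunk0 .. chunk9 by default)."""
    globs = {"default": "data/*/*"}
    for i in range(num_chunks):
        globs[f"chunk{i}"] = f"data/dataset_chunk_{i}/*"
    return globs


def stream_chunk(chunk: int):
    """Stream one chunk's `final` split; requires `pip install datasets` and network access."""
    from datasets import load_dataset  # imported lazily so the mapping works without it

    return load_dataset(
        "RWKV/EagleX-WorldContinued",
        f"chunk{chunk}",
        split="final",
        streaming=True,
    )


globs = config_globs()
print(globs["chunk3"])
```

Working chunk by chunk keeps each download bounded, which may be preferable to the `default` config given that the full corpus is on the order of 1.1T tokens.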

Task categories

  • Text generation
  • Fill-mask

Task IDs

  • Language modeling
  • Masked language modeling

Data sources

  • Original data

Citation information

  • References:
    • Penedo, Guilherme, et al. "FineWeb." 2024.
    • Gao, Leo, et al. "The Pile: An 800GB dataset of diverse text for language modeling." arXiv preprint arXiv:2101.00027 (2020).
    • Soboleva, Daria, et al. "SlimPajama: A 627B token cleaned and deduplicated version of RedPajama." 2023.
    • Kudugunta, Sneha, et al. "MADLAD-400: A Multilingual And Document-Level Large Audited Dataset." 2023.
    • Lozhkov, Anton, et al. "StarCoder 2 and The Stack v2: The Next Generation." 2024.
    • M8than, recursal.ai. "europarl-translation-instruct." 2024.
    • M8than, recursal.ai. "europarl-conversation." 2024.
    • KaraKaraWitch, recursal.ai. "Recursalberg." 2024.
    • Darok, KaraKaraWitch, recursal.ai. "LectureGratuits." 2024.
    • M8than, recursal.ai. "arxiv-CC0-v0.5." 2024.
    • KaraKaraWitch, recursal.ai. "Stacking Exchange." 2024.
    • KaraKaraWitch, recursal.ai. "MDN." 2024.
    • Darok, KaraKaraWitch, recursal.ai. "SCP-Recursal." 2024.
    • KaraKaraWitch, recursal.ai. "SuperWIKI-1.5." 2024.
    • KaraKaraWitch, recursal.ai. "Devopedia." 2024.
    • KaraKaraWitch, recursal.ai. "FanaticFandom." 2024.
    • KaraKaraWitch, recursal.ai. "SuperWikiNEXT-32B." 2024.