JeanKaddour/minipile

Name: JeanKaddour/minipile
Creator: JeanKaddour
Published: 2023-06-20 10:08:26
License: no description provided

Source: Hugging Face. Updated 2023-06-20; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/JeanKaddour/minipile
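
A minimal sketch of fetching the dataset programmatically, assuming the Hugging Face `datasets` package is installed and network access is available. The helper name `load_minipile` and its `endpoint` parameter are hypothetical; `HF_ENDPOINT` is the environment variable the Hugging Face client reads when pointed at an alternative endpoint such as the mirror linked above.

```python
import os

REPO_ID = "JeanKaddour/minipile"   # dataset identifier from this page
MIRROR = "https://hf-mirror.com"   # mirror host taken from the link above


def load_minipile(split="validation", endpoint=None):
    """Load one split of MiniPile (hypothetical helper, not part of any library)."""
    if endpoint is not None:
        # Only takes effect if set before the Hugging Face libraries are first imported.
        os.environ["HF_ENDPOINT"] = endpoint
    # Imported lazily because `datasets` is a third-party dependency.
    from datasets import load_dataset
    return load_dataset(REPO_ID, split=split)
```

Calling `load_minipile("validation", endpoint=MIRROR)` would request the small validation split rather than the full training set, which is a cheap way to confirm that the download path works.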

Resource overview:

MiniPile is a 6GB subset of The Pile corpus, designed to support data-efficient language model research. To create MiniPile, we performed a simple three-step data filtering process: (1) infer embeddings for all documents in The Pile, (2) cluster the embedding space using k-means, and (3) filter out low-quality clusters. The motivation for MiniPile is twofold: (i) diverse pre-training datasets such as The Pile are typically too large for academic budgets, and (ii) most smaller-scale datasets are fairly homogeneous and therefore unrepresentative of contemporary general-purpose language models. MiniPile aims to fill this gap and thereby facilitate data-efficient research on model architectures, training procedures, optimizers, and related topics. Further details about the curation process and some pre-training results can be found in the MiniPile paper.
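
The three filtering steps can be illustrated with a toy sketch. This is not the authors' implementation: the `embed` function is a two-number stand-in for a real embedding model, `kmeans` is plain Lloyd's algorithm seeded with the first `k` points, and the `is_low_quality` callback is a hypothetical placeholder for whatever judgement is used to flag a cluster.

```python
import math


def embed(doc):
    """Toy stand-in for an embedding model: (letter fraction, digit fraction)."""
    n = max(len(doc), 1)
    letters = sum(c.isalpha() for c in doc) / n
    digits = sum(c.isdigit() for c in doc) / n
    return (letters, digits)


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    centroids = list(points[:k])  # deterministic initialisation, for illustration only
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def filter_corpus(docs, k, is_low_quality):
    """Steps (1)-(3): embed, cluster, then drop every cluster flagged as low quality."""
    points = [embed(d) for d in docs]
    labels = kmeans(points, k)
    clusters = {}
    for doc, label in zip(docs, labels):
        clusters.setdefault(label, []).append(doc)
    kept = []
    for members in clusters.values():
        if not is_low_quality(members):
            kept.extend(members)
    return kept


def mostly_digits(members):
    """Hypothetical quality rule: flag clusters whose documents are mainly digits."""
    return sum(embed(d)[1] for d in members) / len(members) > 0.5


docs = [
    "The cat sat on the mat.",
    "4815162342 0000 1111",
    "Language models need data.",
    "9999 8888 7777 6666",
]
kept = filter_corpus(docs, k=2, is_low_quality=mostly_digits)
print(kept)
```

The point of filtering whole clusters rather than single documents is that one decision per cluster scales to a corpus far too large to review document by document.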

Provided by:

JeanKaddour

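Original information summary

Dataset overview

Dataset name

MiniPile

Dataset size

Download size: 3177432813 bytes (about 3.18 GB)
Dataset size: 5967446087 bytes (about 5.97 GB)

Dataset features

Feature name: text
Data type: string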

Dataset splits

Training set:
- Number of examples: 1000000
- Bytes: 5906108510
Validation set:
- Number of examples: 500
- Bytes: 2779386
Test set:
- Number of examples: 10000
- Bytes: 58558191
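
As a quick sanity check on the figures above, the per-split byte counts should add up to the reported dataset size. The sketch below uses only numbers listed on this page; the variable names are illustrative.

```python
# Figures copied from the listing above.
DATASET_BYTES = 5967446087
DOWNLOAD_BYTES = 3177432813
splits = {
    "train": {"examples": 1000000, "bytes": 5906108510},
    "validation": {"examples": 500, "bytes": 2779386},
    "test": {"examples": 10000, "bytes": 58558191},
}

total_bytes = sum(s["bytes"] for s in splits.values())
total_examples = sum(s["examples"] for s in splits.values())
matches = total_bytes == DATASET_BYTES

for name, s in splits.items():
    avg = s["bytes"] / s["examples"]
    print(f"{name}: {avg:.0f} bytes per example on average")
print(f"splits sum to the reported dataset size: {matches}")
print(f"download size / dataset size: {DOWNLOAD_BYTES / DATASET_BYTES:.2f}")
```

The three splits do sum exactly to the reported dataset size, and the download is a little over half that size.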

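Language

Language: English (EN)

License

License type: other

Multilinguality

Multilinguality: monolingual

Dataset source

Source datasets: original

Task categories

Task categories:
- Text generation
- Fill-mask

Task IDs

Task IDs:
- Language modeling
- Masked language modeling

Papers with Code ID

paperswithcode_id: minipile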
