minipile

Name: minipile
Creator: OpenDataLab
License: 暂无描述

OpenXLab2026-04-18 收录

下载链接：

https://openxlab.org.cn/datasets/OpenDataLab/minipile

下载链接

链接失效反馈

官方服务：

资源简介：

MiniPile is a 6GB subset of the deduplicated The Pile corpus. To curate MiniPile, we perform a simple, three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using k-means, and (3) filter out low-quality clusters. The primary motivation for curating MiniPile is that (i) diverse pre-training datasets (like the Pile) are often too large for academic budgets and (ii) most smaller-scale datasets are fairly homogeneous and thereby unrepresentative of contemporary general-purpose language models. MiniPile aims to fill this gap and thereby facilitate data-efficient research on model architectures, training procedures, optimizers, etc.

提供机构：

OpenDataLab

创建时间：

2023-12-14