multilingual-mi-llm/pile
Source: Hugging Face. Updated 2024-09-08. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/multilingual-mi-llm/pile
Resource description:
The Pile is a large, diverse, open-source language modeling dataset built by combining many smaller datasets. It aims to draw text from as many modalities as possible, so that models trained on The Pile generalize more broadly. Each component of the dataset is described by its raw size, weight, number of epochs, effective size, and mean document size. The initial release of The Pile is English-only.
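The per-component fields named in the description are related to one another, and a small sketch may make that clearer. It assumes the usual reading of these fields: effective size is raw size multiplied by the number of epochs, and weight is a component's share of the total effective size. The component names and numbers below are hypothetical placeholders, not figures from the dataset, and the `Component` class and `weights` helper are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One constituent dataset (hypothetical example values only)."""
    name: str
    raw_gib: float   # size before any up- or down-sampling
    epochs: float    # passes over this component per pass over the whole mix

    @property
    def effective_gib(self) -> float:
        # Assumed relationship: effective size = raw size * epochs.
        return self.raw_gib * self.epochs


def weights(components):
    """Return each component's share of the total effective size."""
    total = sum(c.effective_gib for c in components)
    return {c.name: c.effective_gib / total for c in components}


# Placeholder components, not real Pile statistics.
components = [
    Component("web_text", raw_gib=100.0, epochs=1.0),
    Component("papers", raw_gib=50.0, epochs=2.0),
    Component("code", raw_gib=25.0, epochs=2.0),
]

mix = weights(components)
for c in components:
    print(f"{c.name}: effective {c.effective_gib:.1f} GiB, weight {mix[c.name]:.1%}")
```

Under these assumptions, a small component seen for two epochs can carry the same weight as a component twice its raw size seen once.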
Provided by:
multilingual-mi-llm



