pile

OpenCSG | Updated 2024-07-19 | Indexed 2025-05-03

Download link:

https://www.opencsg.com/datasets/AIWizards/pile

Resource description:

The Pile is a diverse, open-source language modeling dataset of 825 GiB, built by combining 22 smaller, high-quality datasets. It is intended mainly for text generation and fill-mask tasks, that is, for language modeling and masked language modeling. The text is in English, and the dataset is offered as multiple subsets, including Enron Emails, the European Parliament corpus, FreeLaw, Hacker News, NIH ExPorter, PubMed, PubMed Central, Ubuntu IRC, USPTO, and GitHub. Each subset provides the text content along with metadata such as source, ID, and author. Licensing varies by subset; for example, this listing gives the MIT License for PubMed Central. The dataset supports standardized data operations to simplify analysis and modeling.
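The record layout described above (text content plus a metadata object) can be sketched as follows. This is a minimal illustration, not the dataset's documented schema: it assumes records are stored one JSON object per line, and the field names `text`, `meta`, and `source` are assumptions chosen to match the description. The sample rows are invented for the example.

```python
import io
import json
from collections import Counter
from typing import Iterable, Iterator, TextIO


def iter_records(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_source(records: Iterable[dict], key: str = "source") -> Counter:
    """Count records per subset, using an assumed metadata key."""
    counts: Counter = Counter()
    for record in records:
        meta = record.get("meta") or {}  # tolerate missing or null metadata
        counts[meta.get(key, "unknown")] += 1
    return counts


# Hypothetical sample rows; the field names are assumptions, not a documented schema.
SAMPLE = "\n".join(
    [
        json.dumps({"text": "Meeting moved to 3pm.", "meta": {"source": "Enron Emails", "id": "1"}}),
        json.dumps({"text": "Re: budget draft", "meta": {"source": "Enron Emails", "id": "2"}}),
        json.dumps({"text": "Show HN: a tiny parser", "meta": {"source": "Hacker News", "id": "3"}}),
        json.dumps({"text": "A row with no metadata"}),
    ]
)

if __name__ == "__main__":
    counts = count_by_source(iter_records(io.StringIO(SAMPLE)))
    for name, n in sorted(counts.items()):
        print(f"{name}: {n}")
```

Reading line by line keeps memory use flat, which matters for a corpus of this size; to process a real file, pass an open file handle in place of the `io.StringIO` sample.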

Created:

2024-07-19