EleutherAI/pile_val_test
Source: Hugging Face. Updated 2026-02-23; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/EleutherAI/pile_val_test
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
pretty_name: The Pile - Validation & Test Splits
---
# The Pile: Validation and Test Splits
This repo contains the validation and test splits of [The Pile](https://pile.eleuther.ai/), an 825 GiB English text dataset designed for training large language models.
## Files
| File | Split | Size |
|------|-------|------|
| `val.jsonl` | Validation | 1.4 GB |
| `test.jsonl` | Test | 1.3 GB |
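As a minimal sketch of how one of these files might be fetched with only the standard library: the helper names `resolve_url` and `download` are hypothetical, and the URL layout (`/datasets/<repo>/resolve/<revision>/<file>`) and the `main` revision are assumptions about how the Hub serves files, not something this card states.

```python
import shutil
import urllib.request

REPO_ID = "EleutherAI/pile_val_test"


def resolve_url(filename, base="https://huggingface.co", revision="main"):
    """Build a direct-download URL for a file at the root of the dataset repo.

    Assumption: the file lives on the `main` revision and the host follows
    the usual Hub `resolve` URL layout.
    """
    return f"{base}/datasets/{REPO_ID}/resolve/{revision}/{filename}"


def download(filename, dest=None, **url_kwargs):
    """Stream one split to disk in chunks instead of loading it into memory."""
    dest = dest or filename
    url = resolve_url(filename, **url_kwargs)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example call (downloads roughly 1.4 GB, so it is left commented out):
# download("val.jsonl")
```

Passing `base="https://hf-mirror.com"` would point the same helper at the mirror host linked above, assuming the mirror uses the same path layout.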
## Format
Each line is a JSON object with two fields:
```json
{"text": "The document text...", "meta": {"pile_set_name": "Pile-CC"}}
```
The `meta.pile_set_name` field identifies which of the 22 constituent datasets the document came from (e.g., Pile-CC, PubMed Central, ArXiv, or GitHub).
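The format described above can be read line by line, which avoids loading a multi-gigabyte file into memory. The following is a sketch under that assumption; `iter_documents` and `count_by_subset` are illustrative names, not part of any official tooling.

```python
import json
from collections import Counter


def iter_documents(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_subset(path):
    """Count documents per `meta.pile_set_name` value."""
    return Counter(doc["meta"]["pile_set_name"] for doc in iter_documents(path))


# Example call, assuming `val.jsonl` has already been downloaded:
# for name, n in count_by_subset("val.jsonl").most_common():
#     print(f"{name}: {n}")
```

Streaming with a generator keeps memory use proportional to a single document, so the same loop should work for either split.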
## Citation
```bibtex
@article{gao2020pile,
title={The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
journal={arXiv preprint arXiv:2101.00027},
year={2020}
}
```
Provider:
EleutherAI



