
EleutherAI/pile_val_test

Hugging Face | Updated 2026-02-23 | Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/EleutherAI/pile_val_test
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
pretty_name: The Pile - Validation & Test Splits
---

# The Pile: Validation and Test Splits

This repo contains the validation and test splits of [The Pile](https://pile.eleuther.ai/), an 825 GiB English text dataset designed for training large language models.

## Files

| File | Split | Size |
|------|-------|------|
| `val.jsonl` | Validation | 1.4 GB |
| `test.jsonl` | Test | 1.3 GB |

## Format

Each line is a JSON object with two fields:

```json
{"text": "The document text...", "meta": {"pile_set_name": "Pile-CC"}}
```

The `meta.pile_set_name` field indicates which of the 22 constituent datasets the document came from (e.g., Pile-CC, PubMed Central, ArXiv, GitHub, etc.).

## Citation

```bibtex
@article{gao2020pile,
  title={The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}
```
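## Reading the files (illustrative sketch)

The snippet below is a minimal sketch of how the JSON Lines format described in this card could be read with Python's standard library. It is not part of the original dataset card. The helper names `iter_documents` and `count_by_subset` are hypothetical, and the sample records are made-up placeholders that only mirror the documented `text` and `meta.pile_set_name` fields.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed document per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_subset(lines: Iterable[str]) -> Counter:
    """Count documents per constituent dataset using meta.pile_set_name."""
    counts = Counter()
    for doc in iter_documents(lines):
        counts[doc["meta"]["pile_set_name"]] += 1
    return counts


# Placeholder records in the documented format (not real Pile data).
sample = [
    '{"text": "First document.", "meta": {"pile_set_name": "Pile-CC"}}',
    '{"text": "Second document.", "meta": {"pile_set_name": "ArXiv"}}',
    '{"text": "Third document.", "meta": {"pile_set_name": "Pile-CC"}}',
]
sample_counts = count_by_subset(sample)
print(sample_counts.most_common())

# For a downloaded split, pass the open file handle instead, assuming it has
# been saved locally as val.jsonl:
#     with open("val.jsonl", encoding="utf-8") as f:
#         counts = count_by_subset(f)
```

Because both helpers consume lines lazily, the same code should work on the multi-gigabyte split files without loading them into memory at once.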
Provider:
EleutherAI