The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Name: The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Creator: academictorrents.com
License: No description available

Indexed on 2025-03-22 from academictorrents.com

Download link:

https://academictorrents.com/details/0d366035664fdf51cfbe9f733953ba325776e667

Resource description:

## What is the Pile?

The Pile is an 825 GiB diverse, open-source language modeling dataset that consists of 22 smaller, high-quality datasets combined together.

## Why is the Pile a good training set?

Recent work has shown that, especially for large models, diversity in data sources improves the model's general cross-domain knowledge as well as its downstream generalization capability. In our evaluations, models trained on the Pile show moderate improvements on traditional language modeling benchmarks and significant improvements on Pile BPB.

## Why is the Pile a good benchmark?

To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains, including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text modeling ability for la…
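The description names bits per byte (BPB) without showing how it is computed. The following is a minimal sketch, assuming the common definition: the model's summed negative log-likelihood over a text, converted from nats to bits, divided by the number of UTF-8 bytes in that text. The function name `bits_per_byte`, the sample string, and the loss value of 20 nats are hypothetical and do not come from the Pile's own evaluation code.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte.

    Assumption: total_nll_nats is the sum of per-token losses a model
    assigns to `text`, using natural-log cross-entropy.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    total_bits = total_nll_nats / math.log(2)  # nats -> bits
    return total_bits / n_bytes


# Hypothetical numbers: a model assigns a summed loss of 20 nats to this string.
sample = "The Pile is diverse."
print(bits_per_byte(20.0, sample))
```

Because the denominator counts bytes rather than tokens, models with different tokenizers can be compared on the same text; lower values indicate better modeling.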

Provider:

academictorrents.com
