jonatli/the_pile_mystic
Source: Hugging Face. Updated: 2023-01-18. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/jonatli/the_pile_mystic
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license:
- other
multilinguality:
- monolingual
pretty_name: The Pile
size_categories:
- unknown
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: the-pile
---
# Dataset Card for The Pile
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://pile.eleuther.ai/
- **Repository:** https://github.com/EleutherAI/the-pile
- **Paper:** [The Pile: An 800GB Dataset of Diverse Text for Language Modeling](https://arxiv.org/abs/2101.00027)
- **Leaderboard:**
- **Point of Contact:** [EleutherAI](mailto:contact@eleuther.ai)
**This version of the Pile relies on `mystic.the-eye.eu`, a mirror of `the-eye.eu`, because `the-eye.eu` is currently down for me.**
### Dataset Summary
The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality
datasets combined together.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
This dataset is in English (`EN`).
## Dataset Structure
### Data Instances
#### all
```
{
'meta': {'pile_set_name': 'Pile-CC'},
'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on...'
}
```
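As a rough illustration of how the `meta['pile_set_name']` field of the `all` configuration might be used, the sketch below tallies records per subset. It assumes records arrive as plain dicts shaped like the instance above. The `count_by_subset` helper and the `sample_records` list are hypothetical, the placeholder texts are not real data, and the `'PubMed Central'` label is an assumed subset name.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_subset(records: Iterable[Mapping]) -> Counter:
    """Tally records by meta['pile_set_name'] (dict-valued `meta`, as in `all`)."""
    counts = Counter()
    for record in records:
        counts[record["meta"]["pile_set_name"]] += 1
    return counts


# Hypothetical in-memory sample: the placeholder texts and the
# 'PubMed Central' label are assumptions made for illustration only.
sample_records = [
    {"meta": {"pile_set_name": "Pile-CC"}, "text": "It is done, and submitted."},
    {"meta": {"pile_set_name": "Pile-CC"}, "text": "placeholder text"},
    {"meta": {"pile_set_name": "PubMed Central"}, "text": "placeholder text"},
]

subset_counts = count_by_subset(sample_records)
print(subset_counts.most_common())
```

Because the helper only iterates once over its input, it should work equally well on a list or on a lazily streamed iterable of records.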
#### enron_emails
```
{
'text': 'Name\t\t\tNew Title\t\t\t\tEffective Date\t\t\tMid Year promotion Yes/No\n\nFloyd, Jodie\t\tSr Cust Svc Rep (no change)\t\t7/16/01\t\t\t\tNo\n\nBuehler, Craig\t\tSr Mkt/Sup Analyst (no change)\t\t7/16/01\t\t\t\tNo\n\nWagoner, Mike\t\tTeam Advisor - Gas Control\t\t7/1/01\t\t\t\tNo\n\nClapper, Karen\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nGreaney, Chris\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nWilkens, Jerry\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nMinton, Kevin\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nCox, Don\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nHanagriff, Richard\tSr Accounting Control Spec\t\t8/1/01\t\t\t\tYes\n\n\nThanks,\nMS',
'meta': "{}",
}
```
#### europarl
```
{
'text': 'Uvádění biocidních přípravků na trh - Nový návrh revize týkající se biocidních přípravků (rozprava) \nPředsedající\nDalším bodem je společná rozprava o následujících tématech:\nzpráva paní Sârbuové za Výbor pro životní prostředí, veřejné zdraví a bezpečnost potravin o návrhu...',
'meta': "{'language': 'cs'}",
}
```
#### free_law
```
{
'meta': "{'case_jurisdiction': 'scotus.tar.gz', 'case_ID': '110921.json','date_created': '2010-04-28T17:12:49Z'}",
'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued...'
}
```
#### hacker_news
```
{
'text': "\nChina Deserves Donald Trump - rm2889\nhttps://www.nytimes.com/2019/05/21/opinion/china-trump-trade.html\n======\nNotPaidToPost\n> so he’d be wise to curb his nationalistic “no-one-tells-China-what-to-do”\n> bluster\n\nThis comment highlights both ignorance of Chinese history and continuing\nAmerican arrogance.\n\nChina has been painfully dictated what to do during the last 200 years. This\nhas had a profound effect on the country and has led to the collapse of\nimperial rule and the drive to 'rejuvenate'...",
'meta': "{'id': '19979654'}",
}
```
#### nih_exporter
```
{
'text': "The National Domestic Violence Hotline (NDVH) and the National Dating Abuse Helpline (NDAH), which are supported by the Division of Family Violence Prevention and Services within the Family and Youth Services Bureau, serve as critical partners in the intervention, prevention, and resource assistance efforts of the network of family violence, domestic violence, and dating violence service providers. They provide crisis intervention and support services; information about resources on domestic...",
'meta': " {'APPLICATION_ID': 100065}",
}
```
#### pubmed
```
{
'meta': {'pmid': 11409574, 'language': 'eng'},
'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that included 4,021 children under 5 with acute respiratory infections (ARI) and reported the prevalence of hypoxaemia. Out-patient children and those with a clinical diagnosis of upper ARI had a low risk of hypoxaemia (pooled estimate of 6% to 9%). The prevalence increased to 31% and to 43% in patients in emergency departments and in cases with clinical pneumonia, respectively, and it was even higher among hospitalised children (47%) and in those with radiographically confirmed pneumonia (72%). The cumulated data also suggest that hypoxaemia is more frequent in children living at high altitude. Three papers reported an association between hypoxaemia and death, with relative risks varying between 1.4 and 4.6. Papers describing predictors of hypoxaemia have focused on clinical signs for detecting hypoxaemia rather than on identifying risk factors for developing this complication. 
Hypoxaemia is a common and potentially lethal complication of ALRI in children under 5, particularly among those with severe disease and those living at high altitude. Given the observed high prevalence of hypoxaemia and its likely association with increased mortality, efforts should be made to improve the detection of hypoxaemia and to provide oxygen earlier to more children with severe ALRI.'
}
```
#### pubmed_central
```
{
'meta': "{'id': 'PMC5595690'}",
'text': 'Introduction {#acel12642-sec-0001}\n============\n\nAlzheimer\\\'s disease (AD), the most common cause of...'
}
```
#### ubuntu_irc
```
{
'text': "#ubuntu 2004-07-05\n* Window 3\n* \tServer: [0] <None>\n* \tScreen: 0x817e90c\n* \tGeometry Info: [0 11 0 11 11 11] \n* \tCO, LI are [94 49] \n* \tCurrent channel: #ubuntu\n* \tQuery User: <None> \n*\tPrompt: <None>\n* \tSecond status line is OFF\n* \tSplit line is ON triple is OFF\n* \tLogging is ON\n* \tLogfile is irclogs/ubuntu.log\n* \tNotification is OFF\n* \tHold mode is OFF\n* \tWindow level is NONE\n* \tLastlog level is ALL\n* \tNotify level is ALL\n<mdz> lifeless: using tla effectively for all packages in Warty requ...",
'meta': "{'channel': 'ubuntu', 'month': 7}"
}
```
#### uspto
```
{
'text': "1. Field of the Invention\nIn an extensive plant breeding program, Grant Merrill, originator and now deceased, originated a large number of new and distinct varieties of fruit trees, and which included the herein-claimed variety of peach tree. Such plant breeding program was undertaken in originator's experimental orchard located near Exeter, Tulare County, Calif.\n2. Prior Varieties\nAmong the existent varieties of peach trees which were known to originator, particular reference is made to Gemfree (U.S. Plant Pat. No. 1,409) and June Lady (U.S. Plant Pat. No. 3,022) hereinafter mentioned for the purpose of comparison.",
'meta': "{'bibliographic_information': {'Patent Number': 'PP0049700', 'Series Code': '6', 'Application Number': '2845415', 'Application Type': '6', 'Art unit': '337', 'Application Filing Date': '19810720', 'Title of Invention': 'Peach tree (A3-10)', 'Issue Date': '19830104', 'Number of Claims': '1', 'Exemplary Claim Number(s)': '1', 'Primary Examiner': 'Bagwill; Robert E.', 'Number of Drawing Sheets': '1', 'Number of figures': '1'}, 'source_file': 'https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/1983/pftaps19830104_wk01.zip', 'abstract': 'A peach tree which is large, vigorous, and spreading; foliated with large, lanceolate leaves having a finely serrate margin, a petiole of medium length and thickness, and medium size, reniform glands; blooms from medium size, conic, plump, pubescent buds; the flowers, medium in blooming period compared with other varieties, being of medium size, and pink; and is a regular and very productive bearer of medium but variable size, round truncate, clingstone fruit having yellow skin substantially overspread with red, yellow flesh mottled with red adjacent the skin, and an amber stone.', 'classifications': [{'OCL': ['Plt', '43'], 'EDF': ['3'], 'ICL': ['A01H', '503'], 'FSC': ['Plt'], 'FSS': ['43']}], 'inventors': [{'inventor name': 'Merrill, deceased; Grant', 'Street': '325 Breese Ave.', 'City': 'late of Red Bluff', 'State': 'CA'}, {'inventor name': 'Merrill, executrix; by Lucile B.', 'Street': '325 Breese Ave.', 'City': 'Red Bluff', 'State': 'CA', 'Zip code': '96080'}]}"
}
```
### Data Fields
#### all
- `text` (str): Text.
- `meta` (dict): Metadata of the data instance with keys:
  - `pile_set_name`: Name of the subset.
#### enron_emails
- `text` (str): Text.
- `meta` (str): Metadata of the data instance.
#### europarl
- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: language.
#### free_law
- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: case_ID, case_jurisdiction, date_created.
#### hacker_news
- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: id.
#### nih_exporter
- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: APPLICATION_ID.
#### pubmed
- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: pmid, language.
#### pubmed_central
- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: id.
#### ubuntu_irc
- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: channel, month.
#### uspto
- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: bibliographic_information, source_file, abstract, classifications,
inventors.
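Since `meta` is a dict in the `all` configuration but a string in the per-subset configurations, downstream code may want a single way to read it. The sketch below is one possible approach, assuming the string form is a Python dict literal (as the single-quoted examples above suggest, though the card does not state this). The `parse_meta` helper is hypothetical, and falling back to an empty dict on malformed input is a design choice made here, not documented dataset behavior.

```python
import ast
from typing import Any, Dict, Union


def parse_meta(meta: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return `meta` as a dict, whether it arrives as a dict or as a string."""
    if isinstance(meta, dict):
        return meta
    try:
        # literal_eval only evaluates Python literals, never arbitrary code.
        # Surrounding whitespace is stripped first (some strings start with a space).
        parsed = ast.literal_eval(meta.strip())
    except (ValueError, SyntaxError):
        return {}  # malformed metadata string: fall back to an empty dict
    return parsed if isinstance(parsed, dict) else {}


print(parse_meta("{'language': 'cs'}"))
print(parse_meta(" {'APPLICATION_ID': 100065}"))
print(parse_meta({"pile_set_name": "Pile-CC"}))
```

`ast.literal_eval` is used rather than `json.loads` because the example strings use single quotes, which are not valid JSON.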
### Data Splits
The "all" configuration is composed of 3 splits: train, validation and test.
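A minimal sketch of loading one of these splits with the Hugging Face `datasets` library follows. It makes several assumptions: that `datasets` is installed, that the repository id matches the download URL on this page, and that the loading script and the `mystic.the-eye.eu` mirror are still reachable. Depending on the installed `datasets` version, script-based repositories may need extra opt-in or may not load at all. The `load_pile_split` wrapper is a hypothetical helper, not part of the dataset.

```python
from typing import Any

REPO_ID = "jonatli/the_pile_mystic"  # taken from the download URL on this page
SPLITS = ("train", "validation", "test")  # the three splits of the "all" configuration


def load_pile_split(split: str, config: str = "all", streaming: bool = True) -> Any:
    """Load one split; streaming avoids downloading the full corpus up front."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split, streaming=streaming)
```

With network access, something like `next(iter(load_pile_split("validation")))` would be expected to yield one record with `text` and `meta` keys. Streaming is the default here because the full dataset is 825 GiB.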
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
Please refer to the specific license depending on the subset you use:
- PubMed Central: [MIT License](https://github.com/EleutherAI/pile-pubmedcentral/blob/master/LICENSE)
### Citation Information
```
@misc{gao2020pile,
title={The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
author={Leo Gao and Stella Biderman and Sid Black and Laurence Golding and Travis Hoppe and Charles Foster and Jason Phang and Horace He and Anish Thite and Noa Nabeshima and Shawn Presser and Connor Leahy},
year={2020},
eprint={2101.00027},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
### Contributions
Thanks to [@github-username](https://github.com/<github-username>) for adding this dataset.
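Provider:
jonatli
Summary of original information
Dataset overview
Dataset name
- Name: The Pile
- Alias: none
Basic dataset information
- Language: English (`EN`)
- License: other
- Multilinguality: monolingual
- Dataset size: 825 GiB
- Source datasets: original
- Task categories: text generation, fill-mask
- Task IDs: language modeling, masked language modeling
- Papers with Code ID: the-pile
Dataset structure
- Data instances: several subsets, such as Pile-CC, enron_emails and europarl. The instance structure differs slightly between subsets, but each mainly consists of a `text` field and a `meta` field.
- Data fields:
  - `text` (str): the text content.
  - `meta` (dict in the `all` configuration, str in the per-subset configurations): metadata whose contents vary by subset, such as pile_set_name, id and language.
- Data splits: training, validation and test sets.
Dataset creation
- Source data: originally collected data.
- Annotations: none.
- Personal and sensitive information: not specified.
Considerations for using the data
- Social impact: not specified.
- Discussion of biases: not specified.
- Other known limitations: not specified.
Additional information
- Dataset curators: not specified.
- Licensing information: the license may differ depending on the subset used; for example, PubMed Central uses the MIT License.
- Citation information: Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser and Connor Leahy. "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". 2020. arXiv:2101.00027 [cs.CL].
- Contributors: not specified.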
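Collected summary
Dataset introduction
Construction
In natural language processing, building large-scale corpora is essential for training language models. The Pile combines 22 high-quality subsets into a diverse text collection totalling 825 GiB. Its construction draws raw data from multiple public sources spanning academic literature, legal documents, technical forums and patent texts. Each subset is filtered and normalized to keep the text complete and consistent, yielding a structured, comprehensive dataset suited to language modeling tasks.
Features
The dataset is notable for its diversity and scale, covering text types that range from academic research to everyday communication. Its subsets include medical papers from PubMed, legal cases from FreeLaw, community discussions from Hacker News and patent documents from the USPTO, and together they exhibit cross-domain linguistic characteristics. Data instances carry metadata where available, such as a source identifier, language information or a creation date, which supports closer study of a text's context and attributes. This layered structure lets the dataset serve both general-purpose language model training and domain-specific analysis.
Usage
For language model training and evaluation, the dataset can be applied directly to core tasks such as text generation and masked language modeling. Researchers can load it through the Hugging Face platform and use its predefined splits (training, validation and test) for model training and performance validation. Users should check the license of each subset to ensure compliant use. Data instances are presented as a text field plus a metadata field, which makes custom preprocessing or feature extraction straightforward and supports research goals ranging from basic language understanding to complex text analysis.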
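Background and challenges
Background overview
In natural language processing, large-scale, high-quality language modeling datasets are a key foundation for advances in model performance. The Pile was created by the EleutherAI research team in 2020 with the aim of building an 825 GiB diverse, open-source text corpus to support the training and evaluation of state-of-the-art language models. It brings together 22 high-quality subsets covering academic literature, legal texts, technical discussions, public emails and other domains. Its core research question is how to overcome the limitations of traditional corpora in domain coverage and data quality, giving generative and masked language models richer and more balanced training resources with the aim of improving their comprehension and generation in complex contexts.
Current challenges
The Pile addresses the challenges of insufficient data diversity and domain bias in language modeling: traditional corpora tend to concentrate on news or web text, so models perform poorly in specialized or niche domains. Construction involves several difficulties. First, integrating and normalizing data sources requires handling heterogeneous formats and license agreements to ensure legal and ethical compliance. Second, quality must be kept balanced across subsets so that noisy data does not contaminate the corpus as a whole. Third, the raw data contains sensitive or personal information and requires careful cleaning and anonymization to meet the standards of responsible AI research.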
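Common scenarios
Classic use case
In natural language processing, the rise of large-scale pretrained language models has created a pressing need for high-quality, diverse text data. The Pile integrates 22 high-quality subsets from different sources into a large corpus covering academic literature, legal documents, technical forums, patent texts and other domains. Its most classic use is as pretraining data for language models, providing rich linguistic structure and knowledge representation and thereby improving generalization across downstream tasks. Researchers have used it to train models such as GPT-Neo and to explore the relationship between model scale and performance, advancing large-scale language modeling.
Derived and related work
The Pile has given rise to a series of important derivative works. The best known are the GPT-Neo and GPT-J models that the EleutherAI team trained on it. These open-source models are described as comparable in performance to some commercial models, and they advanced the open-source large-model community. In addition, many studies use subsets of the dataset for domain-adaptive training, such as biomedical text mining and legal analysis. The dataset has also been used to explore data-mixing strategies in model training, to evaluate the effect of data diversity on model performance, and to study multilingual and cross-domain transfer learning. It provided the data foundation for later projects such as Pythia and an experimental reference for others such as BLOOM.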
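Recent research on the dataset
Latest research directions
In natural language processing, the rise of large-scale pretrained language models has driven demand for high-quality, diverse text data. As an important open resource, The Pile is the focus of recent research that uses its broad content, spanning 22 subsets such as academic literature, legal texts and technical forums, to explore domain-adaptive and knowledge-enhanced language model training. Current work also analyzes potential social biases and ethical issues in the dataset in order to improve model reliability and fairness in specialized settings such as medicine and law. The dataset has also had a notable influence on open-source model development, providing key data for topics such as multi-task learning and long-text generation, and promoting transparency and reproducibility in language technology.
The above content was collected and summarized by 遇见数据集.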



