EleutherAI/pile

Hugging Face · Updated 2023-05-03 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/EleutherAI/pile

Resource description:

---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: other
multilinguality:
- monolingual
pretty_name: the Pile
size_categories:
- 100B<n<1T
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: the-pile
---

# Dataset Card for The Pile

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

This dataset card is a work in progress. Please also see [our datasheet](https://arxiv.org/abs/2201.07311) for more detailed info.

## Dataset Description

- **Homepage:** https://pile.eleuther.ai/
- **Repository:** https://github.com/EleutherAI/the-pile
- **Paper:** [The Pile: An 800GB Dataset of Diverse Text for Language Modeling](https://arxiv.org/abs/2101.00027)
- **Leaderboard:**
- **Point of Contact:** [EleutherAI](mailto:contact@eleuther.ai)
- **Datasheet:** [Datasheet for the Pile](https://arxiv.org/abs/2201.07311)

### Dataset Summary

The Pile is an 825 GiB diverse, open-source language-modelling dataset that consists of 22 smaller, high-quality datasets combined together.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

This dataset is in English (`EN`).

## Dataset Structure

### Data Instances

#### all

```
{
  'meta': {'pile_set_name': 'Pile-CC'},
  'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on...'
}
```

<details>
  <summary>Expand to see individual components</summary>

#### enron_emails

```
{
  'text': 'Name\t\t\tNew Title\t\t\t\tEffective Date\t\t\tMid Year promotion Yes/No\n\nFloyd, Jodie\t\tSr Cust Svc Rep (no change)\t\t7/16/01\t\t\t\tNo\n\nBuehler, Craig\t\tSr Mkt/Sup Analyst (no change)\t\t7/16/01\t\t\t\tNo\n\nWagoner, Mike\t\tTeam Advisor - Gas Control\t\t7/1/01\t\t\t\tNo\n\nClapper, Karen\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nGreaney, Chris\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nWilkens, Jerry\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nMinton, Kevin\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nCox, Don\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nHanagriff, Richard\tSr Accounting Control Spec\t\t8/1/01\t\t\t\tYes\n\n\nThanks,\nMS',
  'meta': "{}",
}
```

#### europarl

```
{
  'text': 'Uvádění biocidních přípravků na trh - Nový návrh revize týkající se biocidních přípravků (rozprava) \nPředsedající\nDalším bodem je společná rozprava o následujících tématech:\nzpráva paní Sârbuové za Výbor pro životní prostředí, veřejné zdraví a bezpečnost potravin o návrhu...',
  'meta': "{'language': 'cs'}",
}
```

#### free_law

```
{
  'meta': "{'case_jurisdiction': 'scotus.tar.gz', 'case_ID': '110921.json','date_created': '2010-04-28T17:12:49Z'}",
  'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued...'
}
```

#### hacker_news

```
{
  'text': "\nChina Deserves Donald Trump - rm2889\nhttps://www.nytimes.com/2019/05/21/opinion/china-trump-trade.html\n======\nNotPaidToPost\n> so he’d be wise to curb his nationalistic “no-one-tells-China-what-to-do”\n> bluster\n\nThis comment highlights both ignorance of Chinese history and continuing\nAmerican arrogance.\n\nChina has been painfully dictated what to do during the last 200 years. This\nhas had a profound effect on the country and has led to the collapse of\nimperial rule and the drive to 'rejuvenate'...",
  'meta': "{'id': '19979654'}",
}
```

#### nih_exporter

```
{
  'text': "The National Domestic Violence Hotline (NDVH) and the National Dating Abuse Helpline (NDAH), which are supported by the Division of Family Violence Prevention and Services within the Family and Youth Services Bureau, serve as critical partners in the intervention, prevention, and resource assistance efforts of the network of family violence, domestic violence, and dating violence service providers. They provide crisis intervention and support services; information about resources on domestic...",
  'meta': " {'APPLICATION_ID': 100065}",
}
```

#### pubmed

```
{
  'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that included 4,021 children under 5 with acute respiratory infections (ARI) and reported the prevalence of hypoxaemia. Out-patient children and those with a clinical diagnosis of upper ARI had a low risk of hypoxaemia (pooled estimate of 6% to 9%). The prevalence increased to 31% and to 43% in patients in emergency departments and in cases with clinical pneumonia, respectively, and it was even higher among hospitalised children (47%) and in those with radiographically confirmed pneumonia (72%). The cumulated data also suggest that hypoxaemia is more frequent in children living at high altitude. Three papers reported an association between hypoxaemia and death, with relative risks varying between 1.4 and 4.6. Papers describing predictors of hypoxaemia have focused on clinical signs for detecting hypoxaemia rather than on identifying risk factors for developing this complication. Hypoxaemia is a common and potentially lethal complication of ALRI in children under 5, particularly among those with severe disease and those living at high altitude. Given the observed high prevalence of hypoxaemia and its likely association with increased mortality, efforts should be made to improve the detection of hypoxaemia and to provide oxygen earlier to more children with severe ALRI.'
}
```

#### pubmed_central

```
{
  'meta': "{'id': 'PMC5595690'}",
  'text': 'Introduction {#acel12642-sec-0001}\n============\n\nAlzheimer\\\'s disease (AD), the most common cause of...'
}
```

#### ubuntu_irc

```
{
  'text': "#ubuntu 2004-07-05\n* Window 3\n* \tServer: [0] <None>\n* \tScreen: 0x817e90c\n* \tGeometry Info: [0 11 0 11 11 11] \n* \tCO, LI are [94 49] \n* \tCurrent channel: #ubuntu\n* \tQuery User: <None> \n*\tPrompt: <None>\n* \tSecond status line is OFF\n* \tSplit line is ON triple is OFF\n* \tLogging is ON\n* \tLogfile is irclogs/ubuntu.log\n* \tNotification is OFF\n* \tHold mode is OFF\n* \tWindow level is NONE\n* \tLastlog level is ALL\n* \tNotify level is ALL\n<mdz> lifeless: using tla effectively for all packages in Warty requ...",
  'meta': "{'channel': 'ubuntu', 'month': 7}"
}
```

#### uspto

```
{
  'text': "1. Field of the Invention\nIn an extensive plant breeding program, Grant Merrill, originator and now deceased, originated a large number of new and distinct varieties of fruit trees, and which included the herein-claimed variety of peach tree. Such plant breeding program was undertaken in originator's experimental orchard located near Exeter, Tulare County, Calif.\n2. Prior Varieties\nAmong the existent varieties of peach trees which were known to originator, particular reference is made to Gemfree (U.S. Plant Pat. No. 1,409) and June Lady (U.S. Plant Pat. No. 3,022) hereinafter mentioned for the purpose of comparison.",
  'meta': "{'bibliographic_information': {'Patent Number': 'PP0049700', 'Series Code': '6', 'Application Number': '2845415', 'Application Type': '6', 'Art unit': '337', 'Application Filing Date': '19810720', 'Title of Invention': 'Peach tree (A3-10)', 'Issue Date': '19830104', 'Number of Claims': '1', 'Exemplary Claim Number(s)': '1', 'Primary Examiner': 'Bagwill; Robert E.', 'Number of Drawing Sheets': '1', 'Number of figures': '1'}, 'source_file': 'https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/1983/pftaps19830104_wk01.zip', 'abstract': 'A peach tree which is large, vigorous, and spreading; foliated with large, lanceolate leaves having a finely serrate margin, a petiole of medium length and thickness, and medium size, reniform glands; blooms from medium size, conic, plump, pubescent buds; the flowers, medium in blooming period compared with other varieties, being of medium size, and pink; and is a regular and very productive bearer of medium but variable size, round truncate, clingstone fruit having yellow skin substantially overspread with red, yellow flesh mottled with red adjacent the skin, and an amber stone.', 'classifications': [{'OCL': ['Plt', '43'], 'EDF': ['3'], 'ICL': ['A01H', '503'], 'FSC': ['Plt'], 'FSS': ['43']}], 'inventors': [{'inventor name': 'Merrill, deceased; Grant', 'Street': '325 Breese Ave.', 'City': 'late of Red Bluff', 'State': 'CA'}, {'inventor name': 'Merrill, executrix; by Lucile B.', 'Street': '325 Breese Ave.', 'City': 'Red Bluff', 'State': 'CA', 'Zip code': '96080'}]}"
}
```

#### github

```
{
  'text': "/* filesystem.c\n * Filesystem utility routines\n *\n * Wireshark - Network traffic analyzer\n * By Gerald Combs <gerald@wireshark.org>\n * Copyright 1998 Gerald Combs\n *\n * SPDX-License-Identifier: GPL-2.0-or-later\n */\n\n#include <config.h>\n\n#include <stdio.h>\n#include <stdlib.h>\n#include <string.h>\n#include <errno.h>\n\n#include <glib.h>...",
  'meta': "{'repo_name': 'wireshark/wireshark', 'stars': '2789', 'repo_language': 'C', 'file_name': 'packet-mpeg-audio-template.c', 'mime_type': 'text/x-c'}"
}
```

</details>

### Data Fields

#### all

- `text` (str): Text.
- `meta` (dict): Metadata of the data instance with keys:
  - pile_set_name: Name of the subset.

<details>
  <summary>Expand to see individual components</summary>

#### enron_emails

- `text` (str): Text.
- `meta` (str): Metadata of the data instance.

#### europarl

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: language.

#### free_law

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: case_ID, case_jurisdiction, date_created.

#### hacker_news

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: id.

#### nih_exporter

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: APPLICATION_ID.

#### pubmed

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: pmid, language.

#### pubmed_central

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: ID of the data instance.

#### ubuntu_irc

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: channel, month.

#### uspto

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: bibliographic_information, source_file, abstract, classifications, inventors.

#### github

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: repo_name, stars, repo_language, file_name, mime_type.

</details>

### Data Splits

The "all" configuration is composed of 3 splits: train, validation and test.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

This dataset was primarily curated by Leo Gao and Stella Biderman, with assistance from other authors of the Pile paper.

### Licensing Information

Please refer to the specific license depending on the subset you use:

- PubMed Central: [MIT License](https://github.com/EleutherAI/pile-pubmedcentral/blob/master/LICENSE)

### Citation Information

```
@article{gao2020pile,
  title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}
@article{biderman2022datasheet,
  title={Datasheet for the pile},
  author={Biderman, Stella and Bicheno, Kieran and Gao, Leo},
  journal={arXiv preprint arXiv:2201.07311},
  year={2022}
}
```

### Contributions

Thanks to [@github-username](https://github.com/<github-username>) for adding this dataset.

Provider:

EleutherAI

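Summary of Original Information

Dataset Overview

Dataset Name

Name: The Pile
Alias: the Pile

Basic Information

Language: English (EN)
License: other
Multilinguality: monolingual
Size: 100B<n<1T
Source datasets: original
Task categories:
- text generation
- fill-mask
Task IDs:
- language modeling
- masked language modeling
Papers with Code ID: the-pile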

Dataset Structure

Data Instances

Common fields:
- text (str): Text content.
- meta (dict): Metadata specific to the data instance, such as pile_set_name (the name of the subset).
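
As a minimal sketch of how the `pile_set_name` key could be used, the snippet below tallies records per subset. It assumes records shaped like the "all" configuration, where `meta` is a dict. The function name `count_by_subset`, the `'unknown'` bucket, and the `Example-Subset` name are illustrative assumptions, not part of the dataset.

```python
from collections import Counter
from typing import Iterable


def count_by_subset(records: Iterable[dict]) -> Counter:
    """Count records per meta['pile_set_name'].

    Records whose `meta` is not a dict, or lacks the key, are counted
    under the hypothetical bucket 'unknown'.
    """
    counts = Counter()
    for record in records:
        meta = record.get("meta")
        if isinstance(meta, dict):
            name = meta.get("pile_set_name", "unknown")
        else:
            name = "unknown"
        counts[name] += 1
    return counts


# Hypothetical in-memory records; only 'Pile-CC' appears on the dataset card.
sample_records = [
    {"meta": {"pile_set_name": "Pile-CC"}, "text": "It is done, and submitted."},
    {"meta": {"pile_set_name": "Pile-CC"}, "text": "Another web page."},
    {"meta": {"pile_set_name": "Example-Subset"}, "text": "Some other text."},
]
subset_counts = count_by_subset(sample_records)
print(subset_counts)
```

Because the function accepts any iterable, it could in principle be fed a streamed reader instead of a list, so the full corpus would not need to fit in memory.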

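Data Field Details

enron_emails:
- text (str): Text content.
- meta (str): Metadata.
europarl:
- text (str): Text content.
- meta (str): Metadata, containing the language.
free_law:
- text (str): Text content.
- meta (str): Metadata, containing the case jurisdiction, case ID, and creation date.
hacker_news:
- text (str): Text content.
- meta (str): Metadata, containing the ID.
nih_exporter:
- text (str): Text content.
- meta (str): Metadata, containing APPLICATION_ID.
pubmed:
- text (str): Text content.
- meta (str): Metadata, containing pmid and language.
pubmed_central:
- text (str): Text content.
- meta (str): Metadata, containing the ID of the data instance.
ubuntu_irc:
- text (str): Text content.
- meta (str): Metadata, containing the channel and month.
uspto:
- text (str): Text content.
- meta (str): Metadata, containing bibliographic information, source file, abstract, classifications, and inventors.
github:
- text (str): Text content.
- meta (str): Metadata, containing the repository name, star count, repository language, file name, and MIME type.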
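
In the component subsets, `meta` is typed as a string, and the sample instances on the dataset card show it holding a Python-style dict literal such as `"{'language': 'cs'}"`. The sketch below shows one way such a string might be turned back into a dict with the standard-library `ast.literal_eval`. The helper name `parse_meta` and the `raw_meta` fallback key are hypothetical choices for illustration, and the approach assumes every string follows the literal style seen in those samples.

```python
import ast


def parse_meta(meta):
    """Return `meta` as a dict.

    Dicts pass through unchanged. Strings are parsed as Python literals.
    Anything that cannot be parsed into a dict is wrapped under the
    hypothetical key 'raw_meta' so that no information is dropped.
    """
    if isinstance(meta, dict):
        return meta
    if not isinstance(meta, str):
        return {"raw_meta": meta}
    try:
        # strip() guards against leading whitespace in the stored string
        parsed = ast.literal_eval(meta.strip())
    except (ValueError, SyntaxError):
        return {"raw_meta": meta}
    return parsed if isinstance(parsed, dict) else {"raw_meta": meta}


europarl_meta = parse_meta("{'language': 'cs'}")
nih_meta = parse_meta(" {'APPLICATION_ID': 100065}")
broken_meta = parse_meta("{id': 'X'}")  # deliberately malformed example
print(europarl_meta, nih_meta, broken_meta)
```

`ast.literal_eval` only evaluates literals, so unlike `eval` it does not execute arbitrary code found in the string.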

Data Splits

Splits: train, validation, test

Dataset Creation

Dataset Curators

Primary curators: Leo Gao, Stella Biderman

Licensing Information

PubMed Central: MIT License

Citation Information

@article{gao2020pile, title={The {P}ile: An 800{GB} dataset of diverse text for language modeling}, author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others}, journal={arXiv preprint arXiv:2101.00027}, year={2020} }

@article{biderman2022datasheet, title={Datasheet for the pile}, author={Biderman, Stella and Bicheno, Kieran and Gao, Leo}, journal={arXiv preprint arXiv:2201.07311}, year={2022} }

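Collected Summary

Dataset Introduction

Construction

The Pile is a diverse, open-source language-modeling dataset built by the EleutherAI team by combining 22 smaller, high-quality datasets. These components cover many text types, including email, news, scientific papers, legal documents, and technical documentation. It was built to provide a large, diverse text corpus for training language models and for other natural language processing tasks.

Usage

Using The Pile is fairly straightforward. Users can download the dataset directly from the Hugging Face platform and then process and analyze it with a programming language such as Python. Because The Pile contains many text types, users can select the subsets that suit their needs. The Pile also provides detailed documentation and example code to help users get started quickly.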
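
Selecting a subset can be sketched as a simple filter. The example below assumes the records are available as JSON Lines (one JSON object per line) in the shape of the "all" configuration, with `meta` as a dict carrying `pile_set_name`. The function name `iter_subset` and the sample lines are hypothetical, and an in-memory `io.StringIO` stands in for a real downloaded file.

```python
import io
import json


def iter_subset(lines, subset_name):
    """Yield the `text` of each JSON Lines record in the requested subset.

    `lines` is any iterable of strings, for example an open text file.
    Records whose `meta` is not a dict are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        meta = record.get("meta")
        if isinstance(meta, dict) and meta.get("pile_set_name") == subset_name:
            yield record["text"]


# Hypothetical stand-in for a downloaded file.
sample_file = io.StringIO(
    '{"meta": {"pile_set_name": "Pile-CC"}, "text": "first"}\n'
    '{"meta": {"pile_set_name": "Example-Subset"}, "text": "second"}\n'
    '{"meta": {"pile_set_name": "Pile-CC"}, "text": "third"}\n'
)
pile_cc_texts = list(iter_subset(sample_file, "Pile-CC"))
print(pile_cc_texts)
```

Writing the filter as a generator means only one record is held in memory at a time, which matters for a corpus of this size.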

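Background and Challenges

Background Overview

The performance of a language model, a key technology in natural language processing, depends heavily on the quality and diversity of its training data. To address this, the EleutherAI team created The Pile in 2020 to supply language models with rich and varied text. The dataset consists of 22 smaller, high-quality component datasets totaling 825 GiB. The Pile has been important for advancing language-model research and applications: it gives researchers a high-quality data resource that helps improve the comprehension and generation abilities of language models.

Current Challenges

Although The Pile is valuable for language modeling, it also poses challenges in practice. First, its sheer size places high demands on compute and storage. Second, the same diversity that makes it useful can introduce noise and inconsistency during training, affecting model accuracy and stability. Third, because the data comes from many sources, it may contain bias and discriminatory content, which should be mitigated through appropriate processing and adjustment.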
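
For storage planning, it may help to note that the quoted 825 GiB uses binary units, while disk and cloud quotas are often given in decimal gigabytes. The arithmetic below is a small sketch of the conversion. The helper name `gib_to_gb` is an illustrative assumption, and the result covers only the figure quoted on the card, not any working space needed for decompression or preprocessing.

```python
GIB = 2**30  # bytes in one gibibyte (binary unit)
GB = 10**9   # bytes in one gigabyte (decimal unit)


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in GiB to decimal GB."""
    return size_gib * GIB / GB


pile_size_gb = gib_to_gb(825)
print(f"{pile_size_gb:.1f} GB")  # prints "885.8 GB"
```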

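Common Scenarios

Classic Use Cases

As a training corpus for language models, The Pile is widely used in natural language processing (NLP) to supply machine learning models with diverse text. Its varied sources, including email, news reports, scientific literature, and code, allow it to support several language-modeling tasks, such as text generation and fill-mask.

Academic Problems Addressed

The Pile addresses the NLP community's need for large-scale, high-quality text data. It gives researchers a rich corpus on which models can be trained and evaluated across varied kinds of text, which helps improve their language understanding and generation. Although the dataset is labeled as English, it also contains some text in other languages (the europarl sample on the dataset card is in Czech), which may help research on cross-lingual models.

Practical Applications

The Pile is valuable across a range of practical applications. In customer-service chat systems, for example, training on The Pile can help a model better understand user questions and give more accurate answers. In text-generation tasks, it can be used to train models that produce articles, reports, and similar documents. It can also be used for tasks such as sentiment analysis and information extraction, supporting both research and applications in natural language processing.

Recent Research on the Dataset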