EleutherAI/pile

Source: hugging_face · Updated: 2023-05-03 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/EleutherAI/pile
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: other
multilinguality:
- monolingual
pretty_name: the Pile
size_categories:
- 100B<n<1T
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: the-pile
---

# Dataset Card for The Pile

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

This dataset card is a work in progress. Please also see [our datasheet](https://arxiv.org/abs/2201.07311) for more detailed info.

## Dataset Description

- **Homepage:** https://pile.eleuther.ai/
- **Repository:** https://github.com/EleutherAI/the-pile
- **Paper:** [The Pile: An 800GB Dataset of Diverse Text for Language Modeling](https://arxiv.org/abs/2101.00027)
- **Leaderboard:**
- **Point of Contact:** [EleutherAI](mailto:contact@eleuther.ai)
- **Datasheet:** [Datasheet for the Pile](https://arxiv.org/abs/2201.07311)

### Dataset Summary

The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

This dataset is in English (`EN`).

## Dataset Structure

### Data Instances

#### all
```
{
  'meta': {'pile_set_name': 'Pile-CC'},
  'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on...'
}
```

<details>
  <summary>Expand to see individual components</summary>

#### enron_emails
```
{
  'text': 'Name\t\t\tNew Title\t\t\t\tEffective Date\t\t\tMid Year promotion Yes/No\n\nFloyd, Jodie\t\tSr Cust Svc Rep (no change)\t\t7/16/01\t\t\t\tNo\n\nBuehler, Craig\t\tSr Mkt/Sup Analyst (no change)\t\t7/16/01\t\t\t\tNo\n\nWagoner, Mike\t\tTeam Advisor - Gas Control\t\t7/1/01\t\t\t\tNo\n\nClapper, Karen\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nGreaney, Chris\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nWilkens, Jerry\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nMinton, Kevin\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nCox, Don\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nHanagriff, Richard\tSr Accounting Control Spec\t\t8/1/01\t\t\t\tYes\n\n\nThanks,\nMS',
  'meta': "{}",
}
```

#### europarl
```
{
  'text': 'Uvádění biocidních přípravků na trh - Nový návrh revize týkající se biocidních přípravků (rozprava) \nPředsedající\nDalším bodem je společná rozprava o následujících tématech:\nzpráva paní Sârbuové za Výbor pro životní prostředí, veřejné zdraví a bezpečnost potravin o návrhu...',
  'meta': "{'language': 'cs'}",
}
```

#### free_law
```
{
  'meta': "{'case_jurisdiction': 'scotus.tar.gz', 'case_ID': '110921.json','date_created': '2010-04-28T17:12:49Z'}",
  'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued...'
}
```

#### hacker_news
```
{
  'text': "\nChina Deserves Donald Trump - rm2889\nhttps://www.nytimes.com/2019/05/21/opinion/china-trump-trade.html\n======\nNotPaidToPost\n> so he’d be wise to curb his nationalistic “no-one-tells-China-what-to-do”\n> bluster\n\nThis comment highlights both ignorance of Chinese history and continuing\nAmerican arrogance.\n\nChina has been painfully dictated what to do during the last 200 years. This\nhas had a profound effect on the country and has led to the collapse of\nimperial rule and the drive to 'rejuvenate'...",
  'meta': "{'id': '19979654'}",
}
```

#### nih_exporter
```
{
  'text': "The National Domestic Violence Hotline (NDVH) and the National Dating Abuse Helpline (NDAH), which are supported by the Division of Family Violence Prevention and Services within the Family and Youth Services Bureau, serve as critical partners in the intervention, prevention, and resource assistance efforts of the network of family violence, domestic violence, and dating violence service providers. They provide crisis intervention and support services; information about resources on domestic...",
  'meta': " {'APPLICATION_ID': 100065}",
}
```

#### pubmed
```
{
  'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that included 4,021 children under 5 with acute respiratory infections (ARI) and reported the prevalence of hypoxaemia. Out-patient children and those with a clinical diagnosis of upper ARI had a low risk of hypoxaemia (pooled estimate of 6% to 9%). The prevalence increased to 31% and to 43% in patients in emergency departments and in cases with clinical pneumonia, respectively, and it was even higher among hospitalised children (47%) and in those with radiographically confirmed pneumonia (72%). The cumulated data also suggest that hypoxaemia is more frequent in children living at high altitude. Three papers reported an association between hypoxaemia and death, with relative risks varying between 1.4 and 4.6. Papers describing predictors of hypoxaemia have focused on clinical signs for detecting hypoxaemia rather than on identifying risk factors for developing this complication. Hypoxaemia is a common and potentially lethal complication of ALRI in children under 5, particularly among those with severe disease and those living at high altitude. Given the observed high prevalence of hypoxaemia and its likely association with increased mortality, efforts should be made to improve the detection of hypoxaemia and to provide oxygen earlier to more children with severe ALRI.'
}
```

#### pubmed_central
```
{
  'meta': "{'id': 'PMC5595690'}",
  'text': 'Introduction {#acel12642-sec-0001}\n============\n\nAlzheimer\\\'s disease (AD), the most common cause of...'
}
```

#### ubuntu_irc
```
{
  'text': "#ubuntu 2004-07-05\n* Window 3\n* \tServer: [0] <None>\n* \tScreen: 0x817e90c\n* \tGeometry Info: [0 11 0 11 11 11] \n* \tCO, LI are [94 49] \n* \tCurrent channel: #ubuntu\n* \tQuery User: <None> \n*\tPrompt: <None>\n* \tSecond status line is OFF\n* \tSplit line is ON triple is OFF\n* \tLogging is ON\n* \tLogfile is irclogs/ubuntu.log\n* \tNotification is OFF\n* \tHold mode is OFF\n* \tWindow level is NONE\n* \tLastlog level is ALL\n* \tNotify level is ALL\n<mdz> lifeless: using tla effectively for all packages in Warty requ...",
  'meta': "{'channel': 'ubuntu', 'month': 7}"
}
```

#### uspto
```
{
  'text': "1. Field of the Invention\nIn an extensive plant breeding program, Grant Merrill, originator and now deceased, originated a large number of new and distinct varieties of fruit trees, and which included the herein-claimed variety of peach tree. Such plant breeding program was undertaken in originator's experimental orchard located near Exeter, Tulare County, Calif.\n2. Prior Varieties\nAmong the existent varieties of peach trees which were known to originator, particular reference is made to Gemfree (U.S. Plant Pat. No. 1,409) and June Lady (U.S. Plant Pat. No. 3,022) hereinafter mentioned for the purpose of comparison.",
  'meta': "{'bibliographic_information': {'Patent Number': 'PP0049700', 'Series Code': '6', 'Application Number': '2845415', 'Application Type': '6', 'Art unit': '337', 'Application Filing Date': '19810720', 'Title of Invention': 'Peach tree (A3-10)', 'Issue Date': '19830104', 'Number of Claims': '1', 'Exemplary Claim Number(s)': '1', 'Primary Examiner': 'Bagwill; Robert E.', 'Number of Drawing Sheets': '1', 'Number of figures': '1'}, 'source_file': 'https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/1983/pftaps19830104_wk01.zip', 'abstract': 'A peach tree which is large, vigorous, and spreading; foliated with large, lanceolate leaves having a finely serrate margin, a petiole of medium length and thickness, and medium size, reniform glands; blooms from medium size, conic, plump, pubescent buds; the flowers, medium in blooming period compared with other varieties, being of medium size, and pink; and is a regular and very productive bearer of medium but variable size, round truncate, clingstone fruit having yellow skin substantially overspread with red, yellow flesh mottled with red adjacent the skin, and an amber stone.', 'classifications': [{'OCL': ['Plt', '43'], 'EDF': ['3'], 'ICL': ['A01H', '503'], 'FSC': ['Plt'], 'FSS': ['43']}], 'inventors': [{'inventor name': 'Merrill, deceased; Grant', 'Street': '325 Breese Ave.', 'City': 'late of Red Bluff', 'State': 'CA'}, {'inventor name': 'Merrill, executrix; by Lucile B.', 'Street': '325 Breese Ave.', 'City': 'Red Bluff', 'State': 'CA', 'Zip code': '96080'}]}"
}
```

#### github
```
{
  'text': "/* filesystem.c\n * Filesystem utility routines\n *\n * Wireshark - Network traffic analyzer\n * By Gerald Combs <gerald@wireshark.org>\n * Copyright 1998 Gerald Combs\n *\n * SPDX-License-Identifier: GPL-2.0-or-later\n */\n\n#include <config.h>\n\n#include <stdio.h>\n#include <stdlib.h>\n#include <string.h>\n#include <errno.h>\n\n#include <glib.h>...",
  'meta': "{'repo_name': 'wireshark/wireshark', 'stars': '2789', 'repo_language': 'C', 'file_name': 'packet-mpeg-audio-template.c', 'mime_type': 'text/x-c'}"
}
```

</details>

### Data Fields

#### all

- `text` (str): Text.
- `meta` (dict): Metadata of the data instance with keys:
  - pile_set_name: Name of the subset.

<details>
  <summary>Expand to see individual components</summary>

#### enron_emails

- `text` (str): Text.
- `meta` (str): Metadata of the data instance.

#### europarl

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: language.

#### free_law

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: case_ID, case_jurisdiction, date_created.

#### hacker_news

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: id.

#### nih_exporter

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: APPLICATION_ID.

#### pubmed

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: pmid, language.

#### pubmed_central

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: ID of the data instance.

#### ubuntu_irc

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: channel, month.

#### uspto

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: bibliographic_information, source_file, abstract, classifications, inventors.

#### github

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: repo_name, stars, repo_language, file_name, mime_type.

</details>

### Data Splits

The "all" configuration is composed of 3 splits: train, validation and test.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

This dataset was primarily curated by Leo Gao and Stella Biderman, with assistance from other authors of the Pile paper.

### Licensing Information

Please refer to the specific license depending on the subset you use:
- PubMed Central: [MIT License](https://github.com/EleutherAI/pile-pubmedcentral/blob/master/LICENSE)

### Citation Information

```
@article{gao2020pile,
  title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}
@article{biderman2022datasheet,
  title={Datasheet for the pile},
  author={Biderman, Stella and Bicheno, Kieran and Gao, Leo},
  journal={arXiv preprint arXiv:2201.07311},
  year={2022}
}
```

### Contributions

Thanks to [@github-username](https://github.com/<github-username>) for adding this dataset.
Provider:
EleutherAI
Summary of original information

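Dataset overview

Dataset name

  • Name: The Pile
  • Alias: the Pile

Basic dataset information

  • Language: English (EN)
  • License: other
  • Multilinguality: monolingual
  • Size: 100B<n<1T
  • Source datasets: original
  • Task categories:
    • Text generation
    • Fill-mask
  • Task IDs:
    • Language modeling
    • Masked language modeling
  • Papers with Code ID: the-pile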

Dataset structure

Data instances

  • Common fields:
    • text (str): The text content.
    • meta (dict): Metadata specific to the data instance, such as pile_set_name (the name of the subset).

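Data field details

  • enron_emails:
    • text (str): The text content.
    • meta (str): Metadata.
  • europarl:
    • text (str): The text content.
    • meta (str): Metadata, containing the language.
  • free_law:
    • text (str): The text content.
    • meta (str): Metadata, containing the case jurisdiction, case ID, and creation date.
  • hacker_news:
    • text (str): The text content.
    • meta (str): Metadata, containing the ID.
  • nih_exporter:
    • text (str): The text content.
    • meta (str): Metadata, containing APPLICATION_ID.
  • pubmed:
    • text (str): The text content.
    • meta (str): Metadata, containing pmid and language.
  • pubmed_central:
    • text (str): The text content.
    • meta (str): Metadata, containing the ID of the data instance.
  • ubuntu_irc:
    • text (str): The text content.
    • meta (str): Metadata, containing the channel and month.
  • uspto:
    • text (str): The text content.
    • meta (str): Metadata, containing bibliographic information, source file, abstract, classifications, and inventors.
  • github:
    • text (str): The text content.
    • meta (str): Metadata, containing the repository name, star count, repository language, file name, and MIME type.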
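
In the individual components listed above, `meta` is a string that holds the textual form of a Python dict, while the combined configuration stores it as a real dict. The following is a minimal sketch of normalizing both cases; the helper name `parse_meta` is hypothetical and not part of the dataset or any library, and returning `None` for unparseable input is an assumption made for illustration.

```python
import ast
from typing import Any, Optional


def parse_meta(meta: Any) -> Optional[dict]:
    """Return `meta` as a dict, or None when it cannot be parsed.

    Hypothetical helper: accepts either a dict (combined configuration)
    or a string such as "{'id': '19979654'}" (individual components).
    """
    if isinstance(meta, dict):
        return meta
    if not isinstance(meta, str):
        return None
    try:
        # literal_eval only evaluates Python literals, so no code is executed.
        # strip() handles values with stray leading whitespace.
        value = ast.literal_eval(meta.strip())
    except (ValueError, SyntaxError, TypeError):
        return None
    return value if isinstance(value, dict) else None


channel_info = parse_meta("{'channel': 'ubuntu', 'month': 7}")
grant_info = parse_meta(" {'APPLICATION_ID': 100065}")
```

Using `ast.literal_eval` rather than `json.loads` is a deliberate choice here, because the sample strings use single quotes and are therefore not valid JSON.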

Data splits

  • Splits: train, validation, test

Dataset creation

Dataset curators

  • Primary curators: Leo Gao, Stella Biderman

Licensing information

  • PubMed Central: MIT License

Citation information

@article{gao2020pile, title={The {P}ile: An 800{GB} dataset of diverse text for language modeling}, author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others}, journal={arXiv preprint arXiv:2101.00027}, year={2020} }

@article{biderman2022datasheet, title={Datasheet for the pile}, author={Biderman, Stella and Bicheno, Kieran and Gao, Leo}, journal={arXiv preprint arXiv:2201.07311}, year={2022} }

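AI-collected summary
Dataset introduction
Construction
The Pile is a diverse, open-source language modeling dataset built by the EleutherAI team by combining 22 smaller, high-quality datasets. These datasets cover a range of text types, including emails, news, scientific papers, legal documents, and technical documents. The dataset was built to provide a large-scale, diverse text corpus for language model training and other natural language processing tasks.
Usage
Using the Pile is relatively straightforward. Users can download the dataset from the Hugging Face platform and then process and analyze it with a programming language such as Python. Because the Pile contains many text types, users can select the subsets that suit their needs. Further documentation is available in the accompanying paper (arXiv:2101.00027), the datasheet (arXiv:2201.07311), and the project repository (https://github.com/EleutherAI/the-pile).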
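
Selecting a subset can be sketched as a streaming filter on the `pile_set_name` key of `meta`. The example below is a minimal sketch under stated assumptions: it assumes each record is stored as one JSON object per line, which the dataset card does not state, and the function `iter_subset`, the in-memory `SAMPLE` data, and the subset name `Github` are all hypothetical stand-ins for a real downloaded file.

```python
import io
import json
from typing import Iterable, Iterator


def iter_subset(lines: Iterable[str], subset: str) -> Iterator[str]:
    """Yield the `text` of each record whose meta.pile_set_name equals `subset`.

    Records are read one at a time, so the input never has to fit in memory.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        meta = record.get("meta") or {}
        if isinstance(meta, dict) and meta.get("pile_set_name") == subset:
            yield record["text"]


# Hypothetical in-memory sample standing in for a downloaded file.
SAMPLE = io.StringIO(
    '{"text": "It is done, and submitted.", "meta": {"pile_set_name": "Pile-CC"}}\n'
    '{"text": "/* filesystem.c */", "meta": {"pile_set_name": "Github"}}\n'
    '{"text": "You can play on the web.", "meta": {"pile_set_name": "Pile-CC"}}\n'
)

pile_cc_texts = list(iter_subset(SAMPLE, "Pile-CC"))
```

Because `iter_subset` is a generator, the same function works on an open file handle in place of `SAMPLE`.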
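Background and challenges
Background overview
The performance of a language model, a key technology in natural language processing, depends heavily on the quality and diversity of its training data. To address this, the EleutherAI team created the Pile in 2020 to give language models rich and varied text. The dataset consists of 22 smaller, high-quality component datasets with a total size of 825 GiB. It gives researchers a high-quality data resource for improving the comprehension and generation abilities of language models.
Current challenges
Although the Pile is valuable for language modeling, it also poses challenges in practice. First, its sheer size places high demands on compute and storage. Second, its diversity can introduce noise and inconsistency during training, which may affect a model's accuracy and stability. Third, because the data comes from many sources, it may contain bias and discriminatory content, which calls for appropriate processing and adjustment to reduce the negative effects.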
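
For storage planning, the 825 GiB figure quoted in the background overview can be converted to bytes and decimal gigabytes. This is only a unit conversion; the variable names are illustrative, and whether the figure refers to compressed or uncompressed data is not stated here.

```python
GIB = 2**30  # bytes in one gibibyte (binary unit)
GB = 10**9   # bytes in one gigabyte (decimal unit)

size_bytes = 825 * GIB       # 885,837,004,800 bytes
size_gb = size_bytes / GB    # about 885.8 decimal gigabytes
```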
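Common scenarios
Classic use cases
As a training dataset for language models, the Pile is widely used in natural language processing (NLP) to supply machine learning models with diverse text. Its varied sources, including emails, news reports, scientific literature, and code, let it support language modeling tasks such as text generation and fill-mask.
Academic problems addressed
The Pile addresses the NLP community's need for large-scale, high-quality text datasets. It gives researchers a rich corpus on which models can be trained and evaluated across diverse kinds of text, which helps improve language understanding and generation. Although the dataset card labels the Pile as English and monolingual, some components contain other languages (the europarl sample in the card is Czech), so it offers limited material for cross-lingual research as well.
Practical applications
The Pile has broad practical value. In intelligent customer service systems, for example, models trained on it can better understand user questions and give more accurate answers. In text generation, it can be used to train models that produce articles and reports. It can also support tasks such as sentiment analysis and information extraction.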
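Recent research on the dataset
Latest research directions
EleutherAI's Pile is a large, diverse text corpus intended to provide rich training material for language modeling. Recent research focuses on how to use this diversity effectively to improve the quality and generalization of language models. Researchers are exploring how to train models on the Pile that understand and generate natural language better, particularly across different domains and styles. Other work looks at extracting useful linguistic features from the Pile to improve performance on specific tasks such as text generation and fill-mask. As natural language processing continues to develop, the Pile is expected to remain important for language model research and to open up further practical applications.
The content above was collected and summarized by AI.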