
EleutherAI/pile

Source: hugging_face · Updated 2023-05-03 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/EleutherAI/pile
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license: other
multilinguality:
- monolingual
pretty_name: the Pile
size_categories:
- 100B<n<1T
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: the-pile
---

# Dataset Card for The Pile

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

This dataset card is a work in progress. Please also see [our datasheet](https://arxiv.org/abs/2201.07311) for more detailed info.

## Dataset Description

- **Homepage:** https://pile.eleuther.ai/
- **Repository:** https://github.com/EleutherAI/the-pile
- **Paper:** [The Pile: An 800GB Dataset of Diverse Text for Language Modeling](https://arxiv.org/abs/2101.00027)
- **Leaderboard:**
- **Point of Contact:** [EleutherAI](mailto:contact@eleuther.ai)
- **Datasheet:** [Datasheet for the Pile](https://arxiv.org/abs/2201.07311)

### Dataset Summary

The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

This dataset is in English (`EN`).

## Dataset Structure

### Data Instances

#### all

```
{
  'meta': {'pile_set_name': 'Pile-CC'},
  'text': 'It is done, and submitted. You can play “Survival of the Tastiest” on Android, and on the web. Playing on...'
}
```

<details>
<summary>Expand to see individual components</summary>

#### enron_emails

```
{
  'text': 'Name\t\t\tNew Title\t\t\t\tEffective Date\t\t\tMid Year promotion Yes/No\n\nFloyd, Jodie\t\tSr Cust Svc Rep (no change)\t\t7/16/01\t\t\t\tNo\n\nBuehler, Craig\t\tSr Mkt/Sup Analyst (no change)\t\t7/16/01\t\t\t\tNo\n\nWagoner, Mike\t\tTeam Advisor - Gas Control\t\t7/1/01\t\t\t\tNo\n\nClapper, Karen\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nGreaney, Chris\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nWilkens, Jerry\t\tSr Cust Svc Rep\t\t\t8/1/01\t\t\t\tYes\n\nMinton, Kevin\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nCox, Don\t\tPipeline Controller\t\t\t8/1/01\t\t\t\tYes\n\nHanagriff, Richard\tSr Accounting Control Spec\t\t8/1/01\t\t\t\tYes\n\n\nThanks,\nMS',
  'meta': "{}",
}
```

#### europarl

```
{
  'text': 'Uvádění biocidních přípravků na trh - Nový návrh revize týkající se biocidních přípravků (rozprava) \nPředsedající\nDalším bodem je společná rozprava o následujících tématech:\nzpráva paní Sârbuové za Výbor pro životní prostředí, veřejné zdraví a bezpečnost potravin o návrhu...',
  'meta': "{'language': 'cs'}",
}
```

#### free_law

```
{
  'meta': "{'case_jurisdiction': 'scotus.tar.gz', 'case_ID': '110921.json','date_created': '2010-04-28T17:12:49Z'}",
  'text': '\n461 U.S. 238 (1983)\nOLIM ET AL.\nv.\nWAKINEKONA\nNo. 81-1581.\nSupreme Court of United States.\nArgued...'
}
```

#### hacker_news

```
{
  'text': "\nChina Deserves Donald Trump - rm2889\nhttps://www.nytimes.com/2019/05/21/opinion/china-trump-trade.html\n======\nNotPaidToPost\n> so he’d be wise to curb his nationalistic “no-one-tells-China-what-to-do”\n> bluster\n\nThis comment highlights both ignorance of Chinese history and continuing\nAmerican arrogance.\n\nChina has been painfully dictated what to do during the last 200 years. This\nhas had a profound effect on the country and has led to the collapse of\nimperial rule and the drive to 'rejuvenate'...",
  'meta': "{'id': '19979654'}",
}
```

#### nih_exporter

```
{
  'text': "The National Domestic Violence Hotline (NDVH) and the National Dating Abuse Helpline (NDAH), which are supported by the Division of Family Violence Prevention and Services within the Family and Youth Services Bureau, serve as critical partners in the intervention, prevention, and resource assistance efforts of the network of family violence, domestic violence, and dating violence service providers. They provide crisis intervention and support services; information about resources on domestic...",
  'meta': " {'APPLICATION_ID': 100065}",
}
```

#### pubmed

```
{
  'meta': {'pmid': 11409574, 'language': 'eng'},
  'text': 'Epidemiology of hypoxaemia in children with acute lower respiratory infection.\nTo determine the prevalence of hypoxaemia in children aged under 5 years suffering acute lower respiratory infections (ALRI), the risk factors for hypoxaemia in children under 5 years of age with ALRI, and the association of hypoxaemia with an increased risk of dying in children of the same age. Systematic review of the published literature. Out-patient clinics, emergency departments and hospitalisation wards in 23 health centres from 10 countries. Cohort studies reporting the frequency of hypoxaemia in children under 5 years of age with ALRI, and the association between hypoxaemia and the risk of dying. Prevalence of hypoxaemia measured in children with ARI and relative risks for the association between the severity of illness and the frequency of hypoxaemia, and between hypoxaemia and the risk of dying. Seventeen published studies were found that included 4,021 children under 5 with acute respiratory infections (ARI) and reported the prevalence of hypoxaemia. Out-patient children and those with a clinical diagnosis of upper ARI had a low risk of hypoxaemia (pooled estimate of 6% to 9%). The prevalence increased to 31% and to 43% in patients in emergency departments and in cases with clinical pneumonia, respectively, and it was even higher among hospitalised children (47%) and in those with radiographically confirmed pneumonia (72%). The cumulated data also suggest that hypoxaemia is more frequent in children living at high altitude. Three papers reported an association between hypoxaemia and death, with relative risks varying between 1.4 and 4.6. Papers describing predictors of hypoxaemia have focused on clinical signs for detecting hypoxaemia rather than on identifying risk factors for developing this complication. Hypoxaemia is a common and potentially lethal complication of ALRI in children under 5, particularly among those with severe disease and those living at high altitude. Given the observed high prevalence of hypoxaemia and its likely association with increased mortality, efforts should be made to improve the detection of hypoxaemia and to provide oxygen earlier to more children with severe ALRI.'
}
```

#### pubmed_central

```
{
  'meta': "{'id': 'PMC5595690'}",
  'text': 'Introduction {#acel12642-sec-0001}\n============\n\nAlzheimer\\\'s disease (AD), the most common cause of...'
}
```

#### ubuntu_irc

```
{
  'text': "#ubuntu 2004-07-05\n* Window 3\n* \tServer: [0] <None>\n* \tScreen: 0x817e90c\n* \tGeometry Info: [0 11 0 11 11 11] \n* \tCO, LI are [94 49] \n* \tCurrent channel: #ubuntu\n* \tQuery User: <None> \n*\tPrompt: <None>\n* \tSecond status line is OFF\n* \tSplit line is ON triple is OFF\n* \tLogging is ON\n* \tLogfile is irclogs/ubuntu.log\n* \tNotification is OFF\n* \tHold mode is OFF\n* \tWindow level is NONE\n* \tLastlog level is ALL\n* \tNotify level is ALL\n<mdz> lifeless: using tla effectively for all packages in Warty requ...",
  'meta': "{'channel': 'ubuntu', 'month': 7}"
}
```

#### uspto

```
{
  'text': "1. Field of the Invention\nIn an extensive plant breeding program, Grant Merrill, originator and now deceased, originated a large number of new and distinct varieties of fruit trees, and which included the herein-claimed variety of peach tree. Such plant breeding program was undertaken in originator's experimental orchard located near Exeter, Tulare County, Calif.\n2. Prior Varieties\nAmong the existent varieties of peach trees which were known to originator, particular reference is made to Gemfree (U.S. Plant Pat. No. 1,409) and June Lady (U.S. Plant Pat. No. 3,022) hereinafter mentioned for the purpose of comparison.",
  'meta': "{'bibliographic_information': {'Patent Number': 'PP0049700', 'Series Code': '6', 'Application Number': '2845415', 'Application Type': '6', 'Art unit': '337', 'Application Filing Date': '19810720', 'Title of Invention': 'Peach tree (A3-10)', 'Issue Date': '19830104', 'Number of Claims': '1', 'Exemplary Claim Number(s)': '1', 'Primary Examiner': 'Bagwill; Robert E.', 'Number of Drawing Sheets': '1', 'Number of figures': '1'}, 'source_file': 'https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/1983/pftaps19830104_wk01.zip', 'abstract': 'A peach tree which is large, vigorous, and spreading; foliated with large, lanceolate leaves having a finely serrate margin, a petiole of medium length and thickness, and medium size, reniform glands; blooms from medium size, conic, plump, pubescent buds; the flowers, medium in blooming period compared with other varieties, being of medium size, and pink; and is a regular and very productive bearer of medium but variable size, round truncate, clingstone fruit having yellow skin substantially overspread with red, yellow flesh mottled with red adjacent the skin, and an amber stone.', 'classifications': [{'OCL': ['Plt', '43'], 'EDF': ['3'], 'ICL': ['A01H', '503'], 'FSC': ['Plt'], 'FSS': ['43']}], 'inventors': [{'inventor name': 'Merrill, deceased; Grant', 'Street': '325 Breese Ave.', 'City': 'late of Red Bluff', 'State': 'CA'}, {'inventor name': 'Merrill, executrix; by Lucile B.', 'Street': '325 Breese Ave.', 'City': 'Red Bluff', 'State': 'CA', 'Zip code': '96080'}]}"
}
```

#### github

```
{
  'text': "/* filesystem.c\n * Filesystem utility routines\n *\n * Wireshark - Network traffic analyzer\n * By Gerald Combs <gerald@wireshark.org>\n * Copyright 1998 Gerald Combs\n *\n * SPDX-License-Identifier: GPL-2.0-or-later\n */\n\n#include <config.h>\n\n#include <stdio.h>\n#include <stdlib.h>\n#include <string.h>\n#include <errno.h>\n\n#include <glib.h>...",
  'meta': "{'repo_name': 'wireshark/wireshark', 'stars': '2789', 'repo_language': 'C', 'file_name': 'packet-mpeg-audio-template.c', 'mime_type': 'text/x-c'}"
}
```

</details>

### Data Fields

#### all

- `text` (str): Text.
- `meta` (dict): Metadata of the data instance with keys:
  - pile_set_name: Name of the subset.

<details>
<summary>Expand to see individual components</summary>

#### enron_emails

- `text` (str): Text.
- `meta` (str): Metadata of the data instance.

#### europarl

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: language.

#### free_law

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: case_ID, case_jurisdiction, date_created.

#### hacker_news

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: id.

#### nih_exporter

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: APPLICATION_ID.

#### pubmed

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: pmid, language.

#### pubmed_central

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: ID of the data instance.

#### ubuntu_irc

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: channel, month.

#### uspto

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: bibliographic_information, source_file, abstract, classifications, inventors.

#### github

- `text` (str): Text.
- `meta` (str): Metadata of the data instance with: repo_name, stars, repo_language, file_name, mime_type.

</details>

### Data Splits

The "all" configuration is composed of 3 splits: train, validation and test.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

This dataset was primarily curated by Leo Gao and Stella Biderman, with assistance from other authors of the Pile paper.

### Licensing Information

Please refer to the specific license depending on the subset you use:

- PubMed Central: [MIT License](https://github.com/EleutherAI/pile-pubmedcentral/blob/master/LICENSE)

### Citation Information

```
@article{gao2020pile,
  title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}

@article{biderman2022datasheet,
  title={Datasheet for the pile},
  author={Biderman, Stella and Bicheno, Kieran and Gao, Leo},
  journal={arXiv preprint arXiv:2201.07311},
  year={2022}
}
```

### Contributions

Thanks to [@github-username](https://github.com/<github-username>) for adding this dataset.
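Provider:
EleutherAI
Summary of Original Information

Dataset Overview

Dataset Name

  • Name: The Pile
  • Alias: the Pile

Basic Information

  • Language: English (EN)
  • License: other
  • Multilinguality: monolingual
  • Size: 100B<n<1T
  • Source datasets: original
  • Task categories:
    • Text generation
    • Fill-mask
  • Task IDs:
    • Language modeling
    • Masked language modeling
  • Papers with Code ID: the-pile

Dataset Structure

Data Instances

  • Common fields:
    • text (str): The text.
    • meta (dict): Metadata specific to the data instance, such as pile_set_name (the name of the subset).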

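Data Field Details

  • enron_emails:
    • text (str): The text.
    • meta (str): Metadata.
  • europarl:
    • text (str): The text.
    • meta (str): Metadata containing the language.
  • free_law:
    • text (str): The text.
    • meta (str): Metadata containing case_jurisdiction, case_ID, and date_created.
  • hacker_news:
    • text (str): The text.
    • meta (str): Metadata containing id.
  • nih_exporter:
    • text (str): The text.
    • meta (str): Metadata containing APPLICATION_ID.
  • pubmed:
    • text (str): The text.
    • meta (str): Metadata containing pmid and language.
  • pubmed_central:
    • text (str): The text.
    • meta (str): Metadata containing the ID of the data instance.
  • ubuntu_irc:
    • text (str): The text.
    • meta (str): Metadata containing channel and month.
  • uspto:
    • text (str): The text.
    • meta (str): Metadata containing bibliographic_information, source_file, abstract, classifications, and inventors.
  • github:
    • text (str): The text.
    • meta (str): Metadata containing repo_name, stars, repo_language, file_name, and mime_type.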
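
The field list above gives `meta` as a `dict` in the "all" configuration but as a `str` in the individual components, where the string holds a Python-style dict literal such as `"{'language': 'cs'}"`. A minimal sketch of normalizing both forms follows. The helper name `parse_meta` is hypothetical and is not part of the dataset or any library.

```python
import ast


def parse_meta(meta):
    """Return a record's metadata as a dict.

    Hypothetical helper: accepts a dict (as in the "all" configuration)
    or the string form listed for the individual components.
    """
    if isinstance(meta, dict):
        return meta
    # literal_eval only evaluates Python literals, so it does not run
    # arbitrary code the way eval() would. strip() removes padding such
    # as the leading space in the nih_exporter sample.
    parsed = ast.literal_eval(meta.strip())
    if not isinstance(parsed, dict):
        raise ValueError(f"meta is not a dict literal: {meta!r}")
    return parsed


print(parse_meta("{'language': 'cs'}"))
print(parse_meta(" {'APPLICATION_ID': 100065}"))
print(parse_meta({'pile_set_name': 'Pile-CC'}))
```

A malformed string makes `ast.literal_eval` raise an exception, so callers processing many records may want to catch `ValueError` and `SyntaxError` and skip or log the offending record.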

Data Splits

  • Splits: train, validation, test

Dataset Creation

Dataset Curators

  • Primary curators: Leo Gao, Stella Biderman

Licensing Information

  • PubMed Central: MIT License


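AI-Compiled Summary
Dataset Introduction
Construction
The Pile is a diverse, open-source language-modeling dataset built by the EleutherAI team by combining 22 smaller, high-quality datasets. The components span many text types, including email, news, scientific papers, legal documents, and technical documentation. It was built to provide a large-scale, diverse text corpus for training language models and for other natural language processing tasks.
Usage
Using The Pile is fairly straightforward. Users can download the dataset from the Hugging Face platform and then process and analyze it with Python or another programming language. Because The Pile contains many text types, users can select the subsets that suit their needs. According to this summary, the dataset also comes with detailed documentation and example code to help users get started.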
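
Selecting a subset can be sketched as a filter on the `pile_set_name` key that the dataset card lists under `meta` for the "all" configuration. This is a minimal illustration that assumes the records are already available as Python dicts. The function name `select_subset`, the in-memory sample, and the subset name `Example-Subset` are hypothetical. Only `Pile-CC` appears in the card.

```python
from typing import Dict, Iterable, Iterator


def select_subset(records: Iterable[Dict], subset_name: str) -> Iterator[Dict]:
    """Lazily yield records whose meta['pile_set_name'] equals subset_name."""
    for record in records:
        meta = record.get("meta")
        # Records without dict-shaped metadata are skipped rather than guessed at.
        if isinstance(meta, dict) and meta.get("pile_set_name") == subset_name:
            yield record


# Hypothetical in-memory sample shaped like the "all" configuration.
sample = [
    {"meta": {"pile_set_name": "Pile-CC"}, "text": "It is done, and submitted."},
    {"meta": {"pile_set_name": "Example-Subset"}, "text": "Some other text."},
]

for record in select_subset(sample, "Pile-CC"):
    print(record["text"])
```

Because the function is a generator, it works equally well on a list or on a lazy stream of records, so the full dataset never has to be held in memory at once.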
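Background and Challenges
Background
The performance of language models, a core technology in natural language processing, depends heavily on the quality and diversity of their training data. To address this, the EleutherAI team created The Pile in 2020 to give language models rich and varied text data. The dataset consists of 22 smaller, high-quality sub-datasets with a total size of 825 GiB. It gives researchers a high-quality data resource for language-model research and applications and can help improve models' language understanding and generation.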
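
The card gives the size as 825 GiB, a binary unit, while the paper title says 800GB. As a small worked conversion, the snippet below expresses 825 GiB in bytes and in decimal gigabytes. It only illustrates the unit arithmetic and makes no claim about how the figure in the title was derived.

```python
GIB = 1024 ** 3  # bytes in one gibibyte (binary unit)
GB = 1000 ** 3   # bytes in one gigabyte (decimal unit)


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in GiB to decimal GB."""
    return size_gib * GIB / GB


pile_bytes = 825 * GIB
print(pile_bytes)                 # 885837004800
print(round(gib_to_gb(825), 1))   # 885.8
```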
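Current Challenges
Despite its value for language modeling, The Pile poses several practical challenges. First, its sheer size places high demands on compute and storage. Second, its diversity can introduce noise and inconsistency during training, which may affect model accuracy and stability. Third, because the data comes from many sources, it may contain bias and discriminatory content that needs to be mitigated through appropriate processing.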
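
One common way to cope with a corpus too large for memory is to process it one record at a time. The sketch below assumes, purely for illustration, that records are stored as JSON Lines (one JSON object per line). This page does not state the storage format, and the names `iter_jsonl` and `count_chars` are hypothetical.

```python
import io
import json
from typing import Dict, IO, Iterable, Iterator


def iter_jsonl(stream: IO[str]) -> Iterator[Dict]:
    """Yield one parsed record per non-empty line without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_chars(records: Iterable[Dict]) -> int:
    """Sum the length of the 'text' field over all records."""
    return sum(len(record["text"]) for record in records)


# Small in-memory stand-in for a large file (assumed format, not real data).
demo = io.StringIO(
    '{"text": "first document", "meta": {"pile_set_name": "Pile-CC"}}\n'
    '{"text": "second", "meta": {"pile_set_name": "Pile-CC"}}\n'
)
total = count_chars(iter_jsonl(demo))
print(total)
```

The same generator could wrap a real file handle opened with `open(path, encoding="utf-8")`. Memory use then stays roughly constant regardless of file size, because only one line is parsed at a time.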
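Common Use Cases
Classic Use Cases
As a training corpus for language models, The Pile is widely used in natural language processing (NLP) to supply machine learning models with diverse text. Its varied sources, including email, news reports, scientific literature, and code, let it support language modeling tasks such as text generation and fill-mask.
Academic Problems Addressed
The Pile addresses the NLP community's need for large-scale, high-quality text datasets. It gives researchers a rich corpus on which models can be trained and evaluated across diverse kinds of text, which can improve their language understanding and generation. Although the dataset card labels The Pile as monolingual English, some components contain other languages (the europarl sample in the card is Czech), which may be useful for cross-lingual research.
Practical Applications
The Pile is broadly useful in practice. In intelligent customer-service systems, for example, it can help models better understand user questions and give more accurate answers. In text generation, it can be used to produce articles and reports. It can also support tasks such as sentiment analysis and information extraction.
Recent Research
Latest Research Directions
EleutherAI's Pile is a large, diverse text corpus intended to provide rich training material for language modeling. Recent research focuses on how to exploit that diversity to improve the quality and generalization of language models. Researchers are exploring how to train models on The Pile that understand and generate natural language better, especially across different domains and styles. Other work looks at extracting useful linguistic features from The Pile to improve performance on specific tasks such as text generation and fill-mask. As natural language processing continues to develop, The Pile is expected to keep playing an important role in language-model research and to open up further practical applications.
The content above was collected and summarized by AI.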