
Pile-OpenWebText2

ModelScope Community · Updated 2025-12-12 · Indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/OmniData/Pile-OpenWebText2
Resource overview:
---
displayName: Pile-OpenWebText2
labelTypes:
  - English Corpus
license:
  - MIT
mediaTypes:
  - Text
paperUrl: ""
publishDate: "2023-07-18"
publishUrl: https://pile.eleuther.ai/
publisher:
  - EleutherAI
tags: []
taskTypes:
  - Natural Language Generation
  - Language Modelling
---

# Dataset Introduction

## Overview

Pile-OpenWebText2 is part of EleutherAI's The Pile dataset. It is an enhanced version of the original OpenWebTextCorpus and serves as a diverse, open-source dataset for language modelling.

## Data Content

### Data Description

Pile-OpenWebText2 contains 56.8 GB of data.

### Data Example

```
{
  "id": "158625874",
  "source_id": "",
  "doc_id": "26047531",
  "data_type": "text",
  "data_source": "pile",
  "data_url": "enwiki-c4-pile-ccnews",
  "content": "COVID-19: 11 who returned to Bidar from Tablighi Jamaat event test positive, 1000 screened\n\nExpress News Service |\n\nPublished: 02nd April 2020 10:21 AM\n\nFor representational purposes\n\nBIDAR: Eleven people from Bidar who took part in the Tablighi Jamaat event held in Delhi tested positive for COVID-19 on Thursday.\n\nDeputy Commissioner of Bidar HR Mahadev told The New Indian Express that the Bidar district administration kept the 11 under home quarantine immediately after knowing that they had visited Delhi for the event. They would now be shifted to the isolation ward.\n\nMeanwhile, the Health and Family Welfare Department has screened 1000 people linked to the event.\n\n\"Based on inputs given by the police and central government, nearly 1000 people linked to the Tablighi Jamaat event in Delhi have been screened by the health department till Thursday morning. Out of them, six have been found symptomatic. Further, more than 200 swab samples have been drawn for them. Tests are going on in labs and out of nearly 100 preliminary test results, 11 from Bidar district are positive. Contact tracing and isolation work are already on,\" said Pankaj Kumar Pandey, commissioner of the health department.\n\n19 of those identified in the state as having attended the event are foreigners.\n\nDistrict Minister V Sommanna said a total of 75 people from Mysuru urban and rural participated in the event, 45 of whom have been quarantined. 17 have failed to return to their hometowns and the remaining are still being traced.\n",
  "remark": {
    "pile_set_name": "OpenWebText2"
  },
  "sub_path": "openwebtext2/train"
}
```

## Citation

```
@misc{conghui2022opendatalab,
  title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
  author={Conghui He and Wei Li and Zhenjiang Jin and Bin Wang and Chao Xu and Dahua Lin},
  journal={https://opendatalab.com/},
  year={2022}
}
```

## Download dataset

:modelscope-code[]{type="git"}
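
Once the files are downloaded, records shaped like the sample in the data example can be read with a few lines of Python. The sketch below assumes the data is stored as JSON Lines (one record per line), which the dataset card does not state; `iter_records` and `is_openwebtext2` are hypothetical helper names introduced only for illustration, and the sample line is a shortened stand-in for the record shown in the card.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line.

    Assumption: the downloaded files use a JSON Lines layout.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def is_openwebtext2(record: dict) -> bool:
    """Check the subset label stored under remark.pile_set_name."""
    return record.get("remark", {}).get("pile_set_name") == "OpenWebText2"


# Shortened stand-in for the sample record; the real "content" is much longer.
sample_line = json.dumps({
    "id": "158625874",
    "doc_id": "26047531",
    "data_type": "text",
    "data_source": "pile",
    "content": "COVID-19: 11 who returned to Bidar ...",
    "remark": {"pile_set_name": "OpenWebText2"},
    "sub_path": "openwebtext2/train",
})

# Blank lines are skipped; only records labelled OpenWebText2 are kept.
records = list(iter_records([sample_line, ""]))
texts = [r["content"] for r in records if is_openwebtext2(r)]
print(len(records), len(texts))
```

To read from disk, an open file handle can be passed in place of the list, since `iter_records` accepts any iterable of strings. If the actual files turn out to use a different container format, only `iter_records` would need to change.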

Provider: maas
Created: 2024-07-17