
CarperAI/pile-v2-small-filtered

Source: Hugging Face | Updated: 2022-12-06 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/CarperAI/pile-v2-small-filtered
Description:
---
annotations_creators: []
language_creators:
- crowdsourced
language: ["en","code"]
multilinguality:
- multilingual
size_categories:
- unknown
source_datasets: []
task_categories:
- text-generation
task_ids:
- language-modeling
---

## Dataset Description

A small subset of the `pile-v2` dataset: each subset contains 1,000 random samples drawn from the corresponding original dataset. In total, the dataset holds 255MB of text (code and English).

## Languages

The dataset contains technical text on programming languages as well as natural-language text, in the following subsets:

- Bible
- TED2020
- PileOfLaw
- StackExchange
- GithubIssues
- Opensubtitles
- USPTO
- S2ORC
- DevDocs
- CodePileReddit2022
- USENET
- GNOME
- ASFPublicMail
- PileV2Reddit2020
- CodePilePosts
- Discourse
- Tanzil
- arXiv
- UbuntuIRC
- PubMed
- CodePileReddit2020
- CodePileReddit2021
- GlobalVoices
- FreeLaw_Options
- PileV2Posts

## Dataset Structure

```python
from datasets import load_dataset

# Load every subset of the dataset
load_dataset("CarperAI/pile-v2-small")
```

### How to use it

You can either load the whole dataset as above, or load a specific subset, such as arXiv, by passing its folder as `data_dir`:

```python
from datasets import load_dataset

# Load only the files under data/arxiv
load_dataset("CarperAI/pile-v2-small", data_dir="data/arxiv")
```
Provider:
CarperAI
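Summary of original information

Dataset overview

Basic information

  • Dataset name: pile-v2-small
  • Dataset size: 255MB
  • Data type: text (code and English)
  • Number of samples: about 1,000 samples per subset
  • Task type: text generation
  • Task ID: language modeling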

Languages and multilinguality

  • Languages: English and code
  • Multilinguality: multilingual

Dataset structure

  • Subsets: Bible, TED2020, PileOfLaw, StackExchange, GithubIssues, Opensubtitles, USPTO, S2ORC, DevDocs, CodePileReddit2022, USENET, GNOME, ASFPublicMail, PileV2Reddit2020, CodePilePosts, Discourse, Tanzil, arXiv, UbuntuIRC, PubMed, CodePileReddit2020, CodePileReddit2021, GlobalVoices, FreeLaw_Options, and PileV2Posts.
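
As a rough sanity check on the figures quoted on this page, the sketch below multiplies the number of listed subsets by the stated 1,000 samples per subset and divides the stated 255MB by that total. It assumes the subset list is exhaustive, that every subset holds exactly 1,000 samples, and that 1 MB means 1,000 kB; none of these is confirmed by the dataset card, so the result is only an estimate.

```python
# Subset names exactly as listed on this page.
SUBSETS = [
    "Bible", "TED2020", "PileOfLaw", "StackExchange", "GithubIssues",
    "Opensubtitles", "USPTO", "S2ORC", "DevDocs", "CodePileReddit2022",
    "USENET", "GNOME", "ASFPublicMail", "PileV2Reddit2020", "CodePilePosts",
    "Discourse", "Tanzil", "arXiv", "UbuntuIRC", "PubMed",
    "CodePileReddit2020", "CodePileReddit2021", "GlobalVoices",
    "FreeLaw_Options", "PileV2Posts",
]

SAMPLES_PER_SUBSET = 1_000  # figure stated in the dataset card
TOTAL_SIZE_MB = 255         # figure stated in the dataset card

# Assumption: the list above is complete and every subset has 1,000 samples.
total_samples = len(SUBSETS) * SAMPLES_PER_SUBSET

# Assumption: 1 MB = 1,000 kB (decimal units).
avg_kb_per_sample = TOTAL_SIZE_MB * 1_000 / total_samples

print(f"{len(SUBSETS)} subsets, ~{total_samples:,} samples, "
      f"~{avg_kb_per_sample:.1f} kB per sample on average")
```

The average hides large differences between subsets (a court opinion is much longer than an IRC line), so it should not be read as the typical document length.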

Usage

  • Load the entire dataset: after `from datasets import load_dataset`, call `load_dataset("CarperAI/pile-v2-small")`

  • Load a specific subset: `load_dataset("CarperAI/pile-v2-small", data_dir="data/arxiv")`
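
The two calls above can be wrapped in a small helper, sketched here under stated assumptions. The names `subset_data_dir` and `load_subset` are hypothetical and not part of the `datasets` library. The only folder the dataset card confirms is `data/arxiv`; the folder names of the other subsets are not given, so check the repository's file listing before passing anything else.

```python
from typing import Any, Callable, Optional

# Repository id as written in the dataset card's examples.
REPO_ID = "CarperAI/pile-v2-small"


def subset_data_dir(folder: str) -> str:
    """Build a data_dir value from a folder name, e.g. 'arxiv' -> 'data/arxiv'.

    Assumption: every subset lives under 'data/<folder>'. The card only
    shows this layout for arxiv.
    """
    folder = folder.strip().strip("/")
    if not folder:
        raise ValueError("folder must be a non-empty name")
    return f"data/{folder}"


def load_subset(folder: str, loader: Optional[Callable[..., Any]] = None) -> Any:
    """Load one subset folder (hypothetical helper).

    `loader` defaults to datasets.load_dataset; passing a different
    callable makes the helper easy to test without network access.
    """
    if loader is None:
        # Imported lazily so the module can be imported without `datasets`.
        from datasets import load_dataset
        loader = load_dataset
    return loader(REPO_ID, data_dir=subset_data_dir(folder))
```

Calling `load_subset("arxiv")` is then equivalent to the second call above; it downloads data from the Hugging Face Hub and requires the `datasets` package to be installed.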
