CarperAI/pile-v2-small-filtered
Source: Hugging Face. Updated 2022-12-06; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/CarperAI/pile-v2-small-filtered
Resource description:
---
annotations_creators: []
language_creators:
- crowdsourced
language: ["en","code"]
multilinguality:
- multilingual
size_categories:
- unknown
source_datasets: []
task_categories:
- text-generation
task_ids:
- language-modeling
---
## Dataset Description
A small subset of the [pile-v2]() dataset: each constituent dataset of `pile-v2` contributes 1,000 random samples drawn from the original. In total, the dataset contains 255MB of text (code and English).
## Languages
The dataset contains technical text on programming languages as well as natural-language text, split into the following subsets:
- Bible
- TED2020
- PileOfLaw
- StackExchange
- GithubIssues
- Opensubtitles
- USPTO
- S2ORC
- DevDocs
- CodePileReddit2022
- USENET
- GNOME
- ASFPublicMail
- PileV2Reddit2020
- CodePilePosts
- Discourse
- Tanzil
- arXiv
- UbuntuIRC
- PubMed
- CodePileReddit2020
- CodePileReddit2021
- GlobalVoices
- FreeLaw_Options
- PileV2Posts
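As a rough sanity check on the figures in the description, the subset count can be combined with the stated 1,000 samples per subset and the 255MB total. This is only a back-of-the-envelope sketch: it assumes the list above is complete, that every subset holds exactly 1,000 samples, and that "255MB" means 255 × 10^6 bytes, none of which the card confirms.

```python
# Subset names copied from the list above.
SUBSETS = [
    "Bible", "TED2020", "PileOfLaw", "StackExchange", "GithubIssues",
    "Opensubtitles", "USPTO", "S2ORC", "DevDocs", "CodePileReddit2022",
    "USENET", "GNOME", "ASFPublicMail", "PileV2Reddit2020", "CodePilePosts",
    "Discourse", "Tanzil", "arXiv", "UbuntuIRC", "PubMed",
    "CodePileReddit2020", "CodePileReddit2021", "GlobalVoices",
    "FreeLaw_Options", "PileV2Posts",
]

SAMPLES_PER_SUBSET = 1_000   # stated in the dataset description
TOTAL_BYTES = 255 * 10**6    # assumption: "255MB" is decimal megabytes

# Expected total number of samples if every subset is full.
total_samples = len(SUBSETS) * SAMPLES_PER_SUBSET

# Average size of one sample under the assumptions above.
avg_bytes_per_sample = TOTAL_BYTES / total_samples

print(f"{len(SUBSETS)} subsets, {total_samples} samples in total")
print(f"about {avg_bytes_per_sample / 1000:.1f} kB of text per sample on average")
```

The average hides large differences between subsets; a legal opinion and an IRC line are unlikely to be similar in length.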
## Dataset Structure
```python
from datasets import load_dataset

# Downloads all subsets and returns the loaded dataset object.
dataset = load_dataset("CarperAI/pile-v2-small")
```
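The card does not list the splits or columns, so it can help to inspect them after loading. The sketch below assumes that `load_dataset` returns a `DatasetDict`-like mapping whose values expose `num_rows` and `column_names`, as regular `datasets.Dataset` objects do. `summarize_splits` and `main` are hypothetical helper names, not part of the dataset or the library.

```python
def summarize_splits(dataset_dict):
    """Return {split_name: (num_rows, column_names)} for a DatasetDict-like mapping."""
    return {
        name: (split.num_rows, list(split.column_names))
        for name, split in dataset_dict.items()
    }


def main():
    # Imported here so the helper above can be used without the library installed.
    from datasets import load_dataset

    # Same call as in the snippet above; this downloads the data.
    dataset = load_dataset("CarperAI/pile-v2-small")
    for name, (rows, columns) in summarize_splits(dataset).items():
        print(f"{name}: {rows} rows, columns={columns}")
```

Calling `main()` prints one line per split, which shows the actual field names before any processing code is written against them.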
### How to use it
You can either load the whole dataset as shown above, or load a specific subset, such as arXiv, by passing its folder through `data_dir`:
```python
from datasets import load_dataset
load_dataset("CarperAI/pile-v2-small", data_dir="data/arxiv")
```
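To look at a few records from one subset without downloading everything, the same call can be combined with streaming. This is a minimal sketch: `load_subset` and `peek` are hypothetical helpers, the `"train"` split name is an assumption the card does not confirm, and `data/arxiv` is the only folder path the card gives.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n items of any iterable as a list."""
    return list(islice(records, n))


def load_subset(data_dir, repo_id="CarperAI/pile-v2-small", split="train"):
    """Stream one subset folder instead of downloading the whole dataset.

    The default split name "train" is an assumption; adjust it if the
    repository uses a different one.
    """
    # Imported here so peek() can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, data_dir=data_dir, split=split, streaming=True)


def main():
    # "data/arxiv" is the folder path used in the example above.
    for record in peek(load_subset("data/arxiv"), n=3):
        print(record)
```

Streaming returns an iterable rather than an indexable dataset, so `peek` stops after `n` records instead of reading the entire subset.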
Provider:
CarperAI
Summary of original information
Dataset overview
Basic information
- Dataset name: pile-v2-small
- Dataset size: 255MB
- Data type: text (code and English)
- Number of samples: approximately 1,000 per subset
- Task type: text generation
- Task ID: language modeling
Languages and multilinguality
- Languages: English and code
- Multilinguality: multilingual
Dataset structure
- Subsets include Bible, TED2020, PileOfLaw, StackExchange, GithubIssues, Opensubtitles, USPTO, S2ORC, DevDocs, CodePileReddit2022, USENET, GNOME, ASFPublicMail, PileV2Reddit2020, CodePilePosts, Discourse, Tanzil, arXiv, UbuntuIRC, PubMed, CodePileReddit2020, CodePileReddit2021, GlobalVoices, FreeLaw_Options, and PileV2Posts.
Usage
- Load the whole dataset: `from datasets import load_dataset`, then `load_dataset("CarperAI/pile-v2-small")`
- Load a specific subset: `load_dataset("CarperAI/pile-v2-small", data_dir="data/arxiv")`



