danbraunai/pile-uncopyrighted
Hugging Face. Updated 2026-02-08. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/danbraunai/pile-uncopyrighted
Description:
---
license: mit
dataset_info:
source: monology/pile-uncopyrighted
---
# Pile Uncopyrighted (with train/val/test splits)
This is [monology/pile-uncopyrighted](https://huggingface.co/datasets/monology/pile-uncopyrighted)
re-split into train, val, and test sets.
The original dataset has a single "train" split. This version takes the last
100,000 rows as `test`, the preceding 1,000,000 rows as `val`, and
everything else as `train`.
## Creation script
```python
from datasets import DatasetDict, load_dataset

# Load the single "train" split of the source dataset.
ds = load_dataset("monology/pile-uncopyrighted", split="train")
n = len(ds)

# Split sizes, in rows.
VAL_SIZE = 1_000_000
TEST_SIZE = 100_000

# Carve val and test off the end; everything before them is train.
result = DatasetDict({
    "train": ds.select(range(n - VAL_SIZE - TEST_SIZE)),
    "val": ds.select(range(n - VAL_SIZE - TEST_SIZE, n - TEST_SIZE)),
    "test": ds.select(range(n - TEST_SIZE, n)),
})
result.push_to_hub("danbraunai/pile-uncopyrighted")
```
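The index arithmetic behind the split can be checked on a toy size without downloading anything. The sketch below is illustrative only: `split_ranges` is a hypothetical helper, not part of the `datasets` library or of the script above, and it assumes the same "test at the end, val just before it" layout described in this card.

```python
def split_ranges(n, val_size=1_000_000, test_size=100_000):
    """Return the row-index ranges for train/val/test (hypothetical helper).

    The last `test_size` rows are test, the `val_size` rows before them
    are val, and all earlier rows are train.
    """
    if n < val_size + test_size:
        raise ValueError("dataset is smaller than val_size + test_size")
    train_end = n - val_size - test_size  # first row of val
    val_end = n - test_size               # first row of test
    return {
        "train": range(0, train_end),
        "val": range(train_end, val_end),
        "test": range(val_end, n),
    }


# Toy example: 10 rows, 3 for val, 2 for test.
toy = split_ranges(10, val_size=3, test_size=2)
for name, idx in toy.items():
    print(name, list(idx))
```

With 10 rows this puts rows 0 to 4 in `train`, rows 5 to 7 in `val`, and rows 8 to 9 in `test`. The three ranges are contiguous and non-overlapping, so every row lands in exactly one split.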
Provider:
danbraunai



