15juneee/romanian-corpus
Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/15juneee/romanian-corpus
Resource description:
---
language:
- ro
license: other
license_name: mixed-sources
license_link: https://huggingface.co/datasets/15juneee/romanian-corpus
multilinguality:
- monolingual
size_categories:
- 100M<n<1B
source_datasets:
- uonlp/CulturaX
- wikimedia/wikipedia
tags:
- romanian
- pretraining
- text
configs:
  - config_name: all
    data_files: "*/train/*.parquet"
    default: true
  - config_name: wikipedia
    data_files: "wikipedia/train/*.parquet"
  - config_name: mC4
    data_files: "mC4/train/*.parquet"
  - config_name: OSCAR-2019
    data_files: "OSCAR-2019/train/*.parquet"
  - config_name: OSCAR-2109
    data_files: "OSCAR-2109/train/*.parquet"
  - config_name: OSCAR-2201
    data_files: "OSCAR-2201/train/*.parquet"
  - config_name: OSCAR-2301
    data_files: "OSCAR-2301/train/*.parquet"
  - config_name: eurlex
    data_files: "eurlex/train/*.parquet"
  - config_name: legislatie_just_ro
    data_files: "legislatie.just.ro/train/*.parquet"
  - config_name: cdep_ro
    data_files: "cdep.ro/train/*.parquet"
---
# Romanian Text Corpus
A comprehensive, high-quality Romanian text corpus for language model pretraining.
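Built by collecting and cleaning text from five Romanian-language sources: Wikipedia, EUR-Lex, legislatie.just.ro, cdep.ro, and CulturaX (which contributes the mC4 and OSCAR subsets).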
## Dataset Summary
- **Total documents:** 19,886,412
- **Estimated tokens:** ~20.8B
- **Language:** Romanian (ro)
- **Format:** Parquet (zstd compressed)
## Source Breakdown
| Source | Documents |
|--------|-----------|
| mC4 | 16,875,310 |
| OSCAR-2109 | 881,722 |
| OSCAR-2301 | 704,312 |
| OSCAR-2019 | 703,991 |
| OSCAR-2201 | 439,778 |
| wikipedia | 224,193 |
| legislatie.just.ro | 48,800 |
| eurlex | 5,547 |
| cdep.ro | 2,759 |
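
As a quick consistency check, the per-source counts in the table can be summed and converted to shares of the corpus. The sketch below hard-codes the numbers from the table; the variable names are illustrative and not part of the dataset.

```python
# Document counts copied from the Source Breakdown table.
SOURCE_DOCS = {
    "mC4": 16_875_310,
    "OSCAR-2109": 881_722,
    "OSCAR-2301": 704_312,
    "OSCAR-2019": 703_991,
    "OSCAR-2201": 439_778,
    "wikipedia": 224_193,
    "legislatie.just.ro": 48_800,
    "eurlex": 5_547,
    "cdep.ro": 2_759,
}

total_docs = sum(SOURCE_DOCS.values())

# Share of the corpus contributed by each source, in percent.
shares = {name: 100 * count / total_docs for name, count in SOURCE_DOCS.items()}

for name, pct in shares.items():
    print(f"{name:<20} {SOURCE_DOCS[name]:>12,} {pct:6.2f}%")
print(f"{'total':<20} {total_docs:>12,}")
```

The sum matches the total reported in the Dataset Summary, and mC4 alone accounts for roughly 85% of the documents, so a model trained on the `all` config will mostly see web-crawl text.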
## Cleaning Pipeline
Each document passed through a 5-stage pipeline before inclusion:
1. **Basic cleaning** — encoding fixes (ftfy), HTML tag removal, whitespace normalization, length filter (100–100,000 words)
2. **Language detection** — lingua detector + Romanian diacritics ratio filter (≥0.3%)
3. **Boilerplate removal** — lines appearing 100+ times across the corpus are removed (cookie banners, nav text, legal disclaimers)
4. **Deduplication** — URL + content hash dedup across all sources (priority: Wikipedia > EUR-Lex > legislation > parliament > CulturaX)
5. **Shuffle** — reservoir shuffle to randomize training order
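
The stages above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline that produced the corpus: the diacritics ratio is assumed to be measured over alphabetic characters, "appearing 100+ times" is assumed to mean total line occurrences, SHA-256 is an arbitrary choice of content hash, and the source labels in `PRIORITY` are hypothetical. The ftfy, HTML-stripping, lingua, and shuffle steps are omitted.

```python
import hashlib
import re
from collections import Counter

# Romanian-specific letters, including the legacy cedilla variants of s and t.
RO_DIACRITICS = set("ăâîșțĂÂÎȘȚşţŞŢ")

MIN_WORDS, MAX_WORDS = 100, 100_000  # stage 1 length filter
MIN_DIACRITICS_RATIO = 0.003         # stage 2 threshold (0.3%)
BOILERPLATE_MIN_COUNT = 100          # stage 3 threshold

# Stage 4 priority, highest first. Labels are illustrative assumptions.
PRIORITY = ["wikipedia", "eurlex", "legislation", "parliament", "culturax"]


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs, trim each line, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def passes_length_filter(text):
    """Stage 1: keep documents with 100 to 100,000 whitespace-separated words."""
    return MIN_WORDS <= len(text.split()) <= MAX_WORDS


def diacritics_ratio(text):
    """Stage 2 (assumed definition): diacritics as a fraction of all letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in RO_DIACRITICS for ch in letters) / len(letters)


def remove_boilerplate(docs, min_count=BOILERPLATE_MIN_COUNT):
    """Stage 3: drop lines that occur at least `min_count` times across all docs."""
    counts = Counter(line for doc in docs for line in doc.splitlines())
    return [
        "\n".join(line for line in doc.splitlines() if counts[line] < min_count)
        for doc in docs
    ]


def deduplicate(records):
    """Stage 4: keep one record per URL and per content hash, by source priority."""
    rank = {name: i for i, name in enumerate(PRIORITY)}
    seen_urls, seen_hashes, kept = set(), set(), []
    # sorted() is stable, so ties keep their original order.
    for rec in sorted(records, key=lambda r: rank.get(r["source"], len(PRIORITY))):
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if (rec["url"] and rec["url"] in seen_urls) or digest in seen_hashes:
            continue
        if rec["url"]:
            seen_urls.add(rec["url"])
        seen_hashes.add(digest)
        kept.append(rec)
    return kept
```

A document would be kept when `passes_length_filter(text)` holds and `diacritics_ratio(text) >= MIN_DIACRITICS_RATIO`. Sorting by priority before deduplicating means that when the same text appears in both Wikipedia and a web crawl, the Wikipedia copy is the one that survives.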
## Configs / Subsets
Load a specific source or the full corpus:
```python
from datasets import load_dataset
# Full corpus
ds = load_dataset("15juneee/romanian-corpus", "all", split="train")
# Individual sources
wiki = load_dataset("15juneee/romanian-corpus", "wikipedia", split="train")
mc4 = load_dataset("15juneee/romanian-corpus", "mC4", split="train")
o19 = load_dataset("15juneee/romanian-corpus", "OSCAR-2019", split="train")
o21 = load_dataset("15juneee/romanian-corpus", "OSCAR-2109", split="train")
o22 = load_dataset("15juneee/romanian-corpus", "OSCAR-2201", split="train")
o23 = load_dataset("15juneee/romanian-corpus", "OSCAR-2301", split="train")
law = load_dataset("15juneee/romanian-corpus", "eurlex", split="train")
leg = load_dataset("15juneee/romanian-corpus", "legislatie_just_ro", split="train")
par = load_dataset("15juneee/romanian-corpus", "cdep_ro", split="train")
```
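
Because the full corpus is large, it may be more practical to stream a sample than to download every Parquet file. The sketch below assumes the `streaming=True` option of `load_dataset` works for this repository as it does for other Parquet-backed datasets; `sample_stream` and `word_counts_by_source` are hypothetical helper names.

```python
from collections import Counter
from itertools import islice


def word_counts_by_source(rows):
    """Sum the `word_count` field per `source` over any iterable of rows."""
    totals = Counter()
    for row in rows:
        totals[row["source"]] += row["word_count"]
    return totals


def sample_stream(config="all", n=1000, repo_id="15juneee/romanian-corpus"):
    """Lazily yield the first `n` rows of a config without a full download."""
    from datasets import load_dataset  # imported lazily; requires network access

    stream = load_dataset(repo_id, config, split="train", streaming=True)
    return islice(stream, n)
```

For example, `word_counts_by_source(sample_stream("wikipedia", n=100))` would total the word counts of the first 100 streamed Wikipedia rows.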
## Schema
| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Document text |
| `source` | string | Source identifier |
| `url` | string | Original URL (empty if unavailable) |
| `word_count` | int32 | Word count after cleaning |
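
For downstream code, the four fields can be mirrored in a small typed record. This is a sketch of one way to validate rows against the schema above; `CorpusRecord` and `from_row` are hypothetical names, not something shipped with the dataset.

```python
from dataclasses import dataclass

INT32_MAX = 2**31 - 1


@dataclass(frozen=True)
class CorpusRecord:
    text: str
    source: str
    url: str         # empty string when no URL is available
    word_count: int  # stored as int32 in the Parquet files

    def __post_init__(self):
        for name in ("text", "source", "url"):
            if not isinstance(getattr(self, name), str):
                raise TypeError(f"{name} must be a string")
        if not isinstance(self.word_count, int) or not 0 <= self.word_count <= INT32_MAX:
            raise ValueError("word_count must be a non-negative int32 value")


def from_row(row):
    """Build a CorpusRecord from a dict-like row, e.g. one item of a loaded split."""
    return CorpusRecord(row["text"], row["source"], row["url"], row["word_count"])
```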
## License
Each source carries its own license:
| Source | License |
|--------|---------|
| `wikipedia` | [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) |
| `mC4`, `OSCAR-*` | [mC4](https://huggingface.co/datasets/mc4) + [OSCAR](https://huggingface.co/datasets/oscar) licenses (non-commercial research) |
| `eurlex` | EU public domain |
| `legislatie_just_ro` | Romanian public domain |
| `cdep_ro` | Romanian public domain |
**Important:** Due to the mC4/OSCAR subsets, the full `all` config inherits a non-commercial research restriction. For commercial use, load only `wikipedia`, `eurlex`, `legislatie_just_ro`, and `cdep_ro`.
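
One way to follow this recommendation in code is to derive the list of unrestricted configs from the license table and load only those. The sketch below transcribes the restriction from the table above; `load_commercial_subsets` is a hypothetical helper, and concatenating the subsets assumes they all share the schema described earlier.

```python
# Config names as declared in the card's YAML header.
ALL_SOURCE_CONFIGS = [
    "wikipedia",
    "mC4",
    "OSCAR-2019",
    "OSCAR-2109",
    "OSCAR-2201",
    "OSCAR-2301",
    "eurlex",
    "legislatie_just_ro",
    "cdep_ro",
]

# Subsets the license table marks as non-commercial research only.
NON_COMMERCIAL = {"mC4", "OSCAR-2019", "OSCAR-2109", "OSCAR-2201", "OSCAR-2301"}

COMMERCIAL_CONFIGS = [c for c in ALL_SOURCE_CONFIGS if c not in NON_COMMERCIAL]


def load_commercial_subsets(repo_id="15juneee/romanian-corpus"):
    """Load and concatenate only the subsets without the non-commercial restriction."""
    from datasets import concatenate_datasets, load_dataset  # imported lazily

    parts = [load_dataset(repo_id, name, split="train") for name in COMMERCIAL_CONFIGS]
    return concatenate_datasets(parts)
```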
Provider:
15juneee



