El1iasss/synth-fr-500k-v1
Source: Hugging Face. Updated: 2026-03-12. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/El1iasss/synth-fr-500k-v1
Resource description:
---
pretty_name: synth-fr-500k-v1
license: other
language:
- fr
task_categories:
- text-generation
---
# synth-fr-500k-v1
Subset prepared for CPT experiments.
## Provenance
- Upstream source: `PleIAs/SYNTH`
- Source split: `train`
- Task: `cpt`
- Generated on: `2026-03-12T10:53:32.448525+00:00`
- Strategy: `all_non_memorization_plus_exact_hash_sample_of_memorization`
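The strategy name suggests that every non-memorization example is kept and that the remaining budget is filled with a deterministic, hash-based sample of memorization examples. The following is a minimal sketch of that reading only; it is not the code of the preparation script. The function name `select_subset`, the record fields `id` and `kind`, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def select_subset(rows, target_total, is_memorization, key):
    """Keep all non-memorization rows, then fill the remaining budget
    with the memorization rows that have the smallest content hashes.

    Hypothetical helper: the actual selection logic is not documented here.
    """
    non_mem = [r for r in rows if not is_memorization(r)]
    mem = [r for r in rows if is_memorization(r)]

    # Number of memorization rows needed to reach the target exactly.
    budget = max(target_total - len(non_mem), 0)

    # Sorting by a hash of a stable key makes the sample deterministic
    # and independent of the order in which rows arrive.
    def hash_of(row):
        return hashlib.sha256(key(row).encode("utf-8")).hexdigest()

    return non_mem + sorted(mem, key=hash_of)[:budget]


# Toy data with assumed field names.
rows = [
    {"id": str(i), "kind": "memorization" if i % 2 else "other"}
    for i in range(20)
]
selected = select_subset(
    rows,
    target_total=14,
    is_memorization=lambda r: r["kind"] == "memorization",
    key=lambda r: r["id"],
)
```

With this approach, re-running the selection on the same upstream rows yields the same subset, which is one plausible reason to prefer hashing over a seeded random sample.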
## Selection
- Target total: `500000` examples
- Total retained: `500000` examples
- Non-memorization: `51134` examples
- Memorization: `448866` examples
- Train: `490000` examples
- Eval: `10000` examples
- FR rows serialized on the source side: `2457662`
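The counts above can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the figures from this card; the dictionary keys and the `check_counts` helper are illustrative names, not fields of the manifest.

```python
# Figures copied from the Selection section of this card.
counts = {
    "target_total": 500_000,
    "retained_total": 500_000,
    "non_memorization": 51_134,
    "memorization": 448_866,
    "train": 490_000,
    "eval": 10_000,
}


def check_counts(c):
    """Verify that the category and split counts add up to the retained total."""
    assert c["non_memorization"] + c["memorization"] == c["retained_total"]
    assert c["train"] + c["eval"] == c["retained_total"]
    assert c["retained_total"] == c["target_total"]
    return {
        "memorization_share": c["memorization"] / c["retained_total"],
        "eval_share": c["eval"] / c["retained_total"],
    }


shares = check_counts(counts)
```

Under these figures, memorization examples make up about 89.8% of the subset and the eval split holds 2% of the retained examples.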
## Files
- `processed/train_cpt_fr_synth_fr_500k_strat_v1.jsonl`
- `processed/eval_cpt_fr_synth_fr_500k_strat_v1.jsonl`
- `manifests/cpt_preparation_synth_fr_500k_strat_v1.json`
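The train and eval files are JSON Lines, so they can be read one record per line with the standard library. The sketch below assumes nothing about the record schema, which this card does not document; `iter_jsonl` and `count_records` are hypothetical helper names.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc


def count_records(path):
    """Count records without loading the whole file into memory."""
    return sum(1 for _ in iter_jsonl(path))
```

Once the files have been downloaded locally, `count_records("processed/eval_cpt_fr_synth_fr_500k_strat_v1.jsonl")` should match the eval count reported in the Selection section if the file corresponds to this card.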
## Reproduction
1. Generate the files locally with `scripts/12_prepare_synth_fr_cpt_dataset.py`.
2. Upload this subset with `scripts/14_upload_prepared_dataset_to_hub.py`.
Provider:
El1iasss



