Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
Source: Hugging Face. Updated: 2026-04-07. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits
Description:
---
language:
- ar
license: apache-2.0
pretty_name: Ashaar Enhanced Description SFT Stratified Splits
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---
# Ashaar Enhanced Description SFT Stratified Splits
Source dataset:
- `Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500`
Target dataset:
- `Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits`
This dataset publishes deterministic `train` / `eval` / `test` splits of the source dataset, allocated at roughly 94% / 3% / 3%.
## Split policy
Primary stratification key (the combination of three fields):
- `base_meter`
- `form`
- `length_bucket`
Length buckets:
- `1-3`
- `4-6`
- `7-10`
- `11-20`
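The bucket boundaries above can be sketched as a small lookup. This is an illustration only: the function names `length_bucket` and `joint_key` are hypothetical, the unit of the length value is not stated in this card, and the `||` separator is taken from the `sampler_group` format shown under Notes.

```python
def length_bucket(length: int) -> str:
    """Map an integer length in 1..20 to one of the four bucket labels."""
    if not 1 <= length <= 20:
        raise ValueError(f"length must be in 1..20, got {length}")
    if length <= 3:
        return "1-3"
    if length <= 6:
        return "4-6"
    if length <= 10:
        return "7-10"
    return "11-20"


def joint_key(base_meter: str, form: str, length: int) -> str:
    """Build the joint stratification key from the three fields.

    Assumption: fields are joined with '||', as in the sampler_group format.
    """
    return "||".join([base_meter, form, length_bucket(length)])
```

The placeholder values passed to `joint_key` in practice would be whatever strings the `base_meter` and `form` columns contain.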
Groups too small to stratify at this level fall back to progressively coarser levels: `base_meter` + `form`, then `base_meter` alone, then a single global group.
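A minimal sketch of how such a fallback could work, under stated assumptions: the threshold `min_size`, the `id` field, and the hash-based ordering used for determinism are all hypothetical and are not described in this card. Only the four level names and the three field names come from the card itself.

```python
import hashlib
from collections import Counter, defaultdict

# Levels from finest to coarsest; names match the fallback levels in this card.
LEVELS = [
    ("base_meter_form_length_bucket", ("base_meter", "form", "length_bucket")),
    ("base_meter_form", ("base_meter", "form")),
    ("base_meter", ("base_meter",)),
    ("global", ()),
]


def _key(row, fields):
    return "||".join(row[f] for f in fields) or "global"


def assign_split_groups(rows, min_size=50):
    """For each row, pick the finest level whose group has >= min_size rows.

    min_size is a hypothetical threshold; the global level always accepts.
    """
    counts = {name: Counter(_key(r, fields) for r in rows) for name, fields in LEVELS}
    assigned = []
    for r in rows:
        for name, fields in LEVELS:
            key = _key(r, fields)
            if name == "global" or counts[name][key] >= min_size:
                assigned.append((name, key))
                break
    return assigned


def split_rows(rows, eval_frac=0.03, test_frac=0.03, min_size=50, id_field="id"):
    """Allocate eval/test quotas per split group; the remainder goes to train."""
    groups = defaultdict(list)
    for r, group in zip(rows, assign_split_groups(rows, min_size)):
        groups[group].append(r)
    splits = {"train": [], "eval": [], "test": []}
    for group in sorted(groups):
        # Order members by a hash of their id so the result does not depend
        # on input order or on a random seed.
        members = sorted(
            groups[group],
            key=lambda r: hashlib.sha256(str(r[id_field]).encode()).hexdigest(),
        )
        n_eval = round(len(members) * eval_frac)
        n_test = round(len(members) * test_frac)
        splits["eval"] += members[:n_eval]
        splits["test"] += members[n_eval:n_eval + n_test]
        splits["train"] += members[n_eval + n_test:]
    return splits
```

In this sketch, the pair returned by `assign_split_groups` plays the role of the fallback level and the `split_group` value, while the fine-grained joint key stays available separately for sampling.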
## Counts
- train: **109070**
- eval: **3481**
- test: **3481**
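As a quick sanity check, the counts above can be summed and converted to percentages. The snippet uses only the numbers listed in this card.

```python
counts = {"train": 109070, "eval": 3481, "test": 3481}

total = sum(counts.values())
# Share of each split as a percentage, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(total)   # 116032
print(shares)  # {'train': 94.0, 'eval': 3.0, 'test': 3.0}
```

The total of 116032 matches the number of examples reported for the finest stratification level.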
## Stratification fallback levels used
- `base_meter_form_length_bucket`: **116032**
- `base_meter_form`: **0**
- `base_meter`: **0**
- `global`: **0**
All 116032 examples were allocated at the finest level, so no fallback was needed in this release.
## Notes
- `sampler_group` keeps the fine-grained joint group `base_meter||form||length_bucket`
- `split_group` is the actual group used to allocate split quotas after fallback
- weighted sampling should be applied only to the `train` split
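One possible way to use `sampler_group` for weighted sampling on the `train` split is sketched below. The inverse-frequency weighting is an assumption chosen for illustration; this card does not say which weighting scheme is intended, and the function names are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(sampler_groups):
    """Give each example weight 1 / (size of its sampler_group).

    Every group then carries the same total weight, so rare groups are
    drawn as often as common ones.
    """
    counts = Counter(sampler_groups)
    return [1.0 / counts[g] for g in sampler_groups]


def sample_train_indices(sampler_groups, k, seed=0):
    """Draw k example indices with replacement, using the weights above."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(sampler_groups)
    return rng.choices(range(len(sampler_groups)), weights=weights, k=k)
```

Here `sampler_groups` would be the `sampler_group` column of the `train` split only; the `eval` and `test` splits are left unweighted so that they keep their stratified distribution.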
Provider:
Shaer-AI



