Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20-drop-majzuu-wafer
Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20-drop-majzuu-wafer
Dataset description:
---
language:
- ar
license: apache-2.0
pretty_name: Ashaar v1 SFT Ready Locked Prompt Maxlen20 Drop Majzuu Wafer (2026-04-03)
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---
# Ashaar v1 SFT-Ready (Locked Prompt, <= 2048 tokens, max 20 bayts, drop مجزوء الوافر)
This dataset is derived from `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20` and keeps the same schema, columns, locked prompt format, and general structure as the upstream phase-1 dataset.
The only additional change is the removal of rows where both of the following conditions hold:
- `base_meter == "الوافر"`
- `form == "مجزوء"`
This removes poems labeled `مجزوء الوافر` from the published phase-1 subset.
## Why this variant exists
The phase-1 meter reward uses a BiLSTM meter classifier that primarily knows base-meter classes rather than all form-specific labels. A focused audit of non-`تام` poems showed that most form variants are still recognized well as their base meter, but `مجزوء الوافر` was a clear outlier.
In the audit:
- non-`تام` poems were scored against their `base_meter`
- overall behavior supported base-meter fallback for most forms
- `مجزوء الوافر` had weak agreement with the base-meter target and was often confused with `الهزج`
This derivative dataset therefore removes `مجزوء الوافر` from the phase-1 training subset to make the meter reward more aligned with the label space of the classifier.
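The audit procedure described above can be sketched as a per-form agreement rate between the classifier's prediction and the row's `base_meter`. This is only an illustration: the `predicted` field, the `agreement_by_form` helper, and the toy records are assumptions made for the example, not the actual audit code or its results.

```python
from collections import defaultdict

def agreement_by_form(records):
    """Return {"<form> <base_meter>": share of poems predicted as base_meter}.

    Each record is a dict with `base_meter`, `form`, and a hypothetical
    `predicted` field holding the classifier's predicted meter.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        if rec["form"] == "تام":
            continue  # the audit only covers non-تام poems
        key = f'{rec["form"]} {rec["base_meter"]}'
        totals[key] += 1
        if rec["predicted"] == rec["base_meter"]:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}

# Toy records, invented purely to exercise the function.
toy_records = [
    {"base_meter": "الكامل", "form": "مجزوء", "predicted": "الكامل"},
    {"base_meter": "الوافر", "form": "مجزوء", "predicted": "الهزج"},
    {"base_meter": "الوافر", "form": "مجزوء", "predicted": "الوافر"},
    {"base_meter": "الطويل", "form": "تام", "predicted": "الطويل"},
]
rates = agreement_by_form(toy_records)
```

A form whose rate falls well below the others would be a candidate for removal, which is the kind of outlier the card reports for `مجزوء الوافر`.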
## Locked Prompt
### SYSTEM_PROMPT
أنت شاعر عربي تكتب الشعر العمودي الكلاسيكي.
التزم بالبحر المحدد في كل شطر، واستلهم من الموضوع دون نقله حرفياً.
أخرج الأبيات فقط دون مقدمة أو تعليق.
### USER_TEMPLATE
البحر الأساسي: {base_meter}
الصيغة: {form}
اسم البحر المطلوب: {meter_label}
الموضوع: {description}
اكتب {num_lines} شطراً ملتزماً بصيغة {form} من بحر {base_meter} دون أي شرح إضافي.
## Conditioning Rule
- `meter_label = base_meter` if `form == "تام"`
- else `meter_label = "{form} {base_meter}"`
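As a minimal sketch, the conditioning rule and the user template can be combined as follows. The function names are hypothetical, and joining the template lines with newline characters is an assumption based on how the template is displayed above.

```python
# Template text copied from the USER_TEMPLATE section; the newline joins
# between lines are an assumption.
USER_TEMPLATE = (
    "البحر الأساسي: {base_meter}\n"
    "الصيغة: {form}\n"
    "اسم البحر المطلوب: {meter_label}\n"
    "الموضوع: {description}\n"
    "اكتب {num_lines} شطراً ملتزماً بصيغة {form} من بحر {base_meter} دون أي شرح إضافي."
)

def make_meter_label(base_meter: str, form: str) -> str:
    """Apply the conditioning rule: bare base meter for تام, else "<form> <base_meter>"."""
    return base_meter if form == "تام" else f"{form} {base_meter}"

def render_user_prompt(base_meter: str, form: str, description: str, num_lines: int) -> str:
    """Fill the user template for one row (hypothetical helper)."""
    return USER_TEMPLATE.format(
        base_meter=base_meter,
        form=form,
        meter_label=make_meter_label(base_meter, form),
        description=description,
        num_lines=num_lines,
    )

prompt = render_user_prompt("الكامل", "مجزوء", "وصف الربيع", 4)
```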
## Added Columns
- `sft_prompt`
- `sft_completion`
- `sft_full_text`
- `sft_num_lines`
- `sft_total_tokens`
## Target Formatting
- `sft_completion` is built from `poem verses` using real newline characters.
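A minimal sketch of that formatting step, assuming the verses are available as a list of strings (the storage type of the source column is not stated here, and `build_completion` is a hypothetical name):

```python
def build_completion(verses: list[str]) -> str:
    """Join verses with real newline characters, not the two-character text "\\n"."""
    return "\n".join(verses)

completion = build_completion(["الشطر الأول", "الشطر الثاني"])
```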
## Filtering
- Inherits all filtering already present in `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20`.
- Additional filter applied here: drop rows where `base_meter == "الوافر"` and `form == "مجزوء"`.
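The additional filter can be expressed as a simple row predicate. The sketch below runs on plain dictionaries; the helper name `keep_row` and the sample rows are illustrative only.

```python
def keep_row(row: dict) -> bool:
    """Keep every row except those labeled مجزوء الوافر (both conditions must hold)."""
    return not (row["base_meter"] == "الوافر" and row["form"] == "مجزوء")

# Illustrative rows, not taken from the dataset.
rows = [
    {"base_meter": "الوافر", "form": "تام"},
    {"base_meter": "الوافر", "form": "مجزوء"},
    {"base_meter": "الكامل", "form": "مجزوء"},
]
kept = [row for row in rows if keep_row(row)]
```

With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter`.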
## Counts
- Source rows from `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20`: **117876**
- Removed by dropping `مجزوء الوافر`: **252**
- Final rows kept: **117624**
- Retention from upstream phase-1 subset: **99.79%**
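The figures above are mutually consistent, as this short check shows (variable names are arbitrary):

```python
source_rows = 117876   # rows in the upstream phase-1 dataset
removed_rows = 252     # rows labeled مجزوء الوافر

final_rows = source_rows - removed_rows
retention_pct = 100 * final_rows / source_rows
retention_text = f"{retention_pct:.2f}%"
```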
## Training-Prep Note
- After the repo's current `load_and_prepare_dataset(...)` preprocessing, the upstream phase-1 dataset yields **117655** usable rows.
- This derivative yields **117404** usable rows after the same preparation path.
## Important Meter-Reward Caveat
- The current meter reward does **not** directly penalize or reward the exact **form realization** of a meter when the classifier lacks that form-specific label.
- In practice, unsupported labels are evaluated through their **base meter**.
- This means the phase-1 meter reward is primarily a **base-meter correctness** signal, not a full form-sensitive metrical validator.
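The fallback behavior described above can be sketched as follows. This is not the repo's reward code: `reward_target` and the `CLASSIFIER_LABELS` set are hypothetical and stand in for whatever label space the BiLSTM classifier actually exposes.

```python
# Hypothetical classifier label space containing only base meters.
CLASSIFIER_LABELS = {"الطويل", "الكامل", "الوافر", "الهزج"}

def reward_target(base_meter: str, form: str, classifier_labels: set[str]) -> str:
    """Pick the label the meter reward would score a poem against.

    If the classifier knows the form-specific label, use it; otherwise
    fall back to the base meter.
    """
    meter_label = base_meter if form == "تام" else f"{form} {base_meter}"
    return meter_label if meter_label in classifier_labels else base_meter

target = reward_target("الكامل", "مجزوء", CLASSIFIER_LABELS)
```

Under this sketch a `مجزوء الكامل` poem is scored against `الكامل`, so a generation in the full form of the meter would receive the same reward as one in the requested form.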
## Notes
- This is a derived dataset repo; upstream datasets are unchanged.
- The schema and column names are kept identical to the upstream dataset.
- This subset is intended for phase-1 GRPO experiments where a cleaner alignment between dataset labels and the meter classifier is preferred.
Provided by:
Shaer-AI



