Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20
Hugging Face | Updated 2026-04-03 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20
Resource description:
---
language:
- ar
license: apache-2.0
pretty_name: Ashaar v1 SFT Ready Locked Prompt Maxlen20 (2026-02-21)
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---
# Ashaar v1 SFT-Ready (Locked Prompt, <= 2048 tokens, max 20 bayts)
This dataset is derived from `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed` and prepared as a phase-1 GRPO subset. It keeps only poems of at most 20 complete bayts; the columns, prompt format, and general dataset structure are the same as in the source dataset.
## Locked Prompt
### SYSTEM_PROMPT
أنت شاعر عربي تكتب الشعر العمودي الكلاسيكي.
التزم بالبحر المحدد في كل شطر، واستلهم من الموضوع دون نقله حرفياً.
أخرج الأبيات فقط دون مقدمة أو تعليق.
### USER_TEMPLATE
البحر الأساسي: {base_meter}
الصيغة: {form}
اسم البحر المطلوب: {meter_label}
الموضوع: {description}
اكتب {num_lines} شطراً ملتزماً بصيغة {form} من بحر {base_meter} دون أي شرح إضافي.
## Conditioning Rule
- `meter_label = base_meter` if `form == "تام"`
- else `meter_label = "{form} {base_meter}"`
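The conditioning rule and the user template can be combined as in the minimal sketch below. The function names `make_meter_label` and `render_user_prompt` are hypothetical, and joining the template lines with single newline characters is an assumption; the card does not state the exact separators used when `sft_prompt` was built.

```python
# Locked user template, copied from the USER_TEMPLATE section.
# Assumption: the five lines are joined with single "\n" characters.
USER_TEMPLATE = (
    "البحر الأساسي: {base_meter}\n"
    "الصيغة: {form}\n"
    "اسم البحر المطلوب: {meter_label}\n"
    "الموضوع: {description}\n"
    "اكتب {num_lines} شطراً ملتزماً بصيغة {form} من بحر {base_meter} دون أي شرح إضافي."
)


def make_meter_label(base_meter: str, form: str) -> str:
    """Apply the conditioning rule: a complete (تام) form uses the bare meter name."""
    if form == "تام":
        return base_meter
    return f"{form} {base_meter}"


def render_user_prompt(base_meter: str, form: str, description: str, num_lines: int) -> str:
    """Fill the locked template (hypothetical helper, not the authors' code)."""
    return USER_TEMPLATE.format(
        base_meter=base_meter,
        form=form,
        meter_label=make_meter_label(base_meter, form),
        description=description,
        num_lines=num_lines,
    )
```

For example, `make_meter_label("الكامل", "مجزوء")` yields `"مجزوء الكامل"`, while `make_meter_label("الطويل", "تام")` yields just `"الطويل"`.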
## Added Columns
- `sft_prompt`
- `sft_completion`
- `sft_full_text`
- `sft_num_lines`
- `sft_total_tokens`
## Target Formatting
- `sft_completion` is built from `poem verses` using **real newline characters**.
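A minimal sketch of this formatting, assuming `poem verses` holds a list of strings with one shatr (hemistich) per entry. The helper name `build_completion` is hypothetical, and whether empty entries are dropped before joining is not stated in the card, so the sketch leaves them untouched.

```python
def build_completion(verses: list[str]) -> str:
    """Join verse lines with real newline characters (U+000A).

    Assumption: `verses` is the content of the `poem verses` column,
    one string per shatr. This is not the authors' exact code.
    """
    # "\n" here is an actual line break, not the two-character
    # sequence backslash + "n" that sometimes survives in exported text.
    return "\n".join(verses)
```

A quick sanity check on any row is that `sft_completion.splitlines()` returns one element per verse line and that the literal two-character sequence `\n` does not appear in the text.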
## Filtering
- Inherits the filtering already present in `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed`.
- Additional phase-1 filter applied: keep rows where the poem has `<= 20` complete bayts, computed from `poem verses` after removing empty lines and pairing lines into bayts.
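The phase-1 filter can be approximated with the sketch below. It assumes `poem verses` is a list of strings with one shatr per entry and that an unpaired trailing line does not count as a complete bayt; the function names and the handling of odd line counts are assumptions, not the authors' published code.

```python
def count_complete_bayts(verses: list[str]) -> int:
    """Count complete bayts: drop empty lines, then pair lines two by two."""
    non_empty = [line for line in verses if line.strip()]
    # Integer division ignores a trailing unpaired shatr (assumption).
    return len(non_empty) // 2


def keep_row(verses: list[str], max_bayts: int = 20) -> bool:
    """Return True when the poem has at most `max_bayts` complete bayts."""
    return count_complete_bayts(verses) <= max_bayts
```

Under these assumptions a poem with 40 non-empty lines (20 bayts) is kept, and one with 42 non-empty lines (21 bayts) is removed.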
## Counts
- Source rows from `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed`: **129610**
- Removed by `maxlen20` bayt filter: **11734**
- Final rows kept: **117876**
- Retention: **90.95%**
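The counts above are internally consistent, as this short check shows (variable names are illustrative):

```python
source_rows = 129610   # rows in the source dataset
removed = 11734        # rows dropped by the maxlen20 bayt filter

kept = source_rows - removed
retention_pct = round(100 * kept / source_rows, 2)

print(kept)            # final rows kept
print(retention_pct)   # retention in percent, two decimals
```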
## Notes
- This is a derived dataset repo; the source dataset is unchanged.
- This subset is intended for phase-1 GRPO experiments where shorter poems are preferred for a more controlled RL setup.
- The schema and column names are kept identical to the source dataset.
Provider:
Shaer-AI



