
Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20-drop-majzuu-wafer

Hugging Face, updated 2026-04-03, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20-drop-majzuu-wafer
Description:
---
language:
- ar
license: apache-2.0
pretty_name: Ashaar v1 SFT Ready Locked Prompt Maxlen20 Drop Majzuu Wafer (2026-04-03)
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---

# Ashaar v1 SFT-Ready (Locked Prompt, <= 2048 tokens, max 20 bayts, drop مجزوء الوافر)

This dataset is derived from `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20` and keeps the same schema, columns, locked prompt format, and general structure as the upstream phase-1 dataset.

The only additional change is the removal of rows where:

- `base_meter == "الوافر"`
- `form == "مجزوء"`

This removes poems labeled `مجزوء الوافر` from the published phase-1 subset.

## Why this variant exists

The phase-1 meter reward uses a BiLSTM meter classifier that primarily knows base-meter classes rather than all form-specific labels. A focused audit of non-`تام` poems showed that most form variants are still recognized well as their base meter, but `مجزوء الوافر` was a clear outlier.

In the audit:

- non-`تام` poems were scored against their `base_meter`
- overall behavior supported base-meter fallback for most forms
- `مجزوء الوافر` had weak agreement with the base-meter target and was often confused with `الهزج`

This derivative dataset therefore removes `مجزوء الوافر` from the phase-1 training subset to make the meter reward more aligned with the label space of the classifier.

## Locked Prompt

### SYSTEM_PROMPT

أنت شاعر عربي تكتب الشعر العمودي الكلاسيكي. التزم بالبحر المحدد في كل شطر، واستلهم من الموضوع دون نقله حرفياً. أخرج الأبيات فقط دون مقدمة أو تعليق.

### USER_TEMPLATE

البحر الأساسي: {base_meter}
الصيغة: {form}
اسم البحر المطلوب: {meter_label}
الموضوع: {description}
اكتب {num_lines} شطراً ملتزماً بصيغة {form} من بحر {base_meter} دون أي شرح إضافي.
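As a rough illustration of how the locked prompt could be filled in, the sketch below renders the user template for one row. It derives `meter_label` using the conditioning rule stated in this card (`meter_label = base_meter` when `form == "تام"`, otherwise `"{form} {base_meter}"`). The helper names `meter_label` and `build_user_prompt` are hypothetical, and the newline separators between template fields are an assumption, since the exact whitespace of the published template is not recoverable from this page. How `sft_prompt` combines the system and user parts is not specified here and is not modeled.

```python
# Locked system prompt, copied verbatim from the dataset card.
SYSTEM_PROMPT = (
    "أنت شاعر عربي تكتب الشعر العمودي الكلاسيكي. "
    "التزم بالبحر المحدد في كل شطر، واستلهم من الموضوع دون نقله حرفياً. "
    "أخرج الأبيات فقط دون مقدمة أو تعليق."
)

# User template fields from the card. ASSUMPTION: fields are newline-separated.
USER_TEMPLATE = "\n".join(
    [
        "البحر الأساسي: {base_meter}",
        "الصيغة: {form}",
        "اسم البحر المطلوب: {meter_label}",
        "الموضوع: {description}",
        "اكتب {num_lines} شطراً ملتزماً بصيغة {form} من بحر {base_meter} دون أي شرح إضافي.",
    ]
)


def meter_label(base_meter: str, form: str) -> str:
    """Conditioning rule from the card: the base meter alone for the full form,
    otherwise the form followed by the base meter."""
    return base_meter if form == "تام" else f"{form} {base_meter}"


def build_user_prompt(base_meter: str, form: str, description: str, num_lines: int) -> str:
    """Hypothetical helper: fill the user template for one dataset row."""
    return USER_TEMPLATE.format(
        base_meter=base_meter,
        form=form,
        meter_label=meter_label(base_meter, form),
        description=description,
        num_lines=num_lines,
    )


if __name__ == "__main__":
    # Toy values, not taken from the dataset.
    print(build_user_prompt("الكامل", "مجزوء", "وصف الربيع", 4))
```

Keeping the label derivation in one small function makes it easy to check that prompts built at training time and at evaluation time agree on the requested meter name.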
## Conditioning Rule

- `meter_label = base_meter` if `form == "تام"`
- else `meter_label = "{form} {base_meter}"`

## Added Columns

- `sft_prompt`
- `sft_completion`
- `sft_full_text`
- `sft_num_lines`
- `sft_total_tokens`

## Target Formatting

- `sft_completion` is built from `poem verses` using real newline characters.

## Filtering

- Inherits all filtering already present in `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20`.
- Additional filter applied here: drop rows where `base_meter == "الوافر"` and `form == "مجزوء"`.

## Counts

- Source rows from `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20`: **117876**
- Removed by dropping `مجزوء الوافر`: **252**
- Final rows kept: **117624**
- Retention from upstream phase-1 subset: **99.79%**

## Training-Prep Note

- After the repo's current `load_and_prepare_dataset(...)` preprocessing, the upstream phase-1 dataset yields **117655** usable rows.
- This derivative yields **117404** usable rows after the same preparation path.

## Important Meter-Reward Caveat

- The current meter reward does **not** directly penalize or reward the exact **form realization** of a meter when the classifier lacks that form-specific label.
- In practice, unsupported labels are evaluated through their **base meter**.
- This means the phase-1 meter reward is primarily a **base-meter correctness** signal, not a full form-sensitive metrical validator.

## Notes

- This is a derived dataset repo; upstream datasets are unchanged.
- The schema and column names are kept identical to the upstream dataset.
- This subset is intended for phase-1 GRPO experiments where a cleaner alignment between dataset labels and the meter classifier is preferred.
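The additional filter and the reported counts can be illustrated with a small sketch. It is not the repository's actual preprocessing code: the helper names `keep_row`, `drop_majzuu_wafer`, and `retention_pct` are hypothetical, the rows are toy data, and plain Python dictionaries stand in for dataset rows. Only the column names `base_meter` and `form`, the two label values, and the row counts come from this card.

```python
# Label values named in the card's filtering rule.
DROP_BASE_METER = "الوافر"
DROP_FORM = "مجزوء"


def keep_row(row: dict) -> bool:
    """Return False only when BOTH conditions hold (base meter and form)."""
    return not (row["base_meter"] == DROP_BASE_METER and row["form"] == DROP_FORM)


def drop_majzuu_wafer(rows: list) -> list:
    """Hypothetical helper: keep every row except those labeled مجزوء الوافر."""
    return [row for row in rows if keep_row(row)]


def retention_pct(kept: int, total: int) -> float:
    """Share of rows kept, as a percentage rounded to two decimals."""
    return round(100.0 * kept / total, 2)


# Toy rows, not taken from the dataset.
rows = [
    {"base_meter": "الوافر", "form": "مجزوء"},  # dropped: both conditions hold
    {"base_meter": "الوافر", "form": "تام"},    # kept: form differs
    {"base_meter": "الكامل", "form": "مجزوء"},  # kept: base meter differs
    {"base_meter": "الطويل", "form": "تام"},    # kept
]
kept = drop_majzuu_wafer(rows)

if __name__ == "__main__":
    print(len(rows), "->", len(kept))
    # Counts reported in the card: 117876 source rows, 252 removed.
    print(117876 - 252, retention_pct(117624, 117876))
```

The same predicate could be passed to `Dataset.filter` in the Hugging Face `datasets` library when working with the published repo; the arithmetic at the end simply re-derives the card's final row count and retention figure from its source and removed counts.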
Provider:
Shaer-AI