
Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20

Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20
Description:
---
language:
- ar
license: apache-2.0
pretty_name: Ashaar v1 SFT Ready Locked Prompt Maxlen20 (2026-02-21)
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---

# Ashaar v1 SFT-Ready (Locked Prompt, <= 2048 tokens, max 20 bayts)

This dataset is derived from `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed` and prepared as a phase-1 GRPO subset by applying a maximum poem length filter of 20 complete bayts while keeping the same columns, prompt format, and general dataset structure as the source dataset.

## Locked Prompt

### SYSTEM_PROMPT

أنت شاعر عربي تكتب الشعر العمودي الكلاسيكي. التزم بالبحر المحدد في كل شطر، واستلهم من الموضوع دون نقله حرفياً. أخرج الأبيات فقط دون مقدمة أو تعليق.

### USER_TEMPLATE

البحر الأساسي: {base_meter}
الصيغة: {form}
اسم البحر المطلوب: {meter_label}
الموضوع: {description}

اكتب {num_lines} شطراً ملتزماً بصيغة {form} من بحر {base_meter} دون أي شرح إضافي.

## Conditioning Rule

- `meter_label = base_meter` if `form == "تام"`
- else `meter_label = "{form} {base_meter}"`

## Added Columns

- `sft_prompt`
- `sft_completion`
- `sft_full_text`
- `sft_num_lines`
- `sft_total_tokens`

## Target Formatting

- `sft_completion` is built from `poem verses` using **real newline characters**.

## Filtering

- Inherits the filtering already present in `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed`.
- Additional phase-1 filter applied: keep rows where the poem has `<= 20` complete bayts, computed from `poem verses` after removing empty lines and pairing lines into bayts.

## Counts

- Source rows from `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed`: **129610**
- Removed by `maxlen20` bayt filter: **11734**
- Final rows kept: **117876**
- Retention: **90.95%**

## Notes

- This is a derived dataset repo; source dataset is unchanged.
- This subset is intended for phase-1 GRPO experiments where shorter poems are preferred for a more controlled RL setup.
- The schema and column names are kept identical to the source dataset.
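
The conditioning rule, the 20-bayt filter, and the row counts described in the card can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline. The function names (`meter_label`, `count_complete_bayts`, `keep_row`) are hypothetical. It also assumes that `poem verses` is available as a list of strings with one hemistich per entry, and that an unpaired trailing line does not count as a complete bayt; the card does not state either detail.

```python
# Hypothetical helpers illustrating the rules described in the dataset card.
# Names and input shape are assumptions, not taken from the dataset repo.

FULL_FORM = "تام"  # the form value for which the meter name is used unprefixed


def meter_label(base_meter: str, form: str) -> str:
    """Conditioning rule: the base meter alone for the full form, else '{form} {base_meter}'."""
    if form == FULL_FORM:
        return base_meter
    return f"{form} {base_meter}"


def count_complete_bayts(poem_verses: list[str]) -> int:
    """Remove empty lines, then pair the remaining lines into bayts."""
    lines = [line for line in poem_verses if line.strip()]
    # Assumption: a leftover unpaired line is not a complete bayt.
    return len(lines) // 2


def keep_row(poem_verses: list[str], max_bayts: int = 20) -> bool:
    """Phase-1 filter: keep poems with at most `max_bayts` complete bayts."""
    return count_complete_bayts(poem_verses) <= max_bayts


# Arithmetic check of the reported counts.
source_rows = 129610
removed_rows = 11734
kept_rows = source_rows - removed_rows
retention_pct = round(100 * kept_rows / source_rows, 2)

if __name__ == "__main__":
    print(meter_label("الكامل", "مجزوء"))
    print(kept_rows, retention_pct)
```

The subtraction and ratio agree with the figures in the Counts section: 129610 - 11734 = 117876 rows kept, which is 90.95% of the source rows after rounding to two decimals.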
Provider:
Shaer-AI