Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20-drop-majzuu-wafer-drop-motadarak
Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20-drop-majzuu-wafer-drop-motadarak
Resource description:
---
language:
- ar
license: apache-2.0
pretty_name: Ashaar v1 SFT Ready Locked Prompt Maxlen20 Drop Majzuu Wafer Drop Motadarak (2026-04-03)
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---
# Ashaar v1 SFT-Ready (Locked Prompt, <= 2048 tokens, max 20 bayts, drop مجزوء الوافر, drop المتدارك)
This dataset is derived from `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20-drop-majzuu-wafer` and keeps the same schema, columns, locked prompt format, and general structure as the upstream phase-1 dataset.
The only additional change is the removal of rows where:
- `base_meter == "المتدارك"`
This removes all poems whose base meter is `المتدارك` from the published phase-1 subset.
## Why this variant exists
The phase-1 meter reward is explicitly base-meter-first. A dedicated base-only sanity study, with `n=5` prompts per base meter and `k=5` sampled candidates per prompt, showed that `المتدارك` remained a clear outlier even under the best-of-`k` view (each prompt judged by its best candidate), which is the view more relevant to GRPO.
In that study:
- `المتدارك` had very weak candidate-level scores
- `المتدارك` also failed the prompt-level best-of-`k` criterion
- no prompt produced a candidate that crossed the useful quality thresholds used for review
Because the goal of this phase-1 subset is to align the training data with the available meter reward and keep RL signal reasonably clean, this derivative removes `المتدارك` from the current phase-1 dataset.
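The difference between the candidate-level view and the prompt-level best-of-`k` view can be sketched as follows. This is only an illustration: the function name `summarize_meter`, the score values, and the `threshold` are hypothetical, since the card does not state the actual scores or review thresholds.

```python
from statistics import mean


def summarize_meter(scores_by_prompt, threshold):
    """Summarize one base meter from per-prompt lists of k candidate scores."""
    # Candidate-level view: pool every sampled candidate.
    all_scores = [s for prompt_scores in scores_by_prompt for s in prompt_scores]
    # Prompt-level best-of-k view: keep only the best candidate per prompt.
    best_per_prompt = [max(prompt_scores) for prompt_scores in scores_by_prompt]
    return {
        "candidate_mean": mean(all_scores),
        "best_of_k_mean": mean(best_per_prompt),
        "prompts_passing": sum(b >= threshold for b in best_per_prompt),
    }


# Hypothetical scores for illustration only: n=5 prompts, k=5 candidates each.
weak_meter_scores = [[0.1, 0.2, 0.0, 0.3, 0.1] for _ in range(5)]
summary = summarize_meter(weak_meter_scores, threshold=0.8)  # threshold is assumed
```

A meter for which `prompts_passing` stays at zero gives a group-based RL method no high-reward candidate to learn from, which matches the motivation given above for dropping it.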
## Locked Prompt
### SYSTEM_PROMPT
أنت شاعر عربي تكتب الشعر العمودي الكلاسيكي.
التزم بالبحر المحدد في كل شطر، واستلهم من الموضوع دون نقله حرفياً.
أخرج الأبيات فقط دون مقدمة أو تعليق.
### USER_TEMPLATE
البحر الأساسي: {base_meter}
الصيغة: {form}
اسم البحر المطلوب: {meter_label}
الموضوع: {description}
اكتب {num_lines} شطراً ملتزماً بصيغة {form} من بحر {base_meter} دون أي شرح إضافي.
## Conditioning Rule
- `meter_label = base_meter` if `form == "تام"`
- else `meter_label = "{form} {base_meter}"`
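The conditioning rule and the user template can be combined in a short sketch. The helper names `meter_label` and `build_user_prompt` are hypothetical, the template lines are assumed to be joined by single newlines, and the example meter, form, and description values are illustrative only. How the system and user prompts are assembled into `sft_prompt` is not stated in the card and is not modeled here.

```python
# Template text copied from the USER_TEMPLATE section; newline joins are assumed.
USER_TEMPLATE = (
    "البحر الأساسي: {base_meter}\n"
    "الصيغة: {form}\n"
    "اسم البحر المطلوب: {meter_label}\n"
    "الموضوع: {description}\n"
    "اكتب {num_lines} شطراً ملتزماً بصيغة {form} من بحر {base_meter} دون أي شرح إضافي."
)


def meter_label(base_meter, form):
    """Apply the conditioning rule: the full form uses the bare base meter name."""
    if form == "تام":
        return base_meter
    return f"{form} {base_meter}"


def build_user_prompt(base_meter, form, description, num_lines):
    """Fill the user template for one row (hypothetical helper)."""
    return USER_TEMPLATE.format(
        base_meter=base_meter,
        form=form,
        meter_label=meter_label(base_meter, form),
        description=description,
        num_lines=num_lines,
    )


# Illustrative values, not taken from the dataset.
example_prompt = build_user_prompt("الكامل", "مجزوء", "وصف الربيع", 4)
```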
## Added Columns
- `sft_prompt`
- `sft_completion`
- `sft_full_text`
- `sft_num_lines`
- `sft_total_tokens`
## Target Formatting
- `sft_completion` is built from `poem verses` using real newline characters.
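A minimal sketch of that formatting step, assuming `poem verses` holds a list of strings (one per line of the poem). The helper name `build_completion` is hypothetical and the placeholder strings stand in for real verses.

```python
def build_completion(verses):
    """Join verses with real newline characters, not the two-character text backslash-n."""
    return "\n".join(verses)


# Placeholder strings stand in for real verses.
completion = build_completion(["line one", "line two", "line three"])
line_count = len(completion.split("\n"))
```

Using a real newline means the completion splits back into the same number of lines, which keeps a line-count column such as `sft_num_lines` easy to cross-check.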
## Filtering
- Inherits all filtering already present in `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20-drop-majzuu-wafer`.
- Additional filter applied here: drop rows where `base_meter == "المتدارك"`.
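The additional filter amounts to a single row predicate. The sketch below applies it to a plain list of dictionaries; the function name `keep_row` and the sample rows are hypothetical.

```python
DROPPED_BASE_METER = "المتدارك"


def keep_row(row):
    """Return True for rows that survive the additional filter."""
    return row["base_meter"] != DROPPED_BASE_METER


# Illustrative rows only; real rows carry the full upstream schema.
rows = [
    {"base_meter": "الطويل", "form": "تام"},
    {"base_meter": "المتدارك", "form": "تام"},
    {"base_meter": "الكامل", "form": "مجزوء"},
]
kept = [row for row in rows if keep_row(row)]
```

With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter`; whether the authors did it that way is not stated in the card.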
## Counts
- Source rows from `Shaer-AI/ashaar-with-descriptions-baseform-final-trimmed-maxlen20-drop-majzuu-wafer`: **117624**
- Removed by dropping `المتدارك`: **138**
- Final rows kept: **117486**
- Retention from upstream phase-1 subset: **99.88%**
## Training-Prep Note
- After the repo's current `load_and_prepare_dataset(...)` preprocessing, the upstream phase-1 dataset yields **117404** usable rows.
- This derivative yields **117266** usable rows after the same preparation path.
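The reported counts can be cross-checked with simple arithmetic. The variable names are illustrative; all numbers are taken from the Counts and Training-Prep sections.

```python
# Raw row counts from the Counts section.
source_rows = 117624
removed_rows = 138
final_rows = source_rows - removed_rows
retention_pct = 100 * final_rows / source_rows

# Usable rows after preprocessing, from the Training-Prep Note.
upstream_usable = 117404
derived_usable = 117266
usable_delta = upstream_usable - derived_usable

print(final_rows)
print(f"{retention_pct:.2f}%")
print(usable_delta)
```

The usable-row difference equals the number of removed rows, which is consistent with every dropped `المتدارك` row having been usable in the upstream dataset.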
## Important Meter-Reward Caveat
- The current meter reward is primarily a **base-meter correctness** signal.
- Form-sensitive meter realization is not directly validated when the classifier lacks that exact form label.
- This dataset change should therefore be understood as alignment with a **base-meter-first** reward, not as a claim that the dropped meter is impossible in general.
## Notes
- This is a derived dataset repo; upstream datasets are unchanged.
- The schema and column names are kept identical to the upstream dataset.
- This subset is intended for phase-1 GRPO experiments where the active meter-reward signal is more reliable on the retained base-meter set.
Provider:
Shaer-AI



