Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
Resource description:
---
language:
- ar
license: apache-2.0
pretty_name: Shaer GRPO Preprocess Meter Count V1
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---
# Shaer GRPO Preprocess: Meter + Count
This is a derived preprocessing dataset for future GRPO-side analysis and difficulty-aware sampling.
It is built from the `train` split of the released stratified split dataset, and it keeps one derived row per source row together with candidate-level meter/count scoring signals.
This repo is intentionally **not** a chosen-completion dataset:
- there is no selected / chosen completion column
- the main `train` table is a score-bearing dataset for relabeling and analysis
- the full scored backup artifact in this repo keeps the richer per-candidate records, including completion text
## Dataset lineage
Upstream datasets:
- parent dataset: `Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500`
- split dataset consumed here: `Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits`
- derived output dataset: `Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1`
The split dataset publishes deterministic `train / eval / test` splits with a `94 / 3 / 3` policy.
Original split policy inherited from the source split dataset:
- primary stratification key:
  - `base_meter`
  - `form`
  - `length_bucket`
- length buckets:
  - `1-3`
  - `4-6`
  - `7-10`
  - `11-20`
- small groups fall back gracefully when needed
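The bucket boundaries above can be written as a small lookup. This is a minimal sketch, not the split dataset's actual code: the function name `length_bucket` is hypothetical, and it assumes the bucketed length is the poem's bayt count, which this card does not state explicitly.

```python
def length_bucket(num_bayts: int) -> str:
    """Map a length in 1..20 to one of the four bucket labels listed above."""
    # (inclusive upper bound, label) pairs, checked in ascending order.
    buckets = [(3, "1-3"), (6, "4-6"), (10, "7-10"), (20, "11-20")]
    if num_bayts < 1:
        raise ValueError("length must be at least 1")
    for upper, label in buckets:
        if num_bayts <= upper:
            return label
    # No bucket is listed above 20, so longer inputs are rejected.
    raise ValueError("length above 20 has no bucket")
```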
Known counts in the source split dataset:
- `train`: **109070**
- `eval`: **3481**
- `test`: **3481**
This derived repo uses **train only**.
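As a quick arithmetic check, the counts above do correspond to the `94 / 3 / 3` policy. The snippet only uses the numbers quoted in this card; the variable names are illustrative.

```python
# Split sizes as quoted in this card.
split_counts = {"train": 109070, "eval": 3481, "test": 3481}

total = sum(split_counts.values())
# Percentage share of each split, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in split_counts.items()}
print(total, shares)
```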
## Run configuration
Generation model:
- base model: `Navid-AI/Yehia-7B-preview`
- adapter repo: `Shaer-AI/Shaer-adapters`
- adapter mode: `fresh_sft/train`
Preprocess configuration:
- source split: `train`
- rows assembled: **109070**
- missing scored rows at assembly time: **0**
- sampled candidates per row: **8**
- total candidate records: **872560**
- preprocess design: `meter_count_train_only_v1`
Difficulty counts in this release:
- `easy`: **33606**
- `hard`: **29649**
- `medium`: **45815**
Base-meter counts in this release:
- `البسيط`: **16563**
- `الخفيف`: **9263**
- `الرجز`: **4311**
- `الرمل`: **4316**
- `السريع`: **6758**
- `الطويل`: **26346**
- `الكامل`: **20375**
- `المتقارب`: **4877**
- `المجتث`: **1938**
- `المديد`: **504**
- `المنسرح`: **2579**
- `الهزج`: **573**
- `الوافر`: **10667**
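The figures in this section are mutually consistent: both breakdowns sum to the assembled row count, and the candidate total is rows times candidates per row. The check below uses only numbers quoted above; the variable names are illustrative.

```python
rows = 109070
candidates_per_row = 8

difficulty_counts = {"easy": 33606, "hard": 29649, "medium": 45815}
base_meter_counts = {
    "البسيط": 16563, "الخفيف": 9263, "الرجز": 4311, "الرمل": 4316,
    "السريع": 6758, "الطويل": 26346, "الكامل": 20375, "المتقارب": 4877,
    "المجتث": 1938, "المديد": 504, "المنسرح": 2579, "الهزج": 573,
    "الوافر": 10667,
}

# Each breakdown partitions the same set of rows.
assert sum(difficulty_counts.values()) == rows
assert sum(base_meter_counts.values()) == rows
# Every row carries exactly 8 candidate records.
assert rows * candidates_per_row == 872560
```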
## What the main `train` dataset contains
Every row preserves the source row identity plus derived preprocess outputs.
Important row-level fields:
- source identity: `row_uid`, `source_dataset_id`, `source_split`, `source_index`, `source_row_index_in_split`, `source_id`
- source context retained for downstream analysis: `base_meter`, `form`, `meter_label`, `requested_bayts`, `requested_lines`, `length_bucket`, `sampler_group`, `split_group`, `description`, `enhanced_description`, `sft_prompt`, `poem_url`
- row-level derived metrics: `difficulty`, `mean_meter_score`, `mean_count_adherence_score`, `num_strong_meter_candidates`, `num_strong_exact_candidates`, `num_bad_meter_candidates`, `strong_meter_rate`, `strong_exact_rate`, `bad_meter_rate`
- row-level aggregate summaries: `meter_summary`, `count_adherence_summary`, `meter_count_combined_summary`, `meter_count_success_rate`, `count_exact_rate`
- lightweight runtime provenance: `generator_shard_id`, `manifest_order`, `preprocess_metadata`
Each row also contains a `candidates` list. In the main `train` table each candidate keeps only the compact score-bearing fields needed for relabeling:
- `candidate_id`, `generation_index`
- `meter_score`, `meter_mean_score`, `meter_logmean_score`
- `meter_target_used`, `meter_target_resolution`
- `meter_num_valid_bayts`, `meter_num_skipped_bayts`
- `count_adherence_score`, `generated_bayts`, `count_requested_bayts`, `count_has_odd_tail`
- `finish_reason`, `token_count`, `cumulative_logprob`
To keep the published table smaller, the main dataset intentionally omits:
- candidate `completion` text
- `per_bayt_meter_scores`
- `per_bayt_meter_details`
- full `generator_metadata`
- full `scoring_metadata`
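To illustrate how the row-level metrics relate to the compact `candidates` list, the sketch below recomputes a few of them from candidate fields. It is not the release's own code. It assumes the `0.70` strong-meter threshold stated in this card's difficulty rule, treats an exact count as `generated_bayts == count_requested_bayts`, and assumes each rate is the matching count divided by the number of candidates. The bad-meter threshold is not stated in this card, so those fields are left out. The function name `row_metrics` is hypothetical.

```python
def row_metrics(candidates, strong_threshold=0.70):
    """Recompute a subset of row-level metrics from compact candidate records."""
    k = max(1, len(candidates))  # guard against an empty candidate list
    strong = [c for c in candidates if c["meter_score"] >= strong_threshold]
    strong_exact = [
        c for c in strong if c["generated_bayts"] == c["count_requested_bayts"]
    ]
    return {
        "mean_meter_score": sum(c["meter_score"] for c in candidates) / k,
        "mean_count_adherence_score":
            sum(c["count_adherence_score"] for c in candidates) / k,
        "num_strong_meter_candidates": len(strong),
        "num_strong_exact_candidates": len(strong_exact),
        "strong_meter_rate": len(strong) / k,    # assumed definition
        "strong_exact_rate": len(strong_exact) / k,  # assumed definition
    }
```

Comparing these values against the published row-level columns is a cheap way to confirm the assumed definitions before building on them.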
## Backup artifacts stored in this repo
This repo also stores side artifacts alongside the main dataset:
- `artifacts/scored_rows.jsonl.gz`: compressed raw scored rows backup
- `artifacts/assembly_summary.json`: assembly counts and publish summary
- `artifacts/run_config.json`: run configuration snapshot
- `artifacts/model_meta.json`: resolved model + adapter metadata
- `artifacts/difficulty_rule.json`: difficulty rule used for this release
The compressed scored backup keeps the richer row records that were not copied into the main table, including:
- candidate `completion` text
- `per_bayt_meter_scores`
- `per_bayt_meter_details`
- full `generator_metadata`
- full `scoring_metadata`
## Where the full 8 completions live
All `8` sampled completions per source row were pushed, but they live in the backup artifact rather than in the main `train` table.
Storage layout:
- the main `train` dataset keeps a compact `candidates` list with score-bearing fields only
- the full completion texts are stored in `artifacts/scored_rows.jsonl.gz`
- `artifacts/scored_rows.jsonl.gz` is a gzip-compressed JSONL file
- each JSON line corresponds to one source row
- each row contains `candidate_scores`, which for this release has length `8`
- each entry inside `candidate_scores` includes the candidate `completion` plus the richer scoring details
So if you want to inspect or rescore the exact generated poems later, `artifacts/scored_rows.jsonl.gz` is the file to load.
Current compressed scored backup size:
- **819.9 MiB** (859,750,212 bytes)
## Scoring design
This preprocessing redesign uses **meter + count only**.
Scorers used:
- **meter**: the BiLSTM meter scorer from this repo, using the requested meter/base-meter context
- **count adherence**: exact bayt-count adherence helper from this repo
Candidate-level scoring notes:
- `meter_score` is the main meter score
- `meter_mean_score` and `meter_logmean_score` are alternate aggregate views exposed per candidate
- `count_adherence_score = 1.0` when generated bayt count exactly matches the requested bayt count
- otherwise `count_adherence_score = max(0, 1 - abs(generated_bayts - requested_bayts) / max(1, requested_bayts))`
- row-level `meter_count_combined_summary` is computed over `0.8 * meter_score + 0.2 * count_adherence_score`
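The two formulas above translate directly into code. This is a minimal sketch of the stated arithmetic, not the repo's helper: the function names are hypothetical, and the statistics that `meter_count_combined_summary` reports over the combined values are not specified in this card.

```python
def count_adherence_score(generated_bayts: int, requested_bayts: int) -> float:
    """1.0 on an exact match, otherwise a linear penalty clipped at 0."""
    if generated_bayts == requested_bayts:
        return 1.0
    gap = abs(generated_bayts - requested_bayts)
    return max(0.0, 1.0 - gap / max(1, requested_bayts))


def combined_score(meter_score: float, count_score: float) -> float:
    """Per-candidate blend: 80% meter, 20% count adherence."""
    return 0.8 * meter_score + 0.2 * count_score
```

For example, with 4 requested bayts, a candidate with 3 bayts scores `0.75` on count adherence, while one with 9 bayts is clipped to `0.0`.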
This release does **not** use:
- `meaning_fit`
- `meaning_substance`
- any chosen / selected completion target
## Difficulty labeling rule used in this release
For `K=8` candidates per row:
- `strong_meter` means candidate `meter_score >= 0.70`
- `strong_exact` means candidate `meter_score >= 0.70` **and** exact bayt count
- `easy`: `num_strong_exact >= 4` and `mean_meter_score > 0.50`
- `hard`: `num_strong_meter == 0` or (`num_strong_meter <= 1` and `mean_meter_score < 0.30`)
- `medium`: everything else
The `difficulty` label is a convenience label, not a sacred frozen annotation.
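Since this dataset is intended for difficulty-aware sampling, one possible use of the label is weighted row sampling. The sketch below is an assumption about downstream use, not something this release implements: the function name `sample_by_difficulty` and the example weights are hypothetical, and `random.choices` samples with replacement.

```python
import random


def sample_by_difficulty(rows, weights, k, seed=0):
    """Draw k rows with replacement, weighting each row by its difficulty label."""
    rng = random.Random(seed)  # local RNG keeps the draw reproducible
    row_weights = [weights[row["difficulty"]] for row in rows]
    return rng.choices(rows, weights=row_weights, k=k)


# Toy rows standing in for dataset rows; only `difficulty` is needed here.
toy_rows = [
    {"row_uid": i, "difficulty": label}
    for i, label in enumerate(["easy", "medium", "hard"] * 10)
]
# Hypothetical weights: favour medium rows, down-weight easy ones.
batch = sample_by_difficulty(
    toy_rows, {"easy": 0.2, "medium": 1.0, "hard": 0.5}, k=16
)
```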
## How to relabel rows later with a new rule
You can relabel rows without rerunning generation.
### Option 1: relabel from the main published `train` dataset
Use the compact candidate score records already published in the main dataset:
```python
from datasets import load_dataset

repo_id = "Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1"
ds = load_dataset(repo_id, split="train")


def relabel(row):
    candidates = row["candidates"]
    num_strong_meter = sum(c["meter_score"] >= 0.70 for c in candidates)
    num_strong_exact = sum(
        c["meter_score"] >= 0.70 and c["count_adherence_score"] >= 0.999999
        for c in candidates
    )
    mean_meter = sum(c["meter_score"] for c in candidates) / max(1, len(candidates))
    if num_strong_exact >= 4 and mean_meter > 0.50:
        return "easy"
    if num_strong_meter == 0 or (num_strong_meter <= 1 and mean_meter < 0.30):
        return "hard"
    return "medium"


ds = ds.map(lambda row: {"difficulty_recomputed": relabel(row)})
```
Replace the thresholds above with your new rule.
### Option 2: relabel from the full scored backup artifact
If your new rule needs completion text or per-bayt details, read `artifacts/scored_rows.jsonl.gz`:
```python
import gzip
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1",
    repo_type="dataset",
    filename="artifacts/scored_rows.jsonl.gz",
)

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        # row['candidate_scores'] contains completion text and richer meter details
```
### Option 3: rescore the pushed completions with a new reward
Because the backup artifact stores the full candidate completions, you can compute a brand-new reward later without rerunning generation.
Example sketch:
```python
import gzip
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1",
    repo_type="dataset",
    filename="artifacts/scored_rows.jsonl.gz",
)


def new_reward(row, candidate):
    completion = candidate["completion"]
    description = row["description"]
    enhanced_description = row["enhanced_description"]
    # Replace this with your new reward logic.
    return 0.0


rescored_rows = []
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        new_scores = []
        for candidate in row["candidate_scores"]:
            score = new_reward(row, candidate)
            candidate = dict(candidate)
            candidate["new_reward_score"] = float(score)
            new_scores.append(candidate)
        row["candidate_scores"] = new_scores
        rescored_rows.append(row)
```
Typical follow-up flow:
- download `artifacts/scored_rows.jsonl.gz`
- compute a new per-candidate reward from `completion` and any row context you need
- store the new score beside the existing meter/count scores
- recompute row-level summaries and difficulty labels from the rescored candidate set
This means generation is reusable: if you want to test a new reward later, you can usually start from the pushed completions rather than rerunning the full preprocess.
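Because the backup holds all 872560 candidate records with completion text, collecting every rescored row in a Python list can use a lot of memory. A streaming variant of the follow-up flow is sketched below: it reads one row at a time, adds the new score, recomputes one row-level summary, and writes the result to a new gzip-compressed JSONL file. The function name `rescore_stream` and the output field `mean_new_reward_score` are hypothetical; only `candidate_scores` and `completion` come from the artifact layout described in this card.

```python
import gzip
import json


def rescore_stream(src_path, dst_path, reward_fn):
    """Rescore rows one at a time and write them to a new gzip JSONL file."""
    rows_written = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            rescored = []
            for candidate in row["candidate_scores"]:
                candidate = dict(candidate)  # avoid mutating the parsed record
                candidate["new_reward_score"] = float(reward_fn(row, candidate))
                rescored.append(candidate)
            row["candidate_scores"] = rescored
            # Hypothetical row-level summary over the new score.
            row["mean_new_reward_score"] = (
                sum(c["new_reward_score"] for c in rescored) / max(1, len(rescored))
            )
            # ensure_ascii=False keeps Arabic text readable in the output file.
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
            rows_written += 1
    return rows_written
```

A call such as `rescore_stream(path, "rescored_rows.jsonl.gz", new_reward)` would then process the downloaded artifact without holding it in memory; difficulty labels can be recomputed in the same loop if the new rule depends on the new score.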
## Local assembly summary
Machine-readable release summary:
```json
{
  "assembled_at": "2026-04-11T07:22:16Z",
  "repo_id": "Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1",
  "run_dir": "/root/workspace/Shaer/grpo/outputs/preprocess_meter_count/full_train_k8_20260409_192820",
  "manifest_rows": 109070,
  "assembled_rows": 109070,
  "missing_scored_rows": 0,
  "train_jsonl": "/root/workspace/Shaer/grpo/outputs/preprocess_meter_count/full_train_k8_20260409_192820/assembled/train.jsonl",
  "difficulty_counts": {
    "easy": 33606,
    "hard": 29649,
    "medium": 45815
  },
  "base_meter_counts": {
    "البسيط": 16563,
    "الخفيف": 9263,
    "الرجز": 4311,
    "الرمل": 4316,
    "السريع": 6758,
    "الطويل": 26346,
    "الكامل": 20375,
    "المتقارب": 4877,
    "المجتث": 1938,
    "المديد": 504,
    "المنسرح": 2579,
    "الهزج": 573,
    "الوافر": 10667
  },
  "candidate_count_min": 8,
  "candidate_count_max": 8,
  "candidate_count_sum": 872560,
  "dry_run": false,
  "push": true,
  "include_completions": false,
  "scored_backup_requested": true,
  "dataset_rows": 109070,
  "uploaded_files": [
    "README.md",
    "artifacts/assembly_summary.json",
    "artifacts/run_config.json",
    "artifacts/model_meta.json",
    "artifacts/difficulty_rule.json",
    "artifacts/scored_rows.jsonl.gz"
  ],
  "scored_backup_bytes": 859750212
}
```
Provider:
Shaer-AI



