
Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1

Source: Hugging Face | Updated: 2026-04-11 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1
Resource description:
---
language:
- ar
license: apache-2.0
pretty_name: Shaer GRPO Preprocess Meter Count V1
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---

# Shaer GRPO Preprocess: Meter + Count

This is a derived preprocessing dataset for future GRPO-side analysis and difficulty-aware sampling. It is built from the `train` split of the released stratified split dataset, and it keeps one derived row per source row together with candidate-level meter/count scoring signals.

This repo is intentionally **not** a chosen-completion dataset:

- there is no selected / chosen completion column
- the main `train` table is a score-bearing dataset for relabeling and analysis
- the full scored backup artifact in this repo keeps the richer per-candidate records, including completion text

## Dataset lineage

Upstream datasets:

- parent dataset: `Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500`
- split dataset consumed here: `Shaer-AI/ashaar-with-enhanced-descriptions-baseform-final-sft-lte20-min500-splits`
- derived output dataset: `Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1`

The split dataset publishes deterministic `train / eval / test` splits with a `94 / 3 / 3` policy.

Original split policy inherited from the source split dataset:

- primary stratification key:
  - `base_meter`
  - `form`
  - `length_bucket`
- length buckets:
  - `1-3`
  - `4-6`
  - `7-10`
  - `11-20`
- small groups fall back gracefully when needed

Known counts in the source split dataset:

- `train`: **109070**
- `eval`: **3481**
- `test`: **3481**

This derived repo uses **train only**.
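
The length buckets and the `94 / 3 / 3` policy described above can be illustrated with a small sketch. The helper `length_bucket` is hypothetical and not part of the released code, and it assumes the bucket is keyed on the bayt count, which the card does not state explicitly. The split shares are computed directly from the published row counts.

```python
# Hypothetical helper: maps a bayt count onto the length buckets listed in
# the split policy. Assumption: buckets are keyed on the number of bayts.
LENGTH_BUCKETS = [(1, 3), (4, 6), (7, 10), (11, 20)]


def length_bucket(num_bayts: int) -> str:
    """Return the bucket label (for example '4-6') for a bayt count."""
    for low, high in LENGTH_BUCKETS:
        if low <= num_bayts <= high:
            return f"{low}-{high}"
    raise ValueError(f"bayt count {num_bayts} is outside the 1-20 range")


# Row counts published for the source split dataset.
SPLIT_COUNTS = {"train": 109070, "eval": 3481, "test": 3481}

total_rows = sum(SPLIT_COUNTS.values())

# Share of each split in percent, rounded to one decimal place.
split_share = {
    name: round(100 * count / total_rows, 1)
    for name, count in SPLIT_COUNTS.items()
}

if __name__ == "__main__":
    print(length_bucket(5))
    print(total_rows, split_share)
```

The computed shares come out at the stated `94 / 3 / 3` proportions after rounding, so the published counts are consistent with the policy.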

## Run configuration

Generation model:

- base model: `Navid-AI/Yehia-7B-preview`
- adapter repo: `Shaer-AI/Shaer-adapters`
- adapter mode: `fresh_sft/train`

Preprocess configuration:

- source split: `train`
- rows assembled: **109070**
- missing scored rows at assembly time: **0**
- sampled candidates per row: **8**
- total candidate records: **872560**
- preprocess design: `meter_count_train_only_v1`

Difficulty counts in this release:

- `easy`: **33606**
- `hard`: **29649**
- `medium`: **45815**

Base-meter counts in this release:

- `البسيط`: **16563**
- `الخفيف`: **9263**
- `الرجز`: **4311**
- `الرمل`: **4316**
- `السريع`: **6758**
- `الطويل`: **26346**
- `الكامل`: **20375**
- `المتقارب`: **4877**
- `المجتث`: **1938**
- `المديد`: **504**
- `المنسرح`: **2579**
- `الهزج`: **573**
- `الوافر`: **10667**

## What the main `train` dataset contains

Every row preserves the source row identity plus derived preprocess outputs.

Important row-level fields:

- source identity: `row_uid`, `source_dataset_id`, `source_split`, `source_index`, `source_row_index_in_split`, `source_id`
- source context retained for downstream analysis: `base_meter`, `form`, `meter_label`, `requested_bayts`, `requested_lines`, `length_bucket`, `sampler_group`, `split_group`, `description`, `enhanced_description`, `sft_prompt`, `poem_url`
- row-level derived metrics: `difficulty`, `mean_meter_score`, `mean_count_adherence_score`, `num_strong_meter_candidates`, `num_strong_exact_candidates`, `num_bad_meter_candidates`, `strong_meter_rate`, `strong_exact_rate`, `bad_meter_rate`
- row-level aggregate summaries: `meter_summary`, `count_adherence_summary`, `meter_count_combined_summary`, `meter_count_success_rate`, `count_exact_rate`
- lightweight runtime provenance: `generator_shard_id`, `manifest_order`, `preprocess_metadata`

Each row also contains a `candidates` list.
In the main `train` table each candidate keeps only the compact score-bearing fields needed for relabeling:

- `candidate_id`, `generation_index`
- `meter_score`, `meter_mean_score`, `meter_logmean_score`
- `meter_target_used`, `meter_target_resolution`
- `meter_num_valid_bayts`, `meter_num_skipped_bayts`
- `count_adherence_score`, `generated_bayts`, `count_requested_bayts`, `count_has_odd_tail`
- `finish_reason`, `token_count`, `cumulative_logprob`

To keep the published table smaller, the main dataset intentionally omits:

- candidate `completion` text
- `per_bayt_meter_scores`
- `per_bayt_meter_details`
- full `generator_metadata`
- full `scoring_metadata`

## Backup artifacts stored in this repo

This repo also stores side artifacts alongside the main dataset:

- `artifacts/scored_rows.jsonl.gz`: compressed raw scored rows backup
- `artifacts/assembly_summary.json`: assembly counts and publish summary
- `artifacts/run_config.json`: run configuration snapshot
- `artifacts/model_meta.json`: resolved model + adapter metadata
- `artifacts/difficulty_rule.json`: difficulty rule used for this release

The compressed scored backup keeps the richer row records that were not copied into the main table, including:

- candidate `completion` text
- `per_bayt_meter_scores`
- `per_bayt_meter_details`
- full `generator_metadata`
- full `scoring_metadata`

## Where the full 8 completions live

Yes: all `8` sampled completions per source row were pushed, but they live in the backup artifact rather than the main `train` table.

Storage layout:

- the main `train` dataset keeps a compact `candidates` list with score-bearing fields only
- the full completion texts are stored in `artifacts/scored_rows.jsonl.gz`
- `artifacts/scored_rows.jsonl.gz` is a gzip-compressed JSONL file
- each JSON line corresponds to one source row
- each row contains `candidate_scores`, which for this release has length `8`
- each entry inside `candidate_scores` includes the candidate `completion` plus the richer scoring details

So if you want to inspect or rescore the exact generated poems later, `artifacts/scored_rows.jsonl.gz` is the file to load.

Current compressed scored backup size:

- **819.9 MB**

## Scoring design

This preprocessing redesign uses **meter + count only**.

Scorers used:

- **meter**: the BiLSTM meter scorer from this repo, using the requested meter/base-meter context
- **count adherence**: exact bayt-count adherence helper from this repo

Candidate-level scoring notes:

- `meter_score` is the main meter score
- `meter_mean_score` and `meter_logmean_score` are alternate aggregate views exposed per candidate
- `count_adherence_score = 1.0` when generated bayt count exactly matches the requested bayt count
- otherwise `count_adherence_score = max(0, 1 - abs(generated_bayts - requested_bayts) / max(1, requested_bayts))`
- row-level `meter_count_combined_summary` is computed over `0.8 * meter_score + 0.2 * count_adherence_score`

This release does **not** use:

- `meaning_fit`
- `meaning_substance`
- any chosen / selected completion target

## Difficulty labeling rule used in this release

For `K=8` candidates per row:

- `strong_meter` means candidate `meter_score >= 0.70`
- `strong_exact` means candidate `meter_score >= 0.70` **and** exact bayt count
- `easy`: `num_strong_exact >= 4` and `mean_meter_score > 0.50`
- `hard`: `num_strong_meter == 0` or (`num_strong_meter <= 1` and `mean_meter_score < 0.30`)
- `medium`: everything else

The `difficulty` label is a convenience label, not a sacred frozen annotation.

## How to relabel rows later with a new rule

You can relabel rows without rerunning generation.

### Option 1: relabel from the main published `train` dataset

Use the compact candidate score records already published in the main dataset:

```python
from datasets import load_dataset

repo_id = "Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1"
ds = load_dataset(repo_id, split="train")


def relabel(row):
    candidates = row["candidates"]
    num_strong_meter = sum(c["meter_score"] >= 0.70 for c in candidates)
    num_strong_exact = sum(
        c["meter_score"] >= 0.70 and c["count_adherence_score"] >= 0.999999
        for c in candidates
    )
    mean_meter = sum(c["meter_score"] for c in candidates) / max(1, len(candidates))
    if num_strong_exact >= 4 and mean_meter > 0.50:
        return "easy"
    if num_strong_meter == 0 or (num_strong_meter <= 1 and mean_meter < 0.30):
        return "hard"
    return "medium"


ds = ds.map(lambda row: {"difficulty_recomputed": relabel(row)})
```

Replace the thresholds above with your new rule.

### Option 2: relabel from the full scored backup artifact

If your new rule needs completion text or per-bayt details, read `artifacts/scored_rows.jsonl.gz`:

```python
import gzip
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1",
    repo_type="dataset",
    filename="artifacts/scored_rows.jsonl.gz",
)

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        # row['candidate_scores'] contains completion text and richer meter details
```

### Option 3: rescore the pushed completions with a new reward

Because the backup artifact stores the full candidate completions, you can compute a brand-new reward later without rerunning generation.

Example sketch:

```python
import gzip
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1",
    repo_type="dataset",
    filename="artifacts/scored_rows.jsonl.gz",
)


def new_reward(row, candidate):
    completion = candidate["completion"]
    description = row["description"]
    enhanced_description = row["enhanced_description"]
    # Replace this with your new reward logic.
    return 0.0


rescored_rows = []
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        new_scores = []
        for candidate in row["candidate_scores"]:
            score = new_reward(row, candidate)
            candidate = dict(candidate)
            candidate["new_reward_score"] = float(score)
            new_scores.append(candidate)
        row["candidate_scores"] = new_scores
        rescored_rows.append(row)
```

Typical follow-up flow:

- download `artifacts/scored_rows.jsonl.gz`
- compute a new per-candidate reward from `completion` and any row context you need
- store the new score beside the existing meter/count scores
- recompute row-level summaries and difficulty labels from the rescored candidate set

This means generation is reusable: if you want to test a new reward later, you can usually start from the pushed completions rather than rerunning the full preprocess.
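
For the last step of the flow above, the count-adherence and combined-score formulas from the scoring design section can be sketched in plain Python. This is a minimal illustration, not the repo's own helper: the function names are hypothetical, and using the mean as the row-level summary is an assumption, since the card does not say which statistics `meter_count_combined_summary` contains.

```python
def count_adherence_score(generated_bayts: int, requested_bayts: int) -> float:
    """Count adherence as described in the scoring design section."""
    if generated_bayts == requested_bayts:
        return 1.0
    gap = abs(generated_bayts - requested_bayts)
    return max(0.0, 1.0 - gap / max(1, requested_bayts))


def combined_score(meter_score: float, count_score: float) -> float:
    """Weighted mix used for the combined summary: 0.8 meter + 0.2 count."""
    return 0.8 * meter_score + 0.2 * count_score


def combined_mean(candidates: list, requested_bayts: int) -> float:
    """Mean combined score over a row's candidates.

    Assumption: each candidate dict carries `meter_score` and
    `generated_bayts`; the mean is only one possible row-level summary.
    """
    scores = [
        combined_score(
            c["meter_score"],
            count_adherence_score(c["generated_bayts"], requested_bayts),
        )
        for c in candidates
    ]
    return sum(scores) / max(1, len(scores))


if __name__ == "__main__":
    demo = [
        {"meter_score": 1.0, "generated_bayts": 4},
        {"meter_score": 0.5, "generated_bayts": 2},
    ]
    print(combined_mean(demo, requested_bayts=4))
```

A candidate that overshoots the requested count by more than the requested count itself is clamped to a count score of `0`, so the combined score then depends on the meter term alone.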

## Local assembly summary

Machine-readable release summary:

```json
{
  "assembled_at": "2026-04-11T07:22:16Z",
  "repo_id": "Shaer-AI/ashaar-enhanced-desc-baseform-final-sft-lte20-min500-splits-grpo-meter-count-v1",
  "run_dir": "/root/workspace/Shaer/grpo/outputs/preprocess_meter_count/full_train_k8_20260409_192820",
  "manifest_rows": 109070,
  "assembled_rows": 109070,
  "missing_scored_rows": 0,
  "train_jsonl": "/root/workspace/Shaer/grpo/outputs/preprocess_meter_count/full_train_k8_20260409_192820/assembled/train.jsonl",
  "difficulty_counts": {
    "easy": 33606,
    "hard": 29649,
    "medium": 45815
  },
  "base_meter_counts": {
    "البسيط": 16563,
    "الخفيف": 9263,
    "الرجز": 4311,
    "الرمل": 4316,
    "السريع": 6758,
    "الطويل": 26346,
    "الكامل": 20375,
    "المتقارب": 4877,
    "المجتث": 1938,
    "المديد": 504,
    "المنسرح": 2579,
    "الهزج": 573,
    "الوافر": 10667
  },
  "candidate_count_min": 8,
  "candidate_count_max": 8,
  "candidate_count_sum": 872560,
  "dry_run": false,
  "push": true,
  "include_completions": false,
  "scored_backup_requested": true,
  "dataset_rows": 109070,
  "uploaded_files": [
    "README.md",
    "artifacts/assembly_summary.json",
    "artifacts/run_config.json",
    "artifacts/model_meta.json",
    "artifacts/difficulty_rule.json",
    "artifacts/scored_rows.jsonl.gz"
  ],
  "scored_backup_bytes": 859750212
}
```
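
The figures in the summary above can be cross-checked against each other. The sketch below copies the relevant numbers into a plain dict (the base-meter counts are kept as a bare list of values for brevity) and verifies that the totals agree. The variable names are illustrative only.

```python
# Values copied from the assembly summary; keys mirror the JSON fields.
summary = {
    "assembled_rows": 109070,
    "candidate_count_min": 8,
    "candidate_count_max": 8,
    "candidate_count_sum": 872560,
    "difficulty_counts": {"easy": 33606, "hard": 29649, "medium": 45815},
    # Per-meter row counts, in the order they appear in the summary.
    "base_meter_values": [
        16563, 9263, 4311, 4316, 6758, 26346, 20375,
        4877, 1938, 504, 2579, 573, 10667,
    ],
    "scored_backup_bytes": 859750212,
}

rows = summary["assembled_rows"]
k = summary["candidate_count_max"]

checks = {
    # Every row carries exactly one difficulty label.
    "difficulty_total": sum(summary["difficulty_counts"].values()) == rows,
    # Every row belongs to exactly one base meter.
    "base_meter_total": sum(summary["base_meter_values"]) == rows,
    # Min and max agree, so every row has the same number of candidates.
    "uniform_k": summary["candidate_count_min"] == k,
    "candidate_total": summary["candidate_count_sum"] == rows * k,
}

# Backup size in units of 2**20 bytes, rounded to one decimal place.
backup_mib = round(summary["scored_backup_bytes"] / 2**20, 1)

if __name__ == "__main__":
    print(checks)
    print(backup_mib)
```

All four checks pass for this release. Dividing the byte count by 2**20 reproduces the 819.9 figure quoted for the compressed backup, which suggests that size is expressed in binary megabytes. The same checks could presumably be run on the dict loaded from `artifacts/assembly_summary.json`, using `base_meter_counts` in place of the value list.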
Provider:
Shaer-AI