zarnite/zarn-workspace-rag-qa
Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/zarnite/zarn-workspace-rag-qa
Resource description:
---
language:
- en
license: apache-2.0
annotations_creators:
- expert-generated
- machine-generated
language_creators:
- expert-generated
- machine-generated
source_datasets:
- original
task_categories:
- question-answering
- text-generation
tags:
- zarnite
- benchmark
- rag
- grounded-qa
- retrieval
- gold-track
- benchmark-starter
pretty_name: Zarn Workspace RAG QA
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.jsonl
  - split: validation
    path: data/validation.jsonl
  - split: test
    path: data/test.jsonl
---
# Zarn Workspace RAG QA
## Dataset Description
Each row pairs a small document bundle with a grounded answer and its supporting evidence, or with an explicit refusal when the needed context is missing.
## Team Attribution
This dataset was created and reviewed by the Zarnite team through internal benchmark design, generation, and quality-control workflows. It should be presented as a Zarnite-authored benchmark starter pack, not as a purely human-collected field corpus.
## Ecosystem Need Tier
High Ecosystem Need
## Why This Category Is Attractive
RAG systems commonly fail on groundedness and abstention, so retrieval benchmarks that include unsupported-claim traps and known context gaps are highly useful.
## Benchmark Goal
Evaluate grounded answering, citation precision, answerability judgment, and refusal quality under partial workspace context.
## Included In This Folder
- `data/train.jsonl`, `data/validation.jsonl`, `data/test.jsonl`: starter benchmark splits with 1,200 total rows across the three files.
- `schema.json`: JSON Schema for row validation.
- `benchmark_spec.json`: metrics, quality gates, and target release scale.
- `LICENSE.md`: folder-local license notice for self-contained publishing.
- `PUBLISHING.md`: repo-specific publish instructions for Hugging Face.
- `hf_repo_template.json`: machine-readable repo template used by the uploader script.
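To sanity-check the split files listed above, a minimal sketch along these lines may help. It uses only the Python standard library and assumes the `data/<split>.jsonl` layout from the front matter. The helper names `parse_jsonl` and `count_rows` are illustrative and are not part of this package.

```python
import json
from pathlib import Path

# Split names as declared in the dataset card's front matter.
SPLITS = ("train", "validation", "test")


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows


def count_rows(folder):
    """Map each split name to its row count; absent split files are skipped."""
    counts = {}
    for split in SPLITS:
        path = Path(folder) / "data" / f"{split}.jsonl"
        if path.exists():
            with open(path, encoding="utf-8") as handle:
                counts[split] = len(parse_jsonl(handle))
    return counts
```

Calling `sum(count_rows(folder).values())` on a local copy of this folder should give the 1200 starter rows stated above, if the files match the card.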
## Target Public Scale
- Train: 24,000
- Validation: 3,000
- Test: 3,000
- Total target rows: 30,000
## Recommended Metrics
- `answer_and_citation_f1`
- `answerability_accuracy`
- `unsupported_claim_rate`
- `citation_precision`
- `groundedness`
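The authoritative metric definitions live in `benchmark_spec.json` and `SCORING_PROFILE.json`. As a rough illustration only, two of the metrics could be computed as below. The formulas, the citation id format, and the boolean answerability flags are all assumptions made for this sketch, not the package's scoring code.

```python
def citation_precision(predicted, gold):
    """Share of predicted citation ids that also appear in the gold set.

    Assumption: citations are hashable ids (for example "doc1#sec2").
    Assumption: citing nothing scores 1.0 when the gold set is also empty
    (a correct refusal) and 0.0 otherwise.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted:
        return 1.0 if not gold else 0.0
    return len(predicted & gold) / len(predicted)


def answerability_accuracy(predicted_flags, gold_flags):
    """Fraction of rows where the answer-or-refuse decision matches gold.

    Assumption: each flag is True for "answer" and False for "refuse".
    """
    if len(predicted_flags) != len(gold_flags):
        raise ValueError("predicted and gold flags must have equal length")
    if not gold_flags:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted_flags, gold_flags))
    return matches / len(gold_flags)
```

For example, a prediction citing `["doc1#sec2", "doc3#sec1"]` against gold `["doc1#sec2"]` has a precision of 0.5 under this definition.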
## Gold-Track Benchmark Assets
- `ANNOTATION_GUIDELINES.md`: how to expand rows without drifting from the benchmark purpose.
- `REVIEW_PROTOCOL.md`: how to audit validation and test rows with dual review and adjudication.
- `BASELINE_EVAL_SPEC.json`: expected output contract, slice reporting, and release thresholds.
- `RELEASE_CHECKLIST.md`: final pre-publish checks for the public Hugging Face release.
- `SCORING_PROFILE.json`: prediction keys, scoring expectations, and slice reporting requirements.
- `prediction_template.jsonl`: starter template for benchmark submissions or baseline runs.
## Expanded Row Anatomy
- `knowledge_bundle`: multiple documents with sections, partial authority, and known gaps.
- `query_context`: who is asking and what type of retrieval task this is.
- `unsupported_claim_traps`: tempting details the model must not invent.
- `answerability`: whether the question should be answered or refused.
- `difficulty_rationale`: why the row belongs in its difficulty bucket instead of a weaker slice.
- `benchmark_slices`: named reporting slices such as approval friction, proof preservation, or citation traps.
- `adversarial_features`, `expected_failure_modes`, and `review_readiness`: what the row is testing and how a gold-track reviewer should treat it.
- `evidence_manifest`, `reference_variants`, and `negative_examples`: the source evidence boundary, acceptable alternate answers, and concrete failure cases.
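A quick completeness check over the fields listed above might look like the following sketch. It assumes each field is a top-level key of a row, which this card does not state explicitly; `schema.json` remains the authoritative contract. The helper name `missing_fields` is hypothetical.

```python
# Field names copied from the row anatomy list in this card.
EXPANDED_FIELDS = (
    "knowledge_bundle",
    "query_context",
    "unsupported_claim_traps",
    "answerability",
    "difficulty_rationale",
    "benchmark_slices",
    "adversarial_features",
    "expected_failure_modes",
    "review_readiness",
    "evidence_manifest",
    "reference_variants",
    "negative_examples",
)


def missing_fields(row):
    """Return the expanded-anatomy fields absent from a row dict, in list order.

    Assumption: the fields are top-level keys; nested layouts would need
    a different check.
    """
    return [name for name in EXPANDED_FIELDS if name not in row]
```

An empty result means the row carries every field named in this section; it says nothing about whether the values are valid.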
## Hugging Face Deployment
This folder is self-contained and can be uploaded as its own Hugging Face dataset repository.
- Suggested repo id: `zarnite/zarn-workspace-rag-qa`
- Example upload command: `python upload_to_huggingface.py --dataset-folder "push/high-ecosystem-need/Zarn-Workspace-RAG-QA" --repo-id "zarnite/zarn-workspace-rag-qa"`
- You can swap the namespace by passing `--namespace YOUR_USERNAME` to the uploader.
## Local Evaluation
- Example eval command: `python run_priority_eval.py --dataset-folder "push/high-ecosystem-need/Zarn-Workspace-RAG-QA" --splits validation test`
- `prediction_template.jsonl` gives the required output shape for local or leaderboard-style submissions.
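Before running the eval script, it can be useful to confirm that each prediction row has the same keys as the template. The sketch below assumes every template row is a flat JSON object sharing one key set, and that extra keys are worth flagging; neither is confirmed by this card. The key names in the usage note are hypothetical.

```python
import json


def template_keys(template_line):
    """Return the key set of one template row given as a JSON Lines string."""
    return set(json.loads(template_line))


def key_mismatches(prediction, required):
    """Return (missing, unexpected) key lists for one prediction dict.

    Assumption: a valid prediction has exactly the template's keys.
    """
    keys = set(prediction)
    return sorted(required - keys), sorted(keys - required)
```

With a hypothetical template line `{"id": "", "answer": "", "citations": []}`, a prediction holding `id`, `answer`, and `notes` would report `citations` as missing and `notes` as unexpected.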
## License
This package is marked `apache-2.0`. The rows in this folder are original starter examples for benchmark packaging.
Provider:
zarnite



