neogenesislab/korean-rag-ssot-golden-50

Name: neogenesislab/korean-rag-ssot-golden-50
Creator: neogenesislab
Published: 2026-04-28 02:58:11
License: 暂无描述

Hugging Face2026-04-28 更新2026-05-03 收录

下载链接：

https://hf-mirror.com/datasets/neogenesislab/korean-rag-ssot-golden-50

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: - ko - en license: cc-by-4.0 task_categories: - question-answering - text-retrieval tags: - rag - korean - evaluation - llm-as-judge - retrieval-augmented-generation - governance - security - agent-evaluation size_categories: - n<1K pretty_name: Korean RAG SSOT Golden 50 (Neo Genesis) configs: - config_name: default data_files: - split: train path: data/tasks.jsonl --- # Korean RAG SSOT Golden 50 (Neo Genesis) > A Korean-language **retrieval-augmented generation evaluation set** of 50 hand-curated tasks built by **[Neo Genesis](https://neogenesis.app)** for stress-testing real-world RAG agents on agent governance, autonomous trading, and security/PII redaction scenarios. ## Why this dataset exists Most Korean RAG benchmarks evaluate factual QA over Wikipedia-style corpora. Production agent systems instead need to retrieve from **operational SSOT** (Single Source of Truth) repositories — design docs, runbooks, code, incident logs, policy YAML — where the answer is grounded in human-authored governance text rather than encyclopedic facts. This benchmark was extracted from Neo Genesis' live SSOT (`/.agent/`) used to operate **11 production AI business units** (UR WRONG, ToolPick, ReviewLab, K-OTT, WhyLab, EthicaAI, FinStack, AIForge, SellKit, DeployStack, CraftDesk). ## Dataset summary - **50 tasks**, all in Korean - **5 task categories** (operational distribution from a real 1-person AI company): | Category | Count | |---|---| | `rag_v2_design` | 18 | | `quant_v11` | 8 | | `ssot_governance` | 12 | | `security_pii` | 6 | | `operations` | 6 | - **5 evaluation metrics** with explicit targets (primary = `recall_at_10`): | Metric | Target | |---|---| | `recall_at_10` (primary) | 0.85 | | `ndcg_at_10` | 0.75 | | `p95_latency_ms` | 500 | | `credential_leak_rate` | 0.0 | | `injection_quarantine_recall` | 0.95 | - **Provenance-aware**: every task carries `expected_source_type` (`human` / `llm_output` / `external_citation` / `tool_log`) so retrievers can be evaluated on whether they correctly trust human-authored chunks over synthetic ones. ## Schema Each task in `data/tasks.jsonl`: ```json { "id": "kor-001", "query": "Phase 0 Day 1 작업 중 Qdrant 컨테이너를 어디에 띄우는가?", "expected_collections": ["neo_ssot"], "expected_chunks": [ ".agent/knowledge/20260426_RAG_MASTER_DESIGN_v1.md", ".agent/knowledge/rag-master/08_rollout_24w.md" ], "expected_answer_substrings": ["ysh-server", "Qdrant", "docker run", "6333"], "expected_source_type": "human", "max_latency_ms": 800 } ``` ## Quick start ```python from datasets import load_dataset ds = load_dataset("neogenesislab/korean-rag-ssot-golden-50", split="train") print(ds[0]["query"]) # => "Phase 0 Day 1 작업 중 Qdrant 컨테이너를 어디에 띄우는가?" ``` Evaluate your retriever: ```python def recall_at_k(retrieved_paths, expected_chunks, k=10): return len(set(retrieved_paths[:k]) & set(expected_chunks)) / max(1, len(expected_chunks)) scores = [] for task in ds: retrieved = my_retriever(task["query"], top_k=10) scores.append(recall_at_k([r["path"] for r in retrieved], task["expected_chunks"])) print(f"Recall@10: {sum(scores)/len(scores):.3f}") ``` ## Evaluation protocol The recommended evaluation stack mirrors Neo Genesis' production setup: 1. **Hybrid retrieval**: BM25 (Korean tokenizer = `kiwipiepy` or `konlpy/mecab-ko`) + dense (KURE-v1 or `BAAI/bge-m3`) fused with **Reciprocal Rank Fusion (k=60)**. 2. **Cross-encoder rerank**: `BAAI/bge-reranker-v2-m3` (free, strong on Korean). 3. **Provenance decay**: down-weight retrieved chunks where `source_type != expected_source_type` by 0.5x (LLM-output) or 0.3x (tool_log noise). 4. **Latency budget**: enforce `max_latency_ms` hard cap per task; tasks that exceed it count as 0 even if recall is high. ## Comparison to prior work | Benchmark | Korean | Operational SSOT | Provenance-aware | Latency budget | |---|---|---|---|---| | KorQuAD 2.0 | ✓ | — | — | — | | KLUE-MRC | ✓ | — | — | — | | MIRACL-ko | ✓ | — | — | — | | **Korean RAG SSOT Golden 50** | ✓ | ✓ | ✓ | ✓ | ## Categories explained | Category | Description | |---|---| | `rag_v2_design` | Architecture decisions, collection topology, embedding model choice, governance YAML | | `quant_v11` | Autonomous-trading agent ensemble (6-alpha portfolio, kill-switch design) | | `ssot_governance` | SSOT hierarchy, runtime adapters, sync protocol across 6 devices | | `security_pii` | Korean PII redaction (RRN / passport / driver's license / KISA / 공인인증서) + PDF prompt-injection sanitization | | `operations` | Day-to-day fleet operations, device tier policy, incident response | ## Provenance - Source SSOT : Neo Genesis private `.agent/` repository - Curator : Yesol Heo (sole founder/operator, [neogenesis.app](https://neogenesis.app)) - Curation : v1 (10 tasks, 2026-04-27) → **v2 (50 tasks, 2026-04-27, this release)** - Wikidata : [Q139569680 (Neo Genesis)](https://www.wikidata.org/wiki/Q139569680) ## Citation ```bibtex @misc{neogenesis_korean_rag_ssot_golden_50, title = {Korean RAG SSOT Golden 50: An operational retrieval evaluation set in Korean}, author = {Heo, Yesol}, year = {2026}, url = {https://huggingface.co/datasets/neogenesislab/korean-rag-ssot-golden-50}, note = {Neo Genesis, an AI-native automation company running 11 live business units} } ``` ## License CC-BY-4.0 — free for research and commercial use with attribution to Neo Genesis. --- ## 한국어 요약 **Korean RAG SSOT Golden 50** 은 한국어로 운영 중인 1인 AI 자동화 기업 **[Neo Genesis](https://neogenesis.app)** 의 라이브 SSOT 에서 추출한 50개의 RAG 검색 평가 태스크다. 위키피디아 사실 QA가 아니라, **실제 운영 문서 (설계서 / 런북 / 정책 YAML / 인시던트 로그)** 에서 정답을 찾는 능력을 검증한다. 한국어 형태소 분석 (`kiwipiepy` / `mecab-ko`) + KURE 임베딩 + BGE Reranker v2-m3 조합 기준으로 Recall@10 ≥ 0.85 를 1차 게이트로 사용한다. 5개 카테고리: - `rag_v2_design` — RAG 아키텍처 의사결정 18건 - `quant_v11` — 자율매매 6-알파 앙상블 8건 - `ssot_governance` — SSOT 계층 / 런타임 동기화 12건 - `security_pii` — 한국어 개인정보 redaction 6건 - `operations` — 플릿 운영 6건 각 태스크는 `expected_source_type` (human / llm_output / external_citation / tool_log) 을 명시해, 검색기가 사람이 작성한 정답 청크를 LLM 출력보다 우선 신뢰하는지 검증한다 (provenance-aware retrieval). 라이선스 CC-BY-4.0 — 인용 시 자유롭게 사용 가능.

提供机构：

neogenesislab

5,000+

优质数据集

54 个

任务类型

进入经典数据集