BAEM1N/Korean-RAG-LLM-Judge-Benchmark

Name: BAEM1N/Korean-RAG-LLM-Judge-Benchmark
Creator: BAEM1N
Published: 2026-04-28 07:48:28
License: MIT

Source: Hugging Face. Updated 2026-04-28; indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/BAEM1N/Korean-RAG-LLM-Judge-Benchmark


---
license: mit
language:
- ko
size_categories:
- 100K<n<1M
task_categories:
- question-answering
- text-retrieval
tags:
- rag
- llm-as-judge
- korean
- benchmark
- cross-validation
- rrf
configs:
- config_name: default
  data_files:
  - split: qa
    path: data/qa.parquet
  - split: retrieval
    path: data/retrieval.parquet
  - split: cand_answers
    path: data/cand_answers.parquet
  - split: judge_scores
    path: data/judge_scores.parquet
---

# Korean RAG LLM-as-Judge Cross-Validation Benchmark

> A dataset that cross-validates Korean RAG answer quality across a **46 LLM × 9 judge** matrix. It is a controlled experiment: the same retrieval pipeline is laid over allganize/RAG-Evaluation-Dataset-KO, and **only the generator model and the judge model vary**.

## 🔗 Companion repos

| Resource | Location |
|---|---|
| **Dataset** (this repo: data, leaderboards, metadata) | HF: BAEM1N/Korean-RAG-LLM-Judge-Benchmark |
| **Toolkit** (judge runner, RRF, examples) | [GitHub: BAEM1N/korean-rag-llm-judge-toolkit](https://github.com/BAEM1N/korean-rag-llm-judge-toolkit) |
| **Methodology blog** | [baem1n.github.io](https://baem1n.github.io) (`rag-llm-judge-*` posts) |

## TL;DR

- **300 Q&A** (taken from allganize) × answers from **46 candidate LLMs** × scoring by **9 judge LLMs** = **456,000 judge calls** (129,600 in Q3 + 326,400 in Q4)
- Every candidate receives the **same retrieved chunks** (gemma-embed-300m, FAISS top-5), so differences between answers reflect LLM ability alone
- **RRF-fused ranking**: `gpt-oss_120b` ranks first in Q3, `gpt-5.4-pro` ranks first in Q4
- Notes on every fallback / supplemental step are included (for example, Opus 4.7 → 4.6 to get around safety refusals)

## Controlled variables (common to all quadrants)

| Stage | Fixed value | Rationale |
|---|---|---|
| Parser | `pymupdf4llm` | Phase 1 winner |
| Chunking | 500 chars / 100 overlap | Phase 2 winner |
| VectorStore | FAISS | Phase 3 winner |
| **Embedding** | **gemma-embed-300m (768d)** | Phase 4 rank 2 |
| Retrieval | top-5 cosine | Approximates the allganize original k=6 |

→ Same query → same retrieved chunks → all 46 candidates generate their answers from an identical context. Differences between answers reflect LLM answering ability alone.
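To make the chunking setting above concrete, here is a minimal sketch of a 500-character window with a 100-character overlap. It assumes a plain sliding window over the parsed text; the README does not say which splitter the pipeline actually uses, so `chunk_text` and its parameters are hypothetical names for illustration only.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Assumption: a plain sliding window. The real pipeline may split on
    separators or headings instead; this only illustrates the 500/100 setting.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # with the defaults, a new window starts every 400 chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical input: 1,200 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

Because consecutive windows share 100 characters, a sentence that straddles a window boundary still appears whole in at least one chunk when it is shorter than the overlap.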
## 4-Quadrant Matrix

```
                 Local judge (8)             API judge (9)
            ┌─────────────────────────┬─────────────────────────┐
 Local-gen  │  Q1 (115,200 calls)     │  Q3 (129,600 calls) ✅  │
 (12 LLM)   │  partial                │  100%                   │
            ├─────────────────────────┼─────────────────────────┤
 API-gen    │  Q2 (326,400 calls)     │  Q4 (326,400 calls) ✅  │
 (34 LLM)   │  partial                │  100%                   │
            └─────────────────────────┴─────────────────────────┘
```

This release contains the **Q3 + Q4** data (456,000 calls, 100% complete). Q1/Q2 are planned for a later release.

Each count is consistent with candidates × 300 questions × judges × 4 metrics: Q3 is 12 × 300 × 9 × 4 = 129,600, while the Q4 figure corresponds to 8 judges (34 × 300 × 8 × 4 = 326,400), matching the 34 × 8 cells of the Q4 leaderboard.

## RRF (Reciprocal Rank Fusion) unified ranking

```
RRF_score(c) = Σ_j 1 / (k + rank_j(c))      k = 60 (conventional)
```

RRF converts the rankings from multiple judges into a single consensus score. An outlier judge has less influence than it would under a simple average.

### Q3 (Local cand × API judge) Top 12

| Rank | Candidate | RRF |
|---|---|---|
| 🥇 | `gpt-oss_120b` | 0.1462 |
| 🥈 | `gpt-oss_20b` | 0.1445 |
| 🥉 | `qwen3.5_122b-a10b-q4_K_M_think` | 0.1441 |
| 4 | `qwen3.5_27b-q8_0_nothink` | 0.1411 |
| 5 | `qwen3.5_122b-a10b-q4_K_M_nothink` | 0.1387 |
| 6 | `exaone3.5_32b` | 0.1341 |
| 7 | `mistral-small_24b` | 0.1341 |
| 8 | `phi4_14b` | 0.1335 |
| 9 | `deepseek-r1_70b_nothink` | 0.1313 |
| 10 | `qwen3.5_9b-q4_K_M_nothink` | 0.1288 |
| 11 | `qwen3.5_9b-q8_0_nothink` | 0.1270 |
| 12 | `lfm2_24b` | 0.1250 |

### Q4 (API cand × API judge) Top 10

| Rank | Candidate | RRF |
|---|---|---|
| 🥇 | `gpt-5.4-pro` | 0.1296 |
| 🥈 | `gpt-5.4` | 0.1293 |
| 🥉 | `x-ai/grok-4.20` | 0.1273 |
| 4 | `gpt-5.4-mini` | 0.1224 |
| 5 | `moonshotai/kimi-k2.5` | 0.1220 |
| 6 | `moonshotai/kimi-k2.6` | 0.1194 |
| 7 | `claude-sonnet-4-6` | 0.1176 |
| 8 | `gemini-3-flash-preview` | 0.1163 |
| 9 | `claude-opus-4-7` | 0.1152 |
| 10 | `claude-sonnet-4-6-thinking` | 0.1132 |

Full ranking → `leaderboards/rrf_combined.csv`

## File structure

```
data/
├── qa.parquet                # 300 Q&A (from allganize)
├── retrieval.parquet         # 300 × top-5 chunks (gemma-embed-300m)
├── cand_answers.parquet      # 46 cand × 300 q = 13,800 rows
└── judge_scores.parquet      # long format, 456,000 rows
leaderboards/
├── q3_local-cand_api-judge.parquet  # 12 × 9 = 108 cells
├── q4_api-cand_api-judge.parquet    # 34 × 8 = 272 cells
└── rrf_combined.csv                 # final RRF ranking
metadata/
├── cand_models.json          # 46 LLM spec
├── judge_models.json         # 9 judge spec
└── pipeline.json             # parser/chunk/VS/embed settings
```

## Column specs

### `qa.parquet`

| Column | Type | Description |
|---|---|---|
| `qid` | string | `q000` to `q299` |
| `domain` | string | finance / public / medical / law / commerce |
| `question` | string | Question (Korean) |
| `target_answer` | string | Reference answer (allganize) |
| `target_file_name` | string | Source PDF |
| `target_page_no` | string | Page containing the answer |
| `context_type` | string | paragraph / table / image |

### `retrieval.parquet`

| Column | Type | Description |
|---|---|---|
| `qid` | string | |
| `embed_model` | string | `gemma-embed-300m` |
| `top_k` | int | 5 |
| `retrieved_files` | list[string] | top-5 source files |
| `retrieved_pages` | list[int] | top-5 source pages |
| `context_concatenated` | string | Context formed by concatenating the top-5 chunks |

### `cand_answers.parquet`

| Column | Type | Description |
|---|---|---|
| `qid` | string | |
| `cand_id` | string | LLM identifier (e.g. `gpt-5.4`, `qwen3.5_122b-a10b-q4_K_M_think`) |
| `cand_family` | string | `openai`, `anthropic`, `google`, `qwen3.5`, ... |
| `cand_size` | string | `120b`, `27b`, `api`, ... |
| `cand_quantization` | string | `Q4_K_M`, `Q8_0`, or an empty string for API models |
| `cand_runtime` | string | `local-llamacpp` / `api` |
| `generated_answer` | string | LLM answer |
| `input_tokens` | int? | |
| `output_tokens` | int? | |
| `latency_sec` | float? | |

### `judge_scores.parquet` (long format)

| Column | Type | Description |
|---|---|---|
| `qid` | string | |
| `cand_id` | string | |
| `judge_id` | string | judge LLM (e.g. `claude-sonnet-4-6`) |
| `metric` | string | `similarity` / `correctness` / `completeness` / `faithfulness` |
| `score` | int | 1–5 |
| `quadrant` | string | `Q3` (local cand) or `Q4` (api cand) |

## Scoring protocol (based on allganize)

- 4 metrics: `similarity`, `correctness`, `completeness`, `faithfulness`
- 1–5 point scale
- threshold = 4
- majority: if ≥2 of the 4 metrics score ≥4 → "O" (counted as correct)
- accuracy = number of "O" / 300

## Fallback / reprocessing notes

Part of the data went through the following processing (see `metadata/judge_models.json`):

- **Anthropic Opus 4.7 → Opus 4.6 fallback**: 52 cases in Q3 + ~128 cases in Q4. Opus 4.7 refused medical prompts such as `q142`/`q258` with `stop_reason: refusal` → all 11 workarounds (system disclaimer, adaptive thinking, backtick wrapping, etc.) failed → fell back to Opus 4.6. Rows with `judge_id = claude-opus-4-7` therefore include some 4.6 results.
- **Sonnet 4.6 retry**: with max_tokens=64 too small, some calls output only analysis text and no integer score. These were re-called with `max_tokens=1024` to fill in the missing scores.
- **Empty cand supplement**: for 16 cand-q pairs (kimi-k2.6/v4-pro and others), the first entry was an empty answer. The later retry entry was selected and re-evaluated.
- **gpt-5.4-pro q181/q223**: missing from the cand file (298/300). The answers were obtained by calling the Responses API directly and were then re-evaluated by 8 judges.

## Source Dataset Attribution

The 300 Q&A items and 58 PDFs in this dataset come from **[allganize/RAG-Evaluation-Dataset-KO](https://huggingface.co/datasets/allganize/RAG-Evaluation-Dataset-KO)** (MIT License).

This release adds the following on top of it:

- RAG answers from 46 LLMs (Phase 5 single-variable sweep)
- Cross-validation scores from 9 API judges (4 metrics × 1–5 scale)
- Controlled retrieval with gemma-embed-300m
- RRF-fused ranking

It preserves allganize's original questions and references while providing in-depth scoring data for LLM-as-judge research.
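The scoring protocol described earlier (threshold 4, "O" when at least 2 of the 4 metrics pass, accuracy = O / 300) can be sketched as follows. This is an illustration under assumptions, not the toolkit's code: it computes accuracy for one candidate under one judge, the helper names `is_correct` and `accuracy` are hypothetical, and the README does not state how verdicts from different judges are combined before ranking.

```python
from collections import defaultdict

THRESHOLD = 4    # a metric passes when its score is >= 4
MIN_PASSING = 2  # "O" requires at least 2 of the 4 metrics to pass


def is_correct(scores: dict[str, int]) -> bool:
    """Return True ("O") when enough metrics reach the threshold."""
    passing = sum(1 for s in scores.values() if s >= THRESHOLD)
    return passing >= MIN_PASSING


def accuracy(rows, n_questions: int = 300) -> float:
    """rows: (qid, metric, score) tuples for ONE candidate and ONE judge."""
    by_question = defaultdict(dict)
    for qid, metric, score in rows:
        by_question[qid][metric] = score
    n_ok = sum(is_correct(metrics) for metrics in by_question.values())
    return n_ok / n_questions


# Hypothetical scores for two questions (not taken from the dataset).
rows = [
    ("q000", "similarity", 5), ("q000", "correctness", 4),
    ("q000", "completeness", 2), ("q000", "faithfulness", 1),
    ("q001", "similarity", 4), ("q001", "correctness", 3),
    ("q001", "completeness", 3), ("q001", "faithfulness", 3),
]
print(accuracy(rows, n_questions=2))
```

In this toy input `q000` has two metrics at or above 4 and counts as "O", while `q001` has only one and does not.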
## Citation

```bibtex
@dataset{baem1n_korean_rag_judge_2026,
  title     = {Korean RAG LLM-as-Judge Cross-Validation Benchmark},
  author    = {BAEM1N},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/BAEM1N/Korean-RAG-LLM-Judge-Benchmark},
  note      = {Built on top of allganize/RAG-Evaluation-Dataset-KO}
}

@dataset{allganize_rag_eval_2024,
  title     = {RAG-Evaluation-Dataset-KO},
  author    = {Allganize},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/allganize/RAG-Evaluation-Dataset-KO}
}
```

## Usage example

```python
import pandas as pd
import pyarrow.parquet as pq
from datasets import load_dataset

# Load all splits (qa, retrieval, cand_answers, judge_scores)
ds = load_dataset("BAEM1N/Korean-RAG-LLM-Judge-Benchmark")

# Q3 leaderboard
lb = pd.read_csv("https://huggingface.co/datasets/BAEM1N/Korean-RAG-LLM-Judge-Benchmark/resolve/main/leaderboards/rrf_combined.csv")
print(lb[lb['quadrant'] == 'Q3'].head(5))

# Compare one candidate's answer with its judge scores
# (the relative paths assume the repo files have been downloaded locally)
cand = pq.read_table("data/cand_answers.parquet").to_pandas()
judges = pq.read_table("data/judge_scores.parquet").to_pandas()
q000_gpt = cand[(cand.qid == 'q000') & (cand.cand_id == 'gpt-5.4')]
q000_judges = judges[(judges.qid == 'q000') & (judges.cand_id == 'gpt-5.4')]
print(q000_gpt.iloc[0].generated_answer)
print(q000_judges.pivot(index='judge_id', columns='metric', values='score'))
```

## License

MIT (compatible with the allganize source).

## Changelog

- **2026-04-28** v0.1: Phase A + B release (Q3 + Q4, 100% complete). Q1/Q2 will follow in a later release.
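Beyond reading the precomputed `rrf_combined.csv`, the RRF formula from the ranking section can be reproduced in a few lines. The sketch below assumes each judge has already produced an ordered list of candidates (best first); how the benchmark derives each judge's ranking is not stated in the README, and the judge and candidate names here are hypothetical.

```python
from collections import defaultdict


def rrf(rankings: dict[str, list[str]], k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion.

    rankings maps judge_id -> candidates ordered best first, so the first
    element has rank 1. Each judge contributes 1 / (k + rank) per candidate.
    """
    scores = defaultdict(float)
    for ordered in rankings.values():
        for rank, cand in enumerate(ordered, start=1):
            scores[cand] += 1.0 / (k + rank)
    return dict(scores)


# Hypothetical rankings from three judges (not taken from the dataset).
rankings = {
    "judge_a": ["cand_x", "cand_y", "cand_z"],
    "judge_b": ["cand_y", "cand_x", "cand_z"],
    "judge_c": ["cand_x", "cand_z", "cand_y"],
}
fused = rrf(rankings)
leaderboard = sorted(fused, key=fused.get, reverse=True)
for cand in leaderboard:
    print(cand, round(fused[cand], 4))
```

With k = 60, a candidate ranked first by all 9 judges would score 9/61 ≈ 0.1475, and one ranked first by 8 judges would score 8/61 ≈ 0.1311, which bounds the scores reported in the Q3 and Q4 tables respectively.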
