likhithv/knowledgemesh-benchmark-eval

Name: likhithv/knowledgemesh-benchmark-eval
Creator: likhithv
Published: 2026-03-24 18:23:34
License: cc-by-4.0

Source: Hugging Face. Updated 2026-03-24. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/likhithv/knowledgemesh-benchmark-eval

Description:

---
license: cc-by-4.0
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- fine-tuning
- evaluation
- knowledge-graph
- benchmark
- medical
- financial
pretty_name: KnowledgeMesh Benchmark Eval Sets
size_categories:
- 1K<n<10K
---

# KnowledgeMesh Benchmark Evaluation Sets

Evaluation datasets from the paper **"Knowledge Graph-Guided Fine-Tuning Data Generation: A Rigorous Benchmark"**, a controlled study comparing KnowledgeMesh (KG-guided) and Meta Synthetic Data Kit (chunk-based) approaches for generating fine-tuning data.

## Dataset Files

| File | N | Source | Purpose |
|---|---|---|---|
| `km_test_473.jsonl` | 473 | KnowledgeMesh pipeline | Primary eval set (KM-generated, same pipeline as training data) |
| `independent_eval_955.jsonl` | 955 | Gemini 2.5 Flash | Independent eval set (different model, no KG structure, no stylistic bias) |

## Why Two Eval Sets?

The primary set (n=473) was generated by the KM pipeline; sharing KG traversal structure with the training data creates a structural style bias. The **independent set (n=955, Gemini-generated) is the primary claim**: it uses a different model family and a different generation style, and neither model has a stylistic advantage. See the paper for full methodology.

## Schema

Each line is a JSON object:

```json
{
  "messages": [
    {"role": "user", "content": "<question>"},
    {"role": "assistant", "content": "<reference answer>"}
  ],
  "domain": "financial | medical",
  "qa_type": "atomic | aggregated | multihop | chain_of_thought",
  "difficulty": "easy | medium | hard",
  "evidence_span": "<verbatim source text the answer is grounded in>"
}
```

The `independent_eval_955.jsonl` set includes `difficulty` and `evidence_span` fields. The `km_test_473.jsonl` set includes `domain` and `qa_type`.

## Source Documents

Questions are grounded in:

- **Financial**: Apple Inc. Form 10-K (fiscal year 2023)
- **Medical**: PubMed abstracts (multi-domain biomedical literature)

## Benchmark Results

| Model | Primary (n=473) | Independent (n=955) |
|---|---|---|
| Base (no fine-tuning) | 1.79 | 1.96 |
| Meta SDK (chunk-based) | 1.93 | 2.17 |
| **KnowledgeMesh** | **2.47** | **2.90** |
| **Delta (KM − Meta SDK)** | **+0.54** | **+0.72** |

Judge: Gemini 2.5 Flash, 4-dimension pointwise scoring (1–5), p < 0.0001, Cohen's d = 0.57 on independent set.

## Models

The LoRA adapters evaluated on these datasets:

- **KM fine-tuned**: [`likhithv/km-full-model`](https://huggingface.co/likhithv/km-full-model), trained on 4,361 KG-guided samples
- **Meta SDK baseline**: [`likhithv/meta-sdk-baseline`](https://huggingface.co/likhithv/meta-sdk-baseline), trained on 1,209 chunk-based samples

Both are LoRA adapters on top of `Qwen/Qwen3.5-4B`.

## Citation

```bibtex
@misc{knowledgemesh2026,
  title={Knowledge Graph-Guided Fine-Tuning Data Generation: A Rigorous Benchmark},
  author={Likhith V},
  year={2026},
  howpublished={https://huggingface.co/datasets/likhithv/knowledgemesh-benchmark-eval}
}
```
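For readers who want to work with the files described under Schema, the following is a minimal sketch of reading the JSONL records with the Python standard library. It is not part of the original card. The helper names `parse_jsonl` and `count_by` are hypothetical, and the two sample lines are invented placeholders rather than real dataset rows. Because the card says the two files carry different metadata fields, the sketch treats every metadata field as optional.

```python
import json
from collections import Counter


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into flat question/reference records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        obj = json.loads(line)
        # Keep the first message seen for each role in the chat-style list.
        by_role = {}
        for message in obj["messages"]:
            by_role.setdefault(message["role"], message["content"])
        records.append({
            "question": by_role["user"],
            "reference": by_role["assistant"],
            # Metadata differs between the two files, so missing keys become None.
            "domain": obj.get("domain"),
            "qa_type": obj.get("qa_type"),
            "difficulty": obj.get("difficulty"),
            "evidence_span": obj.get("evidence_span"),
        })
    return records


def count_by(records, field):
    """Count records per value of a metadata field, skipping records without it."""
    return Counter(r[field] for r in records if r[field] is not None)


# Invented placeholder lines (assumption: NOT real rows from the dataset).
sample_lines = [
    json.dumps({
        "messages": [
            {"role": "user", "content": "placeholder question A"},
            {"role": "assistant", "content": "placeholder answer A"},
        ],
        "domain": "financial",
        "qa_type": "atomic",
    }),
    json.dumps({
        "messages": [
            {"role": "user", "content": "placeholder question B"},
            {"role": "assistant", "content": "placeholder answer B"},
        ],
        "difficulty": "hard",
        "evidence_span": "placeholder evidence",
    }),
]

records = parse_jsonl(sample_lines)
```

Assuming a file such as `km_test_473.jsonl` has been downloaded locally, the same function can be applied to an open file handle, for example `parse_jsonl(open("km_test_473.jsonl", encoding="utf-8"))`, and `count_by(records, "qa_type")` then gives a quick breakdown by question type.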

Provider:

likhithv
