Ko-LongRAG
Source: ModelScope community. Updated 2025-12-23; indexed 2025-09-20.
Download link:
https://modelscope.cn/datasets/LGAI-EXAONE/Ko-LongRAG
<p align="center">
<img src="image.png" alt="Ko-LongRAG" width="50%">
</p>
## **Abstract**
The rapid advancement of large language models (LLMs) has significantly enhanced long-context Retrieval-Augmented Generation (RAG), yet existing benchmarks focus primarily on English. This leaves low-resource languages without comprehensive evaluation frameworks, limiting their progress in retrieval-based tasks. To bridge this gap, we introduce **Ko-LongRAG**, the first **Ko**rean **long**-context **RAG** benchmark. Unlike conventional benchmarks that depend on external retrievers, Ko-LongRAG adopts a retrieval-free approach designed around Specialized Content Knowledge (SCK), enabling controlled, high-quality QA pair generation without an extensive retrieval infrastructure. Our evaluation shows that the o1 model achieves the highest performance among proprietary models, while EXAONE 3.5 leads among open-source models. Additional findings support Ko-LongRAG as a reliable benchmark for assessing Korean long-context RAG capabilities and highlight its potential for advancing multilingual RAG research.
## **Dataset Details**
- **Composition**: 600 total examples
  - **singledocQA** (300): extraction-style QA grounded in a single document
  - **multidocQA** (300): comparison/bridge reasoning across documents within a domain cluster
- **Fields (schema)**: `id`, `titles` (list[str]), `context` (str), `question` (str), `answer` (str), `prompt` (str), `task` (str; `"singledocQA"` or `"multidocQA"`)
- **Context lengths (approx.)**: single ≈ 2,915 tokens; multi ≈ 14,092 tokens
- **Unanswerable share**: ≈ 16.6%
> For the construction protocol, prompts, human verification checks, and extended statistics, please refer to the accompanying paper and repository notes. This card follows the structure recommended in the Hugging Face dataset card documentation.
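The composition figures above can be checked with a few lines of plain Python. The sketch below is illustrative only: the helper name `summarize` and the toy rows are hypothetical, and it assumes that unanswerable examples carry the literal answer string `"unanswerable"`, which should be verified against the actual data.

```python
from collections import Counter

# Assumption: unanswerable rows use this literal answer string. The card only
# says the answer is "unanswerable" when appropriate, so verify on real data.
UNANSWERABLE = "unanswerable"


def summarize(rows):
    """Count examples per task and compute the share of unanswerable answers."""
    rows = list(rows)
    per_task = Counter(row["task"] for row in rows)
    n_unanswerable = sum(
        1 for row in rows if row["answer"].strip().lower() == UNANSWERABLE
    )
    share = n_unanswerable / len(rows) if rows else 0.0
    return {
        "total": len(rows),
        "per_task": dict(per_task),
        "unanswerable_share": share,
    }


# Hypothetical toy rows, for illustration only (not taken from the dataset).
toy_rows = [
    {"task": "singledocQA", "answer": "1997"},
    {"task": "singledocQA", "answer": "unanswerable"},
    {"task": "multidocQA", "answer": "Seoul"},
    {"task": "multidocQA", "answer": "Busan"},
]
print(summarize(toy_rows))
```

Because iterating over a loaded `datasets.Dataset` yields one dict per row, the same function can be applied to the dataset object directly. If the assumptions hold, the result should be close to the totals and the roughly 16.6% unanswerable share listed above.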
## **Usage**
```python
from datasets import load_dataset
ds = load_dataset("LGAI-EXAONE/Ko-LongRAG", split="test")
print(ds)
print(ds[0]["task"], ds[0]["question"])
```
## **Data Fields**
- `id` — unique identifier (string)
- `titles` — list of section titles included in the context (list[string])
- `context` — concatenated long passages (string)
- `question` — Korean question (string)
- `answer` — short answer string (or "unanswerable" when appropriate)
- `prompt` — prompt used during data creation/evaluation (string, optional)
- `task` — `"singledocQA"` or `"multidocQA"`
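
As a quick sanity check on the schema above, each row can be validated against the documented field names and types. This is a minimal sketch, not part of the dataset tooling: `validate_row`, `EXPECTED_FIELDS`, and the sample rows are hypothetical names introduced here, and `prompt` is treated as optional because the field list marks it so.

```python
# Field names and types as documented in this card.
EXPECTED_FIELDS = {
    "id": str,
    "titles": list,
    "context": str,
    "question": str,
    "answer": str,
    "prompt": str,
    "task": str,
}
VALID_TASKS = ("singledocQA", "multidocQA")


def validate_row(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in row:
            # The card marks `prompt` as optional, so its absence is tolerated.
            if name != "prompt":
                problems.append(f"missing field: {name}")
            continue
        if not isinstance(row[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    titles = row.get("titles")
    if isinstance(titles, list) and not all(isinstance(t, str) for t in titles):
        problems.append("titles: expected list of str")
    if "task" in row and row["task"] not in VALID_TASKS:
        problems.append(f"task: unexpected value {row['task']!r}")
    return problems


# Hypothetical rows for illustration only (not taken from the dataset).
good_row = {"id": "ex-0", "titles": ["t1"], "context": "c",
            "question": "q", "answer": "a", "task": "singledocQA"}
bad_row = {"id": "ex-1", "titles": "t1", "context": "c",
           "question": "q", "answer": "a", "prompt": "p", "task": "other"}
print(validate_row(good_row))
print(validate_row(bad_row))
```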
## **Benchmark Design (Brief)**
- **Domain-aware clustering** groups documents by topic/keywords to form long contexts suitable for QA.
- **Question generation** distinguishes extraction-style (single) from cross-document comparison/bridge (multi).
- **Quality control** uses a human checklist to validate question–answer–context consistency.
- **Unanswerable cases** are systematically included to assess reliability and calibration under retrieval failure.
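
Since unanswerable cases are included to probe reliability, it can help to report accuracy separately for answerable and unanswerable items. The sketch below uses normalized exact match as a stand-in metric. This card does not specify the metric used in the paper, so the `score` and `normalize` helpers, the normalization rules, and the `"unanswerable"` marker string are all assumptions made for illustration.

```python
import re

# Assumption: unanswerable references use this literal string.
UNANSWERABLE = "unanswerable"


def normalize(text):
    """Lowercase, collapse whitespace, and strip surrounding punctuation."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return text.strip(" .,!?\"'")


def score(predictions, references):
    """Exact-match accuracy, split into answerable and unanswerable items."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    buckets = {"answerable": [0, 0], "unanswerable": [0, 0]}  # [correct, total]
    for pred, ref in zip(predictions, references):
        key = "unanswerable" if normalize(ref) == UNANSWERABLE else "answerable"
        buckets[key][1] += 1
        buckets[key][0] += int(normalize(pred) == normalize(ref))
    # None marks a bucket with no items, to avoid dividing by zero.
    return {k: (c / t if t else None) for k, (c, t) in buckets.items()}


# Hypothetical predictions and references, for illustration only.
predictions = ["Seoul.", "unanswerable", "1998", "Busan"]
references = ["seoul", "Unanswerable", "1997", "unanswerable"]
print(score(predictions, references))
```

A model that answers confidently when the context lacks the answer would score well on the answerable bucket but poorly on the unanswerable one, which is the failure mode this split is meant to expose.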
## **License**
This dataset is released under **CC BY-NC 4.0** (Attribution–NonCommercial 4.0). Please review the Creative Commons BY-NC 4.0 terms before reuse.
**Additional Terms (model usage):**
This dataset was created using **OpenAI GPT-4o**. In addition to the license above, **the dataset is subject to OpenAI’s Terms of Use and related policies** governing use of model outputs. This means the dataset **must not be used to develop competing models** where such use conflicts with those terms.
> This dataset is licensed under CC BY-NC 4.0, and is subject to the Terms of Use of the model (OpenAI GPT-4o) used in its creation.
## **Citation**
```bibtex
@misc{KoLongRAG-2025,
title = {Ko-LongRAG: A Korean Long-Context RAG Benchmark Built with a Retrieval-Free Approach},
author = {Ko-LongRAG Authors},
year = {2025},
note = {Preprint},
}
```
## **Contact**
For questions or issues, please open an issue on the dataset repository or contact the maintainers.
Provider: maas
Created: 2025-09-19



