Lemoncoke/Marathon
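Dataset Overview
Basic Information
- Name: Marathon
- Language: English
- Creators: expert-generated and machine-generated
- License: MIT
- Multilinguality: monolingual
- Size: 1K<n<10K
- Tags: long context
- Task category: question-answering
- Task ID: open-domain-qa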
Dataset Configuration
- Config name: default
- Data files:
  - Split: test
  - Path: "marathon.json"
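
The configuration above points at a single `marathon.json` file for the test split. As a minimal sketch, assuming that a local copy of the file holds one JSON array of test cases (an assumption, not something the card states), the records could be loaded and tallied by their `type` field like this. The helper names `load_marathon` and `count_by_task` are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_marathon(path):
    """Load test cases from a local copy of marathon.json.

    Assumption: the file holds a single JSON array of records.
    """
    with Path(path).open(encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of test cases")
    return records


def count_by_task(records):
    """Count test cases per task, keyed by each record's "type" field."""
    return Counter(record["type"] for record in records)
```

Calling `count_by_task(load_marathon("marathon.json"))` would then give a quick view of how the test cases are spread across tasks.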
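Dataset Description
The Marathon benchmark is a new long-context multiple-choice benchmark, built mainly on LooGLE and including some original data from LongBench. Context lengths reach 200K+. The benchmark comprises six tasks: Comprehension and Reasoning, Multiple Information Retrieval, Timeline Reorder, Computation, Passage Retrieval, and Short Dependency Question Answering. Each test case consists of a long context, a question, and several candidate options. Large language models (LLMs) must select the correct answer from the given options based on the long context provided in the test.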
Data Instances
A test example looks like this:

    {
      "id": "7",
      "type": "comprehension_and_reasoning",
      "context": " Early life. Picardo was born in Jerez de la Frontera, in the Province of Cádiz in Andalucía, Spain on 18 June 1919. His father was Alvaro Picardo de Celis and his mother's family name was Castellón. He had four brothers, one of whom died in infancy. His father died in 1929 when Picardo was ten years old. With his mother and his brothers he moved to Madrid, Spain. [Truncated for display purpose] ",
      "question": "How many people were in Picardo's family when he was twelve?",
      "options": {
        "A": "five",
        "B": "eight",
        "C": "nine",
        "D": "ten"
      },
      "length": 268760
    }
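
To illustrate how a record like the one above might be turned into a model query, here is a minimal sketch. The prompt wording and the helper names `build_prompt` and `extract_choice` are assumptions made for illustration, not the prompt used by the benchmark authors. The answer parser is deliberately naive: it returns the first standalone capital letter among the option keys, so a reply that begins with the English article "A" would be misread.

```python
import re


def build_prompt(case):
    """Render one test case as a multiple-choice prompt (illustrative wording)."""
    options = "\n".join(
        f"{key}. {text}" for key, text in sorted(case["options"].items())
    )
    return (
        f"Context:\n{case['context'].strip()}\n\n"
        f"Question: {case['question']}\n"
        f"Options:\n{options}\n"
        "Answer with the letter of the correct option."
    )


def extract_choice(reply, letters="ABCD"):
    """Return the first standalone option letter in a model reply, or None."""
    match = re.search(rf"\b([{letters}])\b", reply)
    return match.group(1) if match else None
```

A stricter parser (for example, one that only accepts a reply consisting of a single letter) may be preferable when the model can be instructed to answer in a fixed format.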
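Optimization Methods and Embedding Models
- Optimization methods:
  - 🏐 Vanilla
  - 🎾 Retrieval-Augmented Generation (RAG)
  - 🏀 LongLLMLingua prompt compression (PC)
- Embedding models:
  - 🍿 OpenAI: text-embedding-ada-002
  - 🍔 Jina: Jina-Embedding-base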
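
The RAG setting listed above shortens the context before it reaches the model by keeping only the passages most relevant to the question. The sketch below shows that general idea under loud assumptions: the chunk size, the `top_k` value, and all function names are hypothetical, and the `embed` function is a toy bag-of-words stand-in for a real embedding model such as the two named above, not a call to either of them.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=200):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(context, question, top_k=3, size=200):
    """Return the top_k chunks most similar to the question, in document order."""
    chunks = chunk_words(context, size)
    query = embed(question)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: cosine(embed(chunks[i]), query),
        reverse=True,
    )
    return [chunks[i] for i in sorted(ranked[:top_k])]
```

The retrieved chunks would then replace the full context in the prompt, which is what lets models with 4K or 8K windows be evaluated on much longer inputs.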
Model Performance
| Tag | Model | Parameters | Context Window | Method | Embedding | Avg. Accuracy (%) |
|---|---|---|---|---|---|---|
| 🏐 | GPT-4 | - | 128K | 🏐 Vanilla | - | 78.59 |
| 🎾🍔 | Yi-chat | 34B | 200K | 🎾 RAG | 🍔 Jina | 63.81 |
| 🎾🍿 | Yi-chat | 34B | 200K | 🎾 RAG | 🍿 OpenAI | 63.56 |
| 🎾🍿 | Tulu2-DPO | 70B | 8K | 🎾 RAG | 🍿 OpenAI | 61.97 |
| 🎾🍔 | Tulu2-DPO | 70B | 8K | 🎾 RAG | 🍔 Jina | 61.52 |
| 🎾🍔 | Qwen | 14B | 8K | 🎾 RAG | 🍔 Jina | 58.12 |
| 🏐 | ChatGPT | - | 16K | 🏐 Vanilla | - | 57.37 |
| 🏐 | Yi-chat | 34B | 200K | 🏐 Vanilla | - | 55.91 |
| 🎾🍔 | Beluga2 | 70B | 4K | 🎾 RAG | 🍔 Jina | 55.72 |
| 🏐 | ChatGLM3 | 6B | 32K | 🏐 Vanilla | - | 55.05 |
| 🎾🍔 | Zephyr | 7B | 32K | 🎾 RAG | 🍔 Jina | 53.79 |
| 🎾🍿 | Qwen | 14B | 8K | 🎾 RAG | 🍿 OpenAI | 53.46 |
| 🏀 | Beluga2 | 70B | 4K | 🏀 PC | - | 52.29 |
| 🎾🍔 | Mistral | 7B | 32K | 🎾 RAG | 🍔 Jina | 52.04 |
| 🎾🍿 | Alfred | 40B | 8K | 🎾 RAG | 🍿 OpenAI | 51.35 |
| 🎾🍔 | Alfred | 40B | 8K | 🎾 RAG | 🍔 Jina | 51.24 |
| 🎾🍿 | ChatGLM3 | 6B | 32K | 🎾 RAG | 🍿 OpenAI | 50.99 |
| 🎾🍔 | ChatGLM3 | 6B | 32K | 🎾 RAG | 🍔 Jina | 50.60 |
| 🎾🍿 | Mistral | 7B | 32K | 🎾 RAG | 🍿 OpenAI | 50.18 |
| 🎾🍿 | Zephyr | 7B | 32K | 🎾 RAG | 🍿 OpenAI | 49.63 |
| 🏐 | Beluga2 | 70B | 4K | 🏐 Vanilla | - | 49.51 |
| 🏀 | Yi-chat | 34B | 200K | 🏀 PC | - | 48.66 |
| 🎾🍿 | Beluga2 | 70B | 4K | 🎾 RAG | 🍿 OpenAI | 48.24 |
| 🏀 | ChatGLM3 | 6B | 32K | 🏀 PC | - | 47.91 |
| 🏀 | Tulu2-DPO | 70B | 8K | 🏀 PC | - | 46.56 |
| 🏀 | Qwen | 14B | 8K | 🏀 PC | - | 44.12 |
| 🏐 | Mistral | 7B | 32K | 🏐 Vanilla | - | 39.81 |
| 🏐 | Qwen | 14B | 8K | 🏐 Vanilla | - | 39.27 |
| 🏀 | Alfred | 40B | 8K | 🏀 PC | - | 38.82 |
| 🏐 | Zephyr | 7B | 32K | 🏐 Vanilla | - | 37.97 |
| 🏐 | Tulu2-DPO | 70B | 8K | 🏐 Vanilla | - | 37.92 |
| 🎾🍔 | Longchat | 13B | 16K | 🎾 RAG | 🍔 Jina | 37.78 |
| 🏐 | Alfred | 40B | 8K | 🏐 Vanilla | - | 37.31 |
| 🏀 | Mistral | 7B | 32K | 🏀 PC | - | 37.01 |
| 🏐 | Longchat | 13B | 16K | 🏐 Vanilla | - | 35.87 |
| 🏀 | Longchat | 13B | 16K | 🏀 PC | - | 35.61 |
| 🏀 | Zephyr | 7B | 32K | 🏀 PC | - | 30.23 |
| 🎾🍿 | Longchat | 13B | 16K | 🎾 RAG | 🍿 OpenAI | 29.95 |
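
The table reports a single average accuracy per configuration. As a rough sketch of how such a figure could be computed, the code below scores predictions per task and then takes the unweighted mean over tasks. Both the unweighted-mean definition and the separate `answers` mapping are assumptions: the card does not state how the average is formed, and the example record shown in the card carries no answer field. The function names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_task(cases, predictions, answers):
    """Percentage of correct choices per task type.

    cases: records with "id" and "type" fields.
    predictions, answers: dicts mapping a case id to an option letter.
    A missing prediction counts as incorrect.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for case in cases:
        total[case["type"]] += 1
        if predictions.get(case["id"]) == answers[case["id"]]:
            correct[case["type"]] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}


def average_accuracy(per_task):
    """Unweighted mean over tasks (an assumed definition of the average)."""
    return sum(per_task.values()) / len(per_task)
```

If the average were instead weighted by the number of test cases per task, the two definitions would diverge whenever the tasks differ in size.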



