Name: vania-janet/mt-rag-benchmark-data
Creator: vania-janet
Published: 2026-03-03 05:42:09
License: No description available

Download link:

https://hf-mirror.com/datasets/vania-janet/mt-rag-benchmark-data


Description:

---
license: cc-by-4.0
task_categories:
- text-retrieval
language:
- en
tags:
- retrieval
- conversational-search
- RAG
- benchmark
- mt-rag
- faiss
- bm25
- splade
- dense-retrieval
- hybrid-retrieval
pretty_name: MT-RAG Benchmark — Task A Retrieval (Indices, Data & Experiments)
size_categories:
- 1B<n<10B
---

# MT-RAG Benchmark: Task A Retrieval

Complete artifacts of the retrieval system developed for Task A of the **MT-RAG** (Multi-Turn Retrieval-Augmented Generation) benchmark. The repository includes precomputed indices, processed corpora, rewritten queries, and the results of all experiments.

---

## Repository contents

```
indices/                          ← Precomputed retrieval indices
  {dataset}/                      Datasets: clapnq · cloud · fiqa · govt
    bge/                          FAISS (BGE-base-1.5)
    bge-m3/                       FAISS (BGE-M3)
    bm25/                         Serialized BM25 (.pkl)
    cohere/                       FAISS (Cohere embed-english-v3)
    splade/                       SPLADE matrix (.npz)
    voyage/                       FAISS (Voyage-large-2)

data/
  passage_level_processed/        ← Passage-level corpora in jsonl format (~428 MB)
    {clapnq|cloud|fiqa|govt}/
      corpus.jsonl
  retrieval_tasks/                ← Queries, qrels and tasks per dataset (~2.8 MB)
    {dataset}/
      *_tasks.jsonl
      *_questions.jsonl
      *_rewrite.jsonl
      *_lastturn.jsonl
      qrels/dev.tsv
  rewrites/                       ← Query rewriting versions
    cohere_v1/                    Command-R rewrite (v1)
    cohere_v2/                    Command-R rewrite (v2 / v4)
    cohere_v3/                    Command-R rewrite (v3 / v5) ← final submitted version
    cohere_v3_alt/                Alternative copy of cohere_v3
    own_improved/                 Own rewrite, improved
    own_local/                    Own rewrite (local model)
    own_replica/                  Own rewrite, replica
    hyde/                         HyDE (Hypothetical Document Embeddings)
    multi/                        Multi-query
  submissions/                    ← Retrieval results per experiment (~1.9 GB)
    baselines_rewrite/            Baselines with rewrite (BGE-1.5, BGE-M3, Voyage, SPLADE, BM25)
    baselines_fullhist/           Baselines with full history
    baselines_replication/        Replication of the original baselines
    hybrid_*/                     Hybrid SPLADE + dense experiments (multiple variants)
    rerank_*/                     Reranking experiments (BGE cross-encoder, Cohere)
    ablation_*/                   Ablations (fusion, top-k, RRF-k, rerank depth, etc.)

statistical_tests/                ← Full statistical tests
  results/
    statistical_report.json                  Concise statistical report
    statistical_validation_report.json       Full validation with Bootstrap + Wilcoxon
    statistical_summary_for_paper.txt        Summary for publication (baselines and hybrid)
    ablation_statistical_tests.json          Statistical tests for ablations (137 KB)
    ablation_statistical_summary.txt         Ablation summary
    per_query_scores.jsonl                   Per-query scores (main experiments)
    per_query_scores_all_experiments.jsonl   Per-query scores (all experiments)
    per_query_scores_ablations.jsonl         Per-query scores (ablations)
    production_report.json                   Final production report
    thesis_analyses_report.json              Analyses for the thesis
    error_analysis.json / error_examples.md  Qualitative error analysis
  scripts/
    run_ablation_statistical_tests.py        Script that generates ablation_statistical_*.json
    legacy_statistical_validation.py         Script that generates statistical_validation_report.json
    run_all_analyses.py                      Orchestrator for all analyses
```

---

## Datasets used

| Dataset | Domain                  | # Conversations | # Passages |
|---------|-------------------------|-----------------|------------|
| CLAPNQ  | Wikipedia (QA)          | ~1,500          | ~500,000   |
| Cloud   | Technical documentation | ~1,200          | ~60,000    |
| FiQA    | Finance                 | ~1,200          | ~57,000    |
| Govt    | Government (FDA/EPA)    | ~1,200          | ~90,000    |

---

## Retrieval models

| Type    | Model                     | Key in `indices/` |
|---------|---------------------------|-------------------|
| Dense   | BAAI/bge-base-en-v1.5     | `bge`             |
| Dense   | BAAI/bge-m3               | `bge-m3`          |
| Dense   | Cohere embed-english-v3.0 | `cohere`          |
| Dense   | voyage-large-2            | `voyage`          |
| Sparse  | naver/splade-v3           | `splade`          |
| Lexical | BM25 (rank-bm25)          | `bm25`            |

---

## Statistical tests (`statistical_tests/`)

Full statistical validation of the **0-baselines** and **02-hybrid** experiments, plus the ablations.
### Methods applied

| Method | Purpose |
|--------|---------|
| **Bootstrap CI** (10,000 iterations, seed=42) | 95% confidence intervals on nDCG@5 |
| **Wilcoxon signed-rank** (non-parametric) | Paired tests (Shapiro-Wilk rejected normality in 100% of cases, p < 1e-8) |
| **Holm-Bonferroni** (FWER) | Multiplicity correction across all sets of tests |
| **Cohen's d** | Effect size (negligible < 0.2, small < 0.5, medium < 0.8, large ≥ 0.8) |
| **Kendall τ** | Cross-domain agreement of rankings |

### Main results

| Hypothesis | Tests | Survive Holm | Conclusion |
|------------|-------|--------------|------------|
| H1: Hybrid > individual | 40 | **32/40** | ✅ Supported |
| Rewrite comparisons | 28 | 0/28 | ❌ Small effects (\|d\| < 0.21) |
| Degradation by turn | 24 | 2/24 | Partial |
| Cross-domain agreement | 6 | 1/6 | Partial |

Hybrid fusion significantly outperforms its individual components in 32 of 40 comparisons. The differences observed between rewrite strategies are small (|d| < 0.21), and none of the 28 comparisons remains statistically significant after multiplicity correction.
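The methods table above can be illustrated with a minimal standard-library sketch. It is not the code in `statistical_tests/scripts/`; the function names (`bootstrap_ci`, `cohens_d_paired`, `holm_bonferroni`) are hypothetical, the bootstrap is assumed to be a percentile bootstrap over paired per-query differences, and Cohen's d is assumed to be the paired variant (mean difference divided by its standard deviation). The Wilcoxon signed-rank test is left out for brevity; `scipy.stats.wilcoxon` provides one.

```python
import random
import statistics


def bootstrap_ci(deltas, iters=10_000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the mean of paired per-query differences."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the differences with replacement and record each resample mean.
    means = sorted(statistics.fmean(rng.choices(deltas, k=n)) for _ in range(iters))
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi


def cohens_d_paired(deltas):
    """Paired Cohen's d (assumed variant): mean difference over its sample std."""
    return statistics.fmean(deltas) / statistics.stdev(deltas)


def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure; returns one reject flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Toy values, not taken from the repository.
toy_deltas = [0.02, 0.05, -0.01, 0.04, 0.03, 0.00]
print(bootstrap_ci(toy_deltas))
print(cohens_d_paired(toy_deltas))
print(holm_bonferroni([0.001, 0.04, 0.03, 0.2]))
```

Here `deltas` would be the per-query metric differences between two systems, for example hybrid minus a single component, and the p-values passed to `holm_bonferroni` would come from the paired tests within one family of comparisons.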
## Experiments included in `data/submissions/`

### Baselines (`0-baselines`)

- `A0_*`: full history without rewrite (BM25, SPLADE)
- `A1_*`: full history without rewrite (BGE-M3, Voyage)
- `A2_*`: with Cohere v3 rewrite (BGE-1.5, BGE-M3, Voyage, SPLADE, BM25)
- `replication_*`: replications of the original MT-RAG baselines

### Hybrid (`02-hybrid`)

- `hybrid_splade_bge15_*`: SPLADE + BGE-1.5 (norewrite, rewrite, own, v2, v3)
- `hybrid_splade_voyage_*`: SPLADE + Voyage (norewrite, rewrite, own, v2, v3, hyde, multi)

### Rerank (`03-rerank`)

- BGE cross-encoder and Cohere rerank on top of hybrid SPLADE+BGE/Voyage

### Ablations (`06-12`)

- Fusion (RRF vs linear, α)
- Retrieval top-k (100, 200, 500)
- RRF k (1, 20, 40, 100)
- Reranking depth (50, 100, 200)
- Query mode (lastturn, fullhist, fullctx)
- Individual components
- Rewrite variants

---

## Metrics

Each experiment result includes `retrieval_results.jsonl` and `metrics.json` with:

- **MRR@10**, **NDCG@10** (main MT-RAG metrics)
- **Recall@100**, **MAP@10**

---

## Citation

If you use these artifacts, please cite the associated thesis work and the MT-RAG benchmark:

```bibtex
@dataset{janet2025mtrag,
  title   = {{MT-RAG} Benchmark Task A — Retrieval Artifacts},
  author  = {Vania Janet},
  year    = {2025},
  url     = {https://huggingface.co/datasets/vania-janet/mt-rag-benchmark-data},
  license = {CC-BY-4.0}
}
```
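The hybrid and fusion-ablation experiments listed above combine a sparse ranking (SPLADE) with a dense one using Reciprocal Rank Fusion (RRF), with the constant k varied over 1, 20, 40 and 100. As a rough illustration of that technique, and not the repository's implementation, the sketch below fuses ranked lists of document IDs. The function name `rrf_fuse`, the default `k=60` (a commonly used value that is not among the ablated ones) and the toy document IDs are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Reciprocal Rank Fusion over several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]


# Toy rankings (hypothetical document IDs, not from the corpora).
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
print(rrf_fuse([sparse_ranking, dense_ranking], k=40))
```

A smaller k gives more weight to the top ranks of each list, while a larger k flattens the contribution of rank position, which is the trade-off the RRF k ablation explores.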

