newmindai/contract-retrieval

Name: newmindai/contract-retrieval
Creator: newmindai
Published: 2026-01-23 14:56:37
License: No description available

Hugging Face | Updated: 2026-01-23 | Indexed: 2026-02-07

Download link:

https://hf-mirror.com/datasets/newmindai/contract-retrieval

Resource description:

This is a Turkish legal question-answer retrieval dataset structured in MTEB (Massive Text Embedding Benchmark) format. It has three core components: Queries (legal questions), Corpus (legal document segments drawn from a Revenue Sharing Agreement, an Energy Sales Agreement, and a Bank Account Pledge Agreement), and Default (the query-corpus mapping matrix). It contains 272 queries, 272 corpus entries, and 272 default entries. The queries are distributed across the three document types as follows: Revenue Sharing Agreement (57.7%), Energy Sales Agreement (23.2%), and Bank Account Pledge Agreement (19.1%). The dataset was generated with a multi-layered AI architecture: a Generator Layer (OpenAI GPT-4o-mini and Google Gemini 2.0 Flash for question generation), a Critic Layer (OpenAI GPT-4o for quality control), and a Fuser Layer (Google Gemini 2.5 Pro for merging results). The dataset is suitable for Turkish legal document retrieval systems, question-answering systems, embedding model evaluation, RAG (Retrieval-Augmented Generation) applications, and MTEB benchmark testing.
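The way the three components fit together can be illustrated with a minimal Python sketch. It is not taken from the dataset card. The field names (`_id`, `text`, `query-id`, `corpus-id`, `score`) follow common MTEB retrieval conventions and are assumptions here. The sample rows are invented English placeholders rather than real dataset entries, and the token-overlap scorer is a toy stand-in for an embedding model.

```python
from collections import Counter

# Toy stand-ins for the three components. Field names are ASSUMED from common
# MTEB retrieval conventions; the rows are invented placeholders, not real data.
queries = [
    {"_id": "q1", "text": "How is revenue shared between the parties?"},
    {"_id": "q2", "text": "Which bank account is pledged as security?"},
]
corpus = [
    {"_id": "d1", "text": "The parties share revenue in proportion to their contributions."},
    {"_id": "d2", "text": "The pledgor pledges the bank account as security for the loan."},
]
# "Default" component: which corpus entry is relevant to which query.
qrels = [
    {"query-id": "q1", "corpus-id": "d1", "score": 1},
    {"query-id": "q2", "corpus-id": "d2", "score": 1},
]


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return [tok.strip(".,?!").lower() for tok in text.split()]


def overlap_score(query_text, passage_text):
    """Count shared tokens (multiset intersection); a toy relevance score."""
    q, p = Counter(tokenize(query_text)), Counter(tokenize(passage_text))
    return sum((q & p).values())


def rank(query_text, k=1):
    """Return the ids of the k corpus entries with the highest overlap score."""
    ordered = sorted(
        corpus,
        key=lambda doc: overlap_score(query_text, doc["text"]),
        reverse=True,
    )
    return [doc["_id"] for doc in ordered[:k]]


def recall_at_k(k=1):
    """Mean fraction of each query's relevant entries found in its top k."""
    relevant = {}
    for row in qrels:
        if row["score"] > 0:
            relevant.setdefault(row["query-id"], set()).add(row["corpus-id"])
    per_query = []
    for query in queries:
        gold = relevant.get(query["_id"], set())
        if not gold:
            continue  # skip queries without any relevance judgment
        retrieved = set(rank(query["text"], k))
        per_query.append(len(retrieved & gold) / len(gold))
    return sum(per_query) / len(per_query)


if __name__ == "__main__":
    print("recall@1:", recall_at_k(1))
```

To evaluate an embedding model on the actual data, `overlap_score` would be replaced by a similarity between query and passage embeddings, and the three lists would be loaded from the dataset instead of being defined inline.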

Provider:

newmindai
