newmindai/contract-retrieval
Source: Hugging Face | Updated: 2026-01-23 | Indexed: 2026-02-07
Download link:
https://hf-mirror.com/datasets/newmindai/contract-retrieval
Description:
This is a Turkish legal question-answer retrieval dataset structured in the MTEB (Massive Text Embedding Benchmark) format. It has three components: Queries (legal questions), Corpus (segments of legal documents, namely Revenue Sharing Agreements, Energy Sales Agreements, and Bank Account Pledge Agreements), and Default (the query-to-corpus mapping matrix). Each component contains 272 entries: 272 queries, 272 corpus entries, and 272 default entries. The queries are drawn from three document types: Revenue Sharing Agreement (57.7%), Energy Sales Agreement (23.2%), and Bank Account Pledge Agreement (19.1%). The dataset was generated with a multi-layer AI architecture: a Generator Layer (OpenAI GPT-4o-mini and Google Gemini 2.0 Flash produce the questions), a Critic Layer (OpenAI GPT-4o performs quality control), and a Fuser Layer (Google Gemini 2.5 Pro merges the results). The dataset is suitable for Turkish legal document retrieval systems, question-answering systems, embedding model evaluation, RAG (Retrieval-Augmented Generation) applications, and MTEB benchmark testing.
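To illustrate how the three components fit together, here is a minimal sketch of an MTEB-style retrieval check. Everything in it is an assumption made for illustration: the ids (`q1`, `d1`), the English toy texts (the real data is Turkish), the dictionary layout, and the word-overlap scorer that stands in for an embedding model. It is not the dataset's actual schema or an official evaluation script.

```python
from collections import Counter

# Hypothetical stand-ins for the three components (ids and texts are invented).
queries = {
    "q1": "how is revenue shared between the parties",
    "q2": "what is the energy sales price",
    "q3": "which bank account is pledged",
}
corpus = {
    "d1": "revenue is shared between the parties in equal parts",
    "d2": "the energy sales price is fixed per kilowatt hour",
    "d3": "the bank account listed in annex one is pledged to the lender",
}
# "Default" mapping: for each query id, the relevant corpus ids with a relevance score.
qrels = {"q1": {"d1": 1}, "q2": {"d2": 1}, "q3": {"d3": 1}}


def score(query: str, doc: str) -> int:
    """Count overlapping tokens; a toy replacement for embedding similarity."""
    q_tokens = Counter(query.lower().split())
    d_tokens = Counter(doc.lower().split())
    return sum(min(count, d_tokens[token]) for token, count in q_tokens.items())


def recall_at_k(queries: dict, corpus: dict, qrels: dict, k: int = 1) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = 0
    for qid, text in queries.items():
        ranked = sorted(corpus, key=lambda did: score(text, corpus[did]), reverse=True)
        if any(did in qrels[qid] for did in ranked[:k]):
            hits += 1
    return hits / len(queries)


if __name__ == "__main__":
    print(f"recall@1 = {recall_at_k(queries, corpus, qrels, k=1):.2f}")
```

With one relevant passage per query, as the equal counts of 272 suggest, recall@k is a natural first metric. Swapping `score` for cosine similarity over embeddings gives the usual embedding-model evaluation.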
Provider:
newmindai



