freelawproject/opinions-synthetic-query-512

Name: freelawproject/opinions-synthetic-query-512
Creator: freelawproject
Published: 2025-03-07 00:22:17
License: No description available

Hugging Face: updated 2025-03-07, indexed 2025-04-19

Download link:

https://hf-mirror.com/datasets/freelawproject/opinions-synthetic-query-512

Resource description:

This dataset is curated and created by the Free Law Project and is drawn from the train split of the opinions-metadata dataset. It is intended for fine-tuning encoder models for semantic search with a 512-token context window. The data is split into train and dev sets with no overlap in opinion_id, cluster_id, docket_id, or docket_number. Each opinion is split into chunks of at most 480 tokens (tokenized with the bert-base-cased tokenizer), with a 2-sentence overlap between consecutive chunks. Each chunk is passed to GPT-4o with a system prompt to generate both relevant and irrelevant queries.
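
The chunking step described above can be sketched roughly as follows. This is a minimal illustration, not the Free Law Project's actual pipeline: the function names `chunk_sentences` and `whitespace_token_count` are hypothetical, the input is assumed to be already split into sentences (the description does not say how sentences were detected), and a whitespace word count stands in for the bert-base-cased tokenizer so the example runs without downloads.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words.

    The dataset description says bert-base-cased was used. With the
    `transformers` library installed, a counter along these lines could be
    swapped in:
        tok = AutoTokenizer.from_pretrained("bert-base-cased")
        count = lambda s: len(tok.tokenize(s))
    """
    return len(text.split())


def chunk_sentences(
    sentences: List[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = 480,
    overlap: int = 2,
) -> List[str]:
    """Greedily pack sentences into chunks of at most `max_tokens` tokens.

    Each new chunk starts with up to `overlap` trailing sentences of the
    previous chunk. A single sentence longer than `max_tokens` is emitted
    as its own oversized chunk rather than being split mid-sentence.
    """

    def total(group: List[str]) -> int:
        return sum(count_tokens(s) for s in group)

    chunks: List[List[str]] = []
    current: List[str] = []
    for sent in sentences:
        if current and total(current + [sent]) > max_tokens:
            chunks.append(current)
            # Carry over the last `overlap` sentences, dropping the oldest
            # ones if the carried text plus the new sentence would not fit.
            carry = current[-overlap:] if overlap > 0 else []
            while carry and total(carry + [sent]) > max_tokens:
                carry = carry[1:]
            current = carry + [sent]
        else:
            current = current + [sent]
    if current:
        chunks.append(current)
    return [" ".join(group) for group in chunks]


# Tiny example with a 7-token budget so the overlap is easy to see.
chunks = chunk_sentences(["a b c", "d e", "f g h", "i j"], max_tokens=7, overlap=2)
print(chunks)
```

In this toy run the second chunk repeats only one sentence of the first, because carrying both would exceed the budget; with a 480-token limit and typical sentence lengths, the full 2-sentence overlap would usually fit.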

Provider:

freelawproject
