freelawproject/opinions-synthetic-query-512

Source: Hugging Face. Updated: 2025-03-07. Indexed: 2025-04-19.
Download link:
https://hf-mirror.com/datasets/freelawproject/opinions-synthetic-query-512
Description:
This dataset was curated and created by the Free Law Project from the train split of the opinions-metadata dataset. It is intended for fine-tuning encoder models for semantic search with a 512-token context window. The data is split into train and dev sets with no overlap in opinion_id, cluster_id, docket_id, or docket_number. Each opinion is split into chunks of at most 480 tokens (counted with the bert-base-cased tokenizer), with a 2-sentence overlap between consecutive chunks. Each chunk is passed to GPT-4o with a system prompt to generate both relevant and irrelevant queries.
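The chunking step described above can be illustrated with a minimal sketch. This is not the Free Law Project's actual pipeline: the function names `split_sentences` and `chunk_sentences`, the regex-based sentence splitter, and the greedy packing strategy are all assumptions made for illustration. The token counter is passed in as a callable, so the example runs with a plain whitespace count; to approximate the description, one could instead pass a counter built on the bert-base-cased tokenizer.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (an assumption; the real pipeline may differ).

    Splits after '.', '!' or '?' when followed by whitespace.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(
    sentences: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 480,
    overlap: int = 2,
) -> List[str]:
    """Greedily pack sentences into chunks of at most `max_tokens` tokens.

    Each new chunk starts with the last `overlap` sentences of the previous
    chunk. A single sentence longer than `max_tokens` is kept as its own
    oversized chunk rather than being cut mid-sentence.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    fresh = 0  # sentences in `current` that were not carried over

    for sent in sentences:
        n = count_tokens(sent)
        if fresh > 0 and current_len + n > max_tokens:
            chunks.append(current)
            # Carry the trailing sentences over as overlap.
            current = current[-overlap:] if overlap > 0 else []
            current_len = sum(count_tokens(s) for s in current)
            fresh = 0
            # Drop carried-over sentences that leave no room for the new one.
            while current and current_len + n > max_tokens:
                current_len -= count_tokens(current.pop(0))
        current.append(sent)
        current_len += n
        fresh += 1

    if fresh > 0:
        chunks.append(current)
    return [" ".join(c) for c in chunks]


def whitespace_count(text: str) -> int:
    """Stand-in token counter used so the sketch runs without downloads."""
    return len(text.split())
```

To count tokens the way the description states, a counter could be built with `AutoTokenizer.from_pretrained("bert-base-cased")` from the `transformers` library, for example `lambda s: len(tok.tokenize(s))`. The 480-token limit leaves headroom under the 512-token window, presumably for special tokens and the query, though the source does not say so explicitly.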
Provider:
freelawproject