freelawproject/opinions-synthetic-query-512
Source: Hugging Face. Updated 2025-03-07. Indexed 2025-04-19.
Download link:
https://hf-mirror.com/datasets/freelawproject/opinions-synthetic-query-512
Description:
This dataset was curated and created by the Free Law Project from the train split of the opinions-metadata dataset. It is intended for fine-tuning encoder models for semantic search with a 512-token context window. The data is divided into train and dev splits with no overlap in opinion_id, cluster_id, docket_id, or docket_number. Each opinion is split into chunks of at most 480 tokens (as counted by the bert-base-cased tokenizer), with a 2-sentence overlap between consecutive chunks. Each chunk is passed to GPT-4o with a system prompt to generate both relevant and irrelevant queries.
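The chunking step described above can be illustrated with a minimal sketch. This is not the Free Law Project's actual pipeline: the function names, the regex-based sentence splitter, and the whitespace token counter are all assumptions made so the example runs without downloading a model. It only shows the general idea of packing sentences up to a token budget and carrying the last 2 sentences into the next chunk.

```python
import re
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_opinion(
    text: str,
    max_tokens: int = 480,
    overlap_sentences: int = 2,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Greedily pack sentences into chunks of at most max_tokens,
    repeating the last overlap_sentences sentences at the start of the next chunk.

    A single sentence longer than max_tokens is emitted as its own
    oversized chunk; this sketch does not split inside sentences.
    """
    sentences = split_sentences(text)
    chunks: List[str] = []
    start = 0
    while start < len(sentences):
        end = start
        total = 0
        # Add sentences while the budget allows; always take at least one.
        while end < len(sentences):
            n = count_tokens(sentences[end])
            if end > start and total + n > max_tokens:
                break
            total += n
            end += 1
        chunks.append(" ".join(sentences[start:end]))
        if end >= len(sentences):
            break
        # Step back for the overlap, but always advance by at least one sentence.
        start = max(end - overlap_sentences, start + 1)
    return chunks
```

To approximate the token counts described in the dataset card, `count_tokens` could be swapped for a function built on the Hugging Face tokenizer, for example `lambda s: len(tok.encode(s, add_special_tokens=False))` with `tok = AutoTokenizer.from_pretrained("bert-base-cased")`. Whether the original pipeline counted special tokens, or how it split sentences, is not stated in the description.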
Provider:
freelawproject



