laion/Caselaw_Access_Project_embeddings

Name: laion/Caselaw_Access_Project_embeddings
Creator: laion
Published: 2025-03-31 22:03:39
License: No description available

Hugging Face | Updated 2025-03-31 | Indexed 2024-12-14

Download link:

https://hf-mirror.com/datasets/laion/Caselaw_Access_Project_embeddings

Resource description:

This is an embeddings dataset for the Caselaw Access Project, created by a user named Endomorphosis. Each caselaw entry is hashed with IPFS / multiformats, so the document can be retrieved over the IPFS / Filecoin network. Embeddings were generated with three models: thenlper/gte-small, Alibaba-NLP/gte-large-en-v1.5, and Alibaba-NLP/gte-Qwen2-1.5B-instruct. These models have context lengths of 512, 8192, and 32k tokens respectively, and produce embeddings of 384, 1024, and 1536 dimensions. For each model, the embeddings are grouped into 4096 clusters, and both the centroid and the content ids of every cluster are provided. To search the embeddings on the client side, it would be wise to first query against the centroids, then retrieve the closest gte-small cluster, and then query within that cluster.
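
The two-stage client-side search described above can be sketched as follows. This is a minimal illustration, not the dataset's own tooling: the function names (`normalize`, `two_stage_search`), the in-memory layout (one centroid matrix plus per-cluster dictionaries), and the use of cosine similarity are all assumptions, and random toy vectors stand in for the real files, whose names and format are not stated here.

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis so a dot product is cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def two_stage_search(query, centroids, cluster_vectors, cluster_ids, top_k=5):
    """Return (content_id, score) pairs from the cluster nearest to the query.

    query:           (dim,) query embedding
    centroids:       (n_clusters, dim) array, one centroid per cluster
    cluster_vectors: dict mapping cluster index -> (n_items, dim) embeddings
    cluster_ids:     dict mapping cluster index -> list of content ids
    (This layout is hypothetical; adapt it to however the files are loaded.)
    """
    q = normalize(query)
    # Stage 1: pick the cluster whose centroid is most similar to the query.
    nearest = int(np.argmax(normalize(centroids) @ q))
    # Stage 2: rank only the members of that single cluster.
    scores = normalize(cluster_vectors[nearest]) @ q
    order = np.argsort(-scores)[:top_k]
    return [(cluster_ids[nearest][i], float(scores[i])) for i in order]


# Toy data: 8 clusters of 384-dimensional vectors (gte-small's dimension).
rng = np.random.default_rng(0)
dim, n_clusters, per_cluster = 384, 8, 20
centroids = normalize(rng.normal(size=(n_clusters, dim)))
cluster_vectors = {
    c: centroids[c] + 0.01 * rng.normal(size=(per_cluster, dim))
    for c in range(n_clusters)
}
cluster_ids = {
    c: [f"cid-{c}-{i}" for i in range(per_cluster)] for c in range(n_clusters)
}

# Query with a vector that is itself stored in cluster 3.
query = cluster_vectors[3][7]
results = two_stage_search(query, centroids, cluster_vectors, cluster_ids)
print(results[0][0])
```

With 4096 clusters, the first stage compares the query against 4096 centroids rather than every embedding, and the second stage touches only one cluster's vectors. A query near a cluster boundary may miss neighbours in adjacent clusters; probing the few closest centroids instead of one is a common way to trade speed for recall.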

Provided by:

laion
