justicedao/Caselaw_Access_Project_embeddings
Source: Hugging Face. Updated 2025-03-31; indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/justicedao/Caselaw_Access_Project_embeddings
Description:
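An embeddings dataset for the Caselaw Access Project, created by the user Endomorphosis. Each caselaw entry is hashed with IPFS / multiformats so that documents can be retrieved over the IPFS / Filecoin network. Embeddings were generated with three models (thenlper/gte-small, Alibaba-NLP/gte-large-en-v1.5, and Alibaba-NLP/gte-Qwen2-1.5B-instruct), which differ in context length and dimensionality. The embeddings are grouped into 4096 clusters, and the centroid and content IDs of each cluster are provided. For client-side search, the recommended approach is to query the centroids first, retrieve the closest gte-small cluster, and then query within that cluster.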
This is an embeddings dataset for the Caselaw Access Project, created by a user named Endomorphosis. Each caselaw entry is hashed with IPFS / multiformats, so the document can be retrieved over the IPFS / Filecoin network. The IPFS content ID (CID) is the primary key that links the dataset to the embeddings, should you want to retrieve documents from the dataset instead. Embeddings were generated with three models: thenlper/gte-small, Alibaba-NLP/gte-large-en-v1.5, and Alibaba-NLP/gte-Qwen2-1.5B-instruct. These models have context lengths of 512, 8192, and 32k tokens respectively, and produce embeddings with 384, 1024, and 1536 dimensions. For each model, the embeddings are grouped into 4096 clusters; the centroid of each cluster is provided, along with the content IDs belonging to it. To search the embeddings on the client side, first query against the centroids, then retrieve the closest gte-small cluster, and finally query within that cluster.
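The two-stage search described above (centroids first, then the closest cluster) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the dataset author's code: the function names, the in-memory layout (`centroids`, `cluster_embeddings`, `cluster_cids`), the toy sizes, and the use of cosine similarity are all assumptions made for the example. The actual file layout and similarity metric of the dataset are not specified here.

```python
import numpy as np


def normalize(x):
    """L2-normalize along the last axis so a dot product equals cosine similarity."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def two_stage_search(query, centroids, cluster_embeddings, cluster_cids, top_k=5):
    """Hypothetical helper: pick the nearest centroid, then rank that cluster's members."""
    q = normalize(query)

    # Stage 1: compare the query against every centroid (cheap: one row per cluster).
    nearest = int(np.argmax(normalize(centroids) @ q))

    # Stage 2: only the embeddings of the selected cluster need to be loaded and scored.
    scores = normalize(cluster_embeddings[nearest]) @ q
    order = np.argsort(-scores)[:top_k]
    hits = [(cluster_cids[nearest][i], float(scores[i])) for i in order]
    return nearest, hits


# Synthetic stand-in data (assumption): 8 clusters of 20 vectors with 384 dimensions,
# matching the gte-small embedding size. The real dataset has 4096 clusters per model.
rng = np.random.default_rng(0)
dim, n_clusters, per_cluster = 384, 8, 20
centroids = normalize(rng.normal(size=(n_clusters, dim)))
cluster_embeddings = [
    normalize(c + 0.01 * rng.normal(size=(per_cluster, dim))) for c in centroids
]
# Placeholder identifiers standing in for real IPFS CIDs.
cluster_cids = [[f"cid-{k}-{i}" for i in range(per_cluster)] for k in range(n_clusters)]

# Use an existing member as the query, so it should be its own best match.
query = cluster_embeddings[3][7]
cluster_id, hits = two_stage_search(query, centroids, cluster_embeddings, cluster_cids, top_k=3)
print(cluster_id, hits)
```

The point of the design is that a client only has to hold the centroids and one cluster in memory at a time, rather than the full embedding set. The returned identifiers play the role of the CIDs that link back to the caselaw documents.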
Provider:
justicedao



