
laion/Caselaw_Access_Project_embeddings

Source: Hugging Face. Updated: 2025-03-31. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/laion/Caselaw_Access_Project_embeddings
Description:

This is an embeddings dataset for the Caselaw Access Project, created by a user named Endomorphosis. Each caselaw entry is hashed with IPFS / multiformats, so the document can be retrieved over the IPFS / Filecoin network. Embeddings were generated with three models: thenlper/gte-small, Alibaba-NLP/gte-large-en-v1.5, and Alibaba-NLP/gte-Qwen2-1.5B-instruct. These models have context lengths of 512, 8192, and 32k tokens respectively, and produce embeddings of 384, 1024, and 1536 dimensions. For each model, the embeddings are grouped into 4096 clusters, and the centroid and the content IDs of each cluster are provided. To search the embeddings on the client side, first query against the centroids, then retrieve the closest gte-small cluster, and then query within that cluster.
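
The centroid-first search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dataset's own loader: the in-memory layout (a centroid array plus per-cluster vectors and content-ID lists), the function names, and the `cid-…` identifiers are all hypothetical, and the demo uses small synthetic data in place of the real 4096 clusters. Only the 384-dimensional size is taken from the gte-small description.

```python
import numpy as np


def normalize(x):
    """L2-normalize along the last axis so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def two_stage_search(query, centroids, cluster_vectors, cluster_ids, top_k=5):
    """Pick the nearest centroid, then rank only that cluster's members.

    centroids:       (n_clusters, dim) array
    cluster_vectors: dict, cluster index -> (n_members, dim) array   (assumed layout)
    cluster_ids:     dict, cluster index -> list of content IDs, same order
    """
    q = normalize(query)
    # Stage 1: find the cluster whose centroid is most similar to the query.
    best = int(np.argmax(normalize(centroids) @ q))
    # Stage 2: score only the members of that one cluster.
    scores = normalize(cluster_vectors[best]) @ q
    order = np.argsort(-scores)[:top_k]
    hits = [(cluster_ids[best][i], float(scores[i])) for i in order]
    return best, hits


# Synthetic stand-in data: 8 clusters of 10 vectors each, 384 dimensions.
rng = np.random.default_rng(0)
n_clusters, n_members, dim = 8, 10, 384
centroids = rng.normal(size=(n_clusters, dim)).astype(np.float32)
cluster_vectors = {
    c: centroids[c] + 0.05 * rng.normal(size=(n_members, dim)).astype(np.float32)
    for c in range(n_clusters)
}
# Placeholder identifiers; real entries would be IPFS content IDs.
cluster_ids = {c: [f"cid-{c}-{i}" for i in range(n_members)] for c in range(n_clusters)}

# Use a stored vector as the query so the expected best match is known.
query = cluster_vectors[3][2]
best, hits = two_stage_search(query, centroids, cluster_vectors, cluster_ids)
print(best, hits[0][0])
```

With this approach the client only needs the centroid table up front and downloads a single cluster per query, which is presumably why the description recommends it for client-side search.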
Provider:
laion