Caselaw_Access_Project_embeddings

Source: ModelScope community. Updated 2025-11-27. Indexed 2025-10-11.
Download link:
https://modelscope.cn/datasets/laion/Caselaw_Access_Project_embeddings
Description:
Original repository: https://huggingface.co/datasets/justicedao/Caselaw_Access_Project_embeddings/

This is an embeddings dataset for the Caselaw Access Project, created by a user named Endomorphosis. Each caselaw entry is hashed with IPFS / multiformats, so the document can be retrieved over the IPFS / Filecoin network. The IPFS content ID ("cid") is the primary key that links the dataset to the embeddings, should you want to retrieve from the dataset instead.

Embeddings were generated with three models: thenlper/gte-small, Alibaba-NLP/gte-large-en-v1.5, and Alibaba-NLP/gte-Qwen2-1.5B-instruct. These models have context lengths of 512, 8192, and 32k tokens respectively, and produce embeddings of 384, 1024, and 1536 dimensions respectively.

The embeddings are grouped into 4096 clusters. For each model, the centroid of each cluster is provided, along with the content IDs that belong to each cluster. To search the embeddings on the client side, it would be wise to first query against the centroids, then retrieve the closest gte-small cluster, and then query against that cluster.
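The two-stage client-side search described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's code: the function names (`normalize`, `two_stage_search`), the in-memory layout (a centroid matrix plus per-cluster dictionaries), and the use of cosine similarity are all assumptions, and the demo uses small synthetic arrays rather than the real files, whose names and schema are not stated here.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def two_stage_search(query, centroids, cluster_vectors, cluster_cids, top_k=5):
    """Search only the cluster whose centroid is closest to the query.

    Hypothetical layout (an assumption, not the dataset's actual schema):
      query:           (d,) embedding of the search text
      centroids:       (n_clusters, d) array, one centroid per cluster
      cluster_vectors: dict cluster_id -> (m, d) array of embeddings
      cluster_cids:    dict cluster_id -> list of m IPFS content ids
    """
    q = normalize(query)
    # Stage 1: compare the query against every centroid, keep the best cluster.
    best = int(np.argmax(normalize(centroids) @ q))
    # Stage 2: rank only the vectors that belong to that cluster.
    scores = normalize(cluster_vectors[best]) @ q
    order = np.argsort(-scores)[:top_k]
    return best, [(cluster_cids[best][i], float(scores[i])) for i in order]


# Synthetic stand-in data: 8 clusters of 20 vectors in 384 dimensions
# (384 matches gte-small; the real dataset has 4096 clusters).
rng = np.random.default_rng(0)
dim, n_clusters, per_cluster = 384, 8, 20
centroids = rng.normal(size=(n_clusters, dim))
cluster_vectors = {
    c: centroids[c] + 0.1 * rng.normal(size=(per_cluster, dim))
    for c in range(n_clusters)
}
cluster_cids = {
    c: [f"cid-{c}-{i}" for i in range(per_cluster)] for c in range(n_clusters)
}

# Use a stored vector as the query, so it should come back as its own top hit.
query = cluster_vectors[3][7]
cluster_id, hits = two_stage_search(
    query, centroids, cluster_vectors, cluster_cids, top_k=3
)
print(cluster_id, hits)
```

With 4096 clusters, stage 1 costs 4096 comparisons and stage 2 touches only one cluster's vectors, instead of scanning every embedding. The returned content IDs can then be used to fetch the matching documents, or to look up the same entries in the embeddings of the larger models.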

Provider:
maas
Created:
2025-10-03