HFforLegal/embedding-models
Source: Hugging Face. Updated 2024-07-22; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/HFforLegal/embedding-models
Resource description:
---
license: apache-2.0
dataset_info:
  features:
  - name: model
    dtype: string
  - name: query_prefix
    dtype: string
  - name: passage_prefix
    dtype: string
  - name: embedding_size
    dtype: int64
  - name: revision
    dtype: string
  - name: model_type
    dtype: string
  - name: torch_dtype
    dtype: string
  - name: max_length
    dtype: int64
  splits:
  - name: train
    num_bytes: 475
    num_examples: 5
  download_size: 4533
  dataset_size: 475
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- tabular-to-text
- tabular-classification
- sentence-similarity
- question-answering
language:
- en
tags:
- legal
- reference
- automation
- HFforLegal
pretty_name: Reference models for integration into HF for Legal
size_categories:
- n<1K
---
## Dataset Description
- **Repository:** https://huggingface.co/datasets/HFforLegal/embedding-models
- **Leaderboard:** N/A
- **Point of Contact:** [Louis Brulé Naudet](mailto:louisbrulenaudet@icloud.com)
# Reference models for integration into HF for Legal 🤗
This dataset lists reference embedding models together with the metadata needed to streamline and partially automate the embedding process. Each entry records the model identifier, the query and passage prefixes, the embedding size, and the loading parameters (revision, model type, torch dtype, and maximum input length), so that users can integrate a model into their workflow with minimal setup.
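As a minimal sketch of how these entries might be consumed, the snippet below loads the `train` split with the `datasets` library and keys each row by its `model` field. The helper names `load_reference_models` and `index_by_model` are illustrative and not part of any published API; the repository id is the one listed above.

```python
from typing import Any, Dict, Iterable, List


def load_reference_models(
    repo_id: str = "HFforLegal/embedding-models",
) -> List[Dict[str, Any]]:
    """Download the train split and return its rows as plain dicts."""
    # Imported lazily so that index_by_model stays usable without `datasets`.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split="train")
    return [dict(row) for row in dataset]


def index_by_model(rows: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map each model identifier to its configuration row."""
    return {row["model"]: row for row in rows}
```

Calling `index_by_model(load_reference_models())` needs network access and the `datasets` package; the resulting dictionary can then be looked up by model identifier.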
## Dataset Structure
| Field | Type | Description |
|-----------------|--------|-----------------------------------------------------------------------------|
| `model` | str | The identifier of the model, typically formatted as `organization/model-name`.|
| `query_prefix` | str | A prefix string added to query inputs to delineate them. |
| `passage_prefix`| str | A prefix string added to passage inputs to delineate them. |
| `embedding_size`| int | The dimensional size of the embedding vectors produced by the model. |
| `revision` | str | The specific revision identifier of the model to ensure consistency. |
| `model_type` | str | The architectural type of the model, such as `xlm-roberta` or `qwen2`. |
| `torch_dtype` | str | The data type utilized in PyTorch operations, such as `float32`. |
| `max_length` | int | The maximum input length the model can process, specified in tokens. |
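To illustrate how the fields in this table could be used, here is a small sketch that wraps one row in a dataclass and prepends the prefixes to queries and passages before they are sent to an encoder. The class `EmbeddingModelSpec`, its methods, and every value in `EXAMPLE_ROW` are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence

# Hypothetical row: these values are made up and do not come from the dataset.
EXAMPLE_ROW: Dict[str, Any] = {
    "model": "example-org/example-embedder",
    "query_prefix": "query: ",
    "passage_prefix": "passage: ",
    "embedding_size": 1024,
    "revision": "main",
    "model_type": "xlm-roberta",
    "torch_dtype": "float32",
    "max_length": 512,
}


@dataclass(frozen=True)
class EmbeddingModelSpec:
    model: str
    query_prefix: str
    passage_prefix: str
    embedding_size: int
    revision: str
    model_type: str
    torch_dtype: str
    max_length: int

    @classmethod
    def from_row(cls, row: Dict[str, Any]) -> "EmbeddingModelSpec":
        """Build a spec from a dataset row, treating missing prefixes as empty."""
        return cls(
            model=row["model"],
            query_prefix=row.get("query_prefix") or "",
            passage_prefix=row.get("passage_prefix") or "",
            embedding_size=int(row["embedding_size"]),
            revision=row["revision"],
            model_type=row["model_type"],
            torch_dtype=row["torch_dtype"],
            max_length=int(row["max_length"]),
        )

    def format_queries(self, queries: Sequence[str]) -> List[str]:
        """Prepend the query prefix to each query string."""
        return [self.query_prefix + q for q in queries]

    def format_passages(self, passages: Sequence[str]) -> List[str]:
        """Prepend the passage prefix to each passage string."""
        return [self.passage_prefix + p for p in passages]
```

Keeping the prefixes next to the model identifier means the same encoding code can serve every listed model, with only the row changing.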
### Organization architecture
To simplify the deployment of the organization's tools, we propose a simple architecture in which each dataset of legal or contractual texts is paired with datasets holding its precomputed embeddings for different models. This pairing simplifies index creation when a Space is initialized and provides ready-made vector data for the GPU-poor. A simplified representation might look like this:
<img src="https://huggingface.co/spaces/HFforLegal/README/resolve/main/assets/HF%20for%20Legal%20architecture%20for%20easy%20deployment.png">
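The index-creation step described in this section can be sketched without any GPU, assuming a paired dataset exposes one precomputed vector per document. The column names `id` and `embedding` and the helpers `build_index` and `search` are assumptions made for illustration; a real Space would probably rely on a dedicated vector library rather than this pure-Python cosine search.

```python
import math
from typing import Any, Dict, Iterable, List, Sequence, Tuple

Index = List[Tuple[str, List[float]]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(rows: Iterable[Dict[str, Any]], embedding_size: int) -> Index:
    """Collect (id, vector) pairs, checking each vector against embedding_size."""
    index: Index = []
    for row in rows:
        vector = [float(v) for v in row["embedding"]]
        if len(vector) != embedding_size:
            raise ValueError(
                f"expected {embedding_size} dimensions, got {len(vector)}"
            )
        index.append((str(row["id"]), vector))
    return index


def search(
    index: Index, query_vector: Sequence[float], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k document ids ranked by cosine similarity to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Passing the `embedding_size` value recorded for a model lets the index reject vectors that were produced by a different model.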
## Citing & Authors
If you use this dataset in your research, please use the following BibTeX entry.
```BibTeX
@misc{HFforLegal2024,
  author = {Louis Brulé Naudet},
  title = {Reference models for integration into HF for Legal},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/HFforLegal/embedding-models}},
}
```
## Feedback
If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).
Provider:
HFforLegal



