HFforLegal/embedding-models

Name: HFforLegal/embedding-models
Creator: HFforLegal
Published: 2024-07-22 10:56:24
License: apache-2.0

Source: Hugging Face. Updated 2024-07-22; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/HFforLegal/embedding-models


Resource description:

---
license: apache-2.0
dataset_info:
  features:
  - name: model
    dtype: string
  - name: query_prefix
    dtype: string
  - name: passage_prefix
    dtype: string
  - name: embedding_size
    dtype: int64
  - name: revision
    dtype: string
  - name: model_type
    dtype: string
  - name: torch_dtype
    dtype: string
  - name: max_length
    dtype: int64
  splits:
  - name: train
    num_bytes: 475
    num_examples: 5
  download_size: 4533
  dataset_size: 475
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- tabular-to-text
- tabular-classification
- sentence-similarity
- question-answering
language:
- en
tags:
- legal
- reference
- automation
- HFforLegal
pretty_name: Reference models for integration into HF for Legal
size_categories:
- n<1K
---

## Dataset Description

- **Repository:** https://huggingface.co/datasets/HFforLegal/embedding-models
- **Leaderboard:** N/A
- **Point of Contact:** [Louis Brulé Naudet](mailto:louisbrulenaudet@icloud.com)

# Reference models for integration into HF for Legal 🤗

This dataset lists a set of reference models, with the aim of streamlining and partially automating the embedding process. Each entry records the model identifier, its embedding configuration, and the parameters needed to run it, so that users can integrate these models into their workflows with minimal setup.

## Dataset Structure

| Field            | Type | Description                                                                     |
|------------------|------|---------------------------------------------------------------------------------|
| `model`          | str  | The identifier of the model, typically formatted as `organization/model-name`. |
| `query_prefix`   | str  | A prefix string added to query inputs to delineate them.                        |
| `passage_prefix` | str  | A prefix string added to passage inputs to delineate them.                      |
| `embedding_size` | int  | The dimensional size of the embedding vectors produced by the model.            |
| `revision`       | str  | The specific revision identifier of the model, to ensure consistency.           |
| `model_type`     | str  | The architectural type of the model, such as `xlm-roberta` or `qwen2`.          |
| `torch_dtype`    | str  | The data type used in PyTorch operations, such as `float32`.                    |
| `max_length`     | int  | The maximum input length the model can process, specified in tokens.            |

### Organization architecture

To simplify the deployment of the organization's various tools, we propose a simple architecture in which each dataset of legal and contractual texts is paired with datasets containing its embeddings for different models. This makes index creation easier when initializing Spaces and provides ready-made vector data for the GPU-poor.

A simplified representation might look like this:

<img src="https://huggingface.co/spaces/HFforLegal/README/resolve/main/assets/HF%20for%20Legal%20architecture%20for%20easy%20deployment.png">

## Citing & Authors

If you use this dataset in your research, please use the following BibTeX entry.

```BibTeX
@misc{HFforLegal2024,
  author = {Louis Brulé Naudet},
  title = {Reference models for integration into HF for Legal},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/HFforLegal/embedding-models}},
}
```

## Feedback

If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).
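## Example usage (sketch)

The fields listed under Dataset Structure can be consumed roughly as in the sketch below. It is an illustration under assumptions, not the maintainers' code: the `EmbeddingModelSpec` class, the `add_prefix` and `check_embedding` helpers, and every value in the sample row are hypothetical. In practice the rows would likely be read with `datasets.load_dataset("HFforLegal/embedding-models", split="train")` rather than written by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingModelSpec:
    """One dataset row, mirroring the fields in the Dataset Structure table."""

    model: str
    query_prefix: str
    passage_prefix: str
    embedding_size: int
    revision: str
    model_type: str
    torch_dtype: str
    max_length: int


def add_prefix(texts, prefix):
    """Prepend the model-specific prefix to every input text."""
    return [f"{prefix}{text}" for text in texts]


def check_embedding(spec, vector):
    """Return True when a vector has the dimensionality the row declares."""
    return len(vector) == spec.embedding_size


# Hypothetical row: these values are illustrative, not taken from the dataset.
spec = EmbeddingModelSpec(
    model="example-org/example-embedder",
    query_prefix="query: ",
    passage_prefix="passage: ",
    embedding_size=4,
    revision="main",
    model_type="xlm-roberta",
    torch_dtype="float32",
    max_length=512,
)

# Queries and passages receive different prefixes before being embedded.
queries = add_prefix(["What is a force majeure clause?"], spec.query_prefix)
passages = add_prefix(["A force majeure clause excuses performance."], spec.passage_prefix)
```

Keeping the prefixes and the expected vector size next to the model identifier means the same preparation code can serve every model in the table. Pinning `revision` when loading the model should keep the stored embeddings comparable over time.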

Provider:

HFforLegal
