lemone-docs-embedded

ModelScope community · Updated 2025-11-27 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/louisbrulenaudet/lemone-docs-embedded
Resource description:
## Dataset Description

- **Repository:** https://huggingface.co/datasets/louisbrulenaudet/lemone-docs-embedded
- **Point of Contact:** [Louis Brulé Naudet](mailto:louisbrulenaudet@icloud.com)

<img src="assets/thumbnail.webp">

# Lemone-embedded, pre-built embeddings dataset for French taxation.

<div class="not-prose bg-gradient-to-r from-gray-50-to-white text-gray-900 border" style="border-radius: 8px; padding: 0.5rem 1rem;">
<p>This database presents the embeddings generated by the Lemone-embed-pro model and aims at a large-scale distribution of the model even for the GPU-poor.</p>
</div>

This sentence transformers model, specifically designed for French taxation, has been fine-tuned on a dataset comprising 43 million tokens, integrating a blend of semi-synthetic and fully synthetic data generated by GPT-4 Turbo and Llama 3.1 70B, which have been further refined through evol-instruction tuning and manual curation.

The model is tailored to meet the specific demands of information retrieval across large-scale tax-related corpora, supporting the implementation of production-ready Retrieval-Augmented Generation (RAG) applications. Its primary purpose is to enhance the efficiency and accuracy of legal processes in the taxation domain, with an emphasis on delivering consistent performance in real-world settings, while also contributing to advancements in legal natural language processing research.

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-multilingual-base. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Usage with ChromaDB

We recommend integration via a vector-store to produce an optimal RAG pipeline.
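To make concrete what a vector store does with these 768-dimensional vectors, here is a minimal sketch of similarity ranking in plain Python. It assumes cosine similarity as the measure (one common choice; a given store may be configured for a different distance), and the function names `cosine_similarity` and `top_k` as well as the toy 3-dimensional vectors are illustrative only, not part of the dataset or of any library.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    embeddings: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (row index, score) pairs for the k most similar embeddings."""
    scored = [(i, cosine_similarity(query, e)) for i, e in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors stand in for the real 768-dimensional embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query = [1.0, 0.1, 0.0]

ranking = top_k(toy_query, toy_embeddings, k=2)
print(ranking)  # row 0 ranks first, row 2 second
```

With the real data, the rows of the `lemone_pro_embeddings` column would play the role of `toy_embeddings`, and the query vector would come from the same embedding model. A dedicated vector store does the same ranking with indexing, which is why it is preferable at scale.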
Here's a code extract for producing such a database with ChromaDB:

```python
import chromadb
import polars as pl
from chromadb.config import Settings
from chromadb.utils import embedding_functions
from torch.cuda import is_available

# Persist the vector store on disk so it can be reopened later.
client = chromadb.PersistentClient(
    path="./chroma.db",
    settings=Settings(anonymized_telemetry=False)
)

# Embedding function attached to the collection; the documents added below
# already carry pre-computed embeddings.
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="louisbrulenaudet/lemone-embed-pro",
    device="cuda" if is_available() else "cpu",
    trust_remote_code=True
)

collection = client.get_or_create_collection(
    name="tax",
    embedding_function=sentence_transformer_ef
)

# Load the pre-built embeddings and drop rows without text.
dataframe = pl.scan_parquet(
    "hf://datasets/louisbrulenaudet/lemone-docs-embedded/data/train-00000-of-00001.parquet"
).filter(
    pl.col("text").is_not_null()
).collect()

# Every column other than the embedding and the text is stored as metadata.
collection.add(
    embeddings=dataframe["lemone_pro_embeddings"].to_list(),
    documents=dataframe["text"].to_list(),
    metadatas=dataframe.drop(["lemone_pro_embeddings", "text"]).to_dicts(),
    ids=[str(i) for i in range(0, dataframe.shape[0])]
)
```

Here is the code to reproduce this dataset:

```python
import hashlib
from datetime import datetime
from typing import List

import chromadb
import polars as pl
from chromadb.config import Settings
from chromadb.utils import embedding_functions
from torch.cuda import is_available

client = chromadb.Client(
    settings=Settings(anonymized_telemetry=False)
)

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="louisbrulenaudet/lemone-embed-pro",
    device="cuda" if is_available() else "cpu",
    trust_remote_code=True
)

collection = client.get_or_create_collection(
    name="tax",
    embedding_function=sentence_transformer_ef
)

# BOFiP: add a fixed title, normalise the validity date, and hash the content.
bofip_dataframe = pl.scan_parquet(
    "hf://datasets/louisbrulenaudet/bofip/data/train-00000-of-00001.parquet"
).with_columns(
    [
        (
            pl.lit("Bulletin officiel des finances publiques - impôts").alias(
                "title_main"
            )
        ),
        (
            pl.col("debut_de_validite")
            .str.strptime(pl.Date, format="%Y-%m-%d")
            .dt.strftime("%Y-%m-%d 00:00:00")
        ).alias("date_publication"),
        (
            pl.col("contenu")
            .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)
            .alias("hash")
        )
    ]
).rename(
    {
        "contenu": "text",
        "permalien": "url_sourcepage",
        "identifiant_juridique": "id_sub",
    }
).select(
    [
        "text",
        "title_main",
        "id_sub",
        "url_sourcepage",
        "date_publication",
        "hash"
    ]
)

books: List[str] = [
    "hf://datasets/louisbrulenaudet/code-douanes/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impots/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impots-annexe-i/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impots-annexe-ii/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impots-annexe-iii/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impots-annexe-iv/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/code-impositions-biens-services/data/train-00000-of-00001.parquet",
    "hf://datasets/louisbrulenaudet/livre-procedures-fiscales/data/train-00000-of-00001.parquet"
]

# Legal codes: build the Légifrance URL, convert the millisecond timestamp
# to a date string, and hash the article text.
legi_dataframe = pl.concat(
    [
        pl.scan_parquet(book) for book in books
    ]
).with_columns(
    [
        (
            pl.lit("https://www.legifrance.gouv.fr/codes/article_lc/")
            .add(pl.col("id"))
            .alias("url_sourcepage")
        ),
        (
            pl.col("dateDebut")
            .cast(pl.Int64)
            .map_elements(
                lambda x: datetime.fromtimestamp(x / 1000).strftime("%Y-%m-%d %H:%M:%S"),
                return_dtype=pl.Utf8
            )
            .alias("date_publication")
        ),
        (
            pl.col("texte")
            .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)
            .alias("hash")
        )
    ]
).rename(
    {
        "texte": "text",
        "num": "id_sub",
    }
).select(
    [
        "text",
        "title_main",
        "id_sub",
        "url_sourcepage",
        "date_publication",
        "hash"
    ]
)

print("Starting embeddings production...")

# Embed each non-null text one row at a time with the Lemone model.
dataframe = pl.concat(
    [
        bofip_dataframe,
        legi_dataframe
    ]
).filter(
    pl.col("text").is_not_null()
).with_columns(
    pl.col("text").map_elements(
        lambda x: sentence_transformer_ef([x])[0].tolist(),
        return_dtype=pl.List(pl.Float64)
    ).alias("lemone_pro_embeddings")
).collect()
```

## Citation

If you use this code in your research, please use the following BibTeX entry.

```BibTeX
@misc{louisbrulenaudet2024,
  author = {Louis Brulé Naudet},
  title = {Lemone-Embed: A Series of Fine-Tuned Embedding Models for French Taxation},
  year = {2024},
  howpublished = {\url{https://huggingface.co/datasets/louisbrulenaudet/lemone-embed-pro}},
}
```

## Feedback

If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).
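A practical note on the `collection.add` call in the ChromaDB usage section: depending on the ChromaDB version and backend, a single `add` call may be limited in how many records it accepts, so inserting the full dataframe at once can fail. The sketch below shows one way to split the rows into fixed-size ranges. The helper name `iter_batches` and the batch size of 1000 are assumptions chosen for illustration, not values taken from the dataset card or from ChromaDB.

```python
from typing import Iterator, Tuple


def iter_batches(n_rows: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) row ranges that together cover n_rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


# Hypothetical use with the `dataframe` and `collection` objects from the
# ChromaDB example (left commented out because it needs the loaded data):
#
# for start, stop in iter_batches(dataframe.shape[0], 1000):
#     chunk = dataframe.slice(start, stop - start)
#     collection.add(
#         embeddings=chunk["lemone_pro_embeddings"].to_list(),
#         documents=chunk["text"].to_list(),
#         metadatas=chunk.drop(["lemone_pro_embeddings", "text"]).to_dicts(),
#         ids=[str(i) for i in range(start, stop)],
#     )

print(list(iter_batches(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```

The ids are derived from the global row position, so they stay identical to those produced by the single-call version.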

Provider:
maas
Created:
2025-10-13