ArchitRastogi/italian-embedding-finetune-dataset

Name: ArchitRastogi/italian-embedding-finetune-dataset
Creator: ArchitRastogi
Published: 2024-12-02 14:47:31
License: not specified in this listing

Source: Hugging Face. Updated: 2024-12-02. Indexed: 2024-12-14.

Download link:

https://hf-mirror.com/datasets/ArchitRastogi/italian-embedding-finetune-dataset

Description:

This dataset is intended for fine-tuning BERT-based Italian embedding models, with the aim of improving performance on tasks such as information retrieval, semantic search, and embedding generation. It is built from the Italian subset of the C4 dataset, using techniques such as sliding-window segmentation and in-document sampling to create high-quality, diverse training samples. The data is stored in .jsonl format; each record includes a query, a relevant text segment, and a challenging non-relevant text segment. The training set contains 1.13 million rows (about 4.5 GB), and the test set contains 9.09 million rows (about 0.5 GB). Both the dataset and the fine-tuned model are licensed under the Apache 2.0 License, and appropriate credit must be given when they are used.
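The construction described above (sliding-window segmentation, then sampling a hard negative from the same document) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the dataset author's actual pipeline: the field names `query`, `positive`, and `hard_negative`, the word-based windows, the window size and stride, and the choice of the first three words of a segment as the query are all hypothetical.

```python
import json
import random


def sliding_windows(words, size, stride):
    """Split a list of words into overlapping windows of `size` words.

    Trailing words that do not fill a complete window are dropped.
    """
    if len(words) <= size:
        return [" ".join(words)]
    return [
        " ".join(words[start:start + size])
        for start in range(0, len(words) - size + 1, stride)
    ]


def build_triplets(document, size=8, stride=4, seed=0):
    """Build hypothetical query / positive / hard-negative records.

    Assumed scheme: the positive is a segment, the query is its first
    three words, and the hard negative is another segment of the same
    document that does not overlap the positive.
    """
    rng = random.Random(seed)
    segments = sliding_windows(document.split(), size, stride)
    # Windows i and j overlap unless |i - j| * stride >= size.
    min_gap = -(-size // stride)  # ceil(size / stride)
    records = []
    for i, segment in enumerate(segments):
        candidates = [
            other for j, other in enumerate(segments) if abs(i - j) >= min_gap
        ]
        if not candidates:
            continue  # no non-overlapping segment in this document
        records.append({
            "query": " ".join(segment.split()[:3]),
            "positive": segment,
            "hard_negative": rng.choice(candidates),
        })
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Calling `to_jsonl(build_triplets(text))` on a document string yields one line per record, matching the .jsonl layout mentioned in the description. Drawing the negative from the same document is what makes it "challenging": it shares topic and vocabulary with the positive while being a different passage.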

Provided by:

ArchitRastogi
