nickprock/it-wiki-retrieval-synthetic-hn

Name: nickprock/it-wiki-retrieval-synthetic-hn
Creator: nickprock
Published: 2026-04-02 14:48:19
License: no description available

Hugging Face · Updated 2026-04-02 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/nickprock/it-wiki-retrieval-synthetic-hn

Resource description:

---
language:
- it
task_categories:
- sentence-similarity
- feature-extraction
- text-retrieval
pretty_name: Italian Synthetic Retrieval Dataset with Hard Negatives
size_categories:
- 10K<n<100K
license: apache-2.0
dataset_info:
  features:
  - name: query
    dtype: string
  - name: positive
    dtype: string
  - name: hard_negatives
    sequence: string
  splits:
  - name: train
    num_bytes: 75912390
    num_examples: 50001
  download_size: 36239382
  dataset_size: 75912390
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Italian Synthetic Retrieval Dataset with Dense Hard Negatives

## Dataset Summary

This dataset is a high-quality, synthetic Information Retrieval (IR) dataset for the Italian language. It is designed to train and fine-tune state-of-the-art embedding models and bi-encoders (e.g., using `MultipleNegativesRankingLoss`).

The dataset consists of exactly **50,000 synthetic search queries** generated from approximately **25,000 unique passages** extracted from the Italian Wikipedia. Furthermore, it is augmented with **Dense Hard Negatives** mined iteratively using a strong cross-encoder/bi-encoder baseline to provide a highly challenging contrastive learning environment.

## Dataset Structure

The dataset contains 50,000 rows. Each row represents a training instance with the following fields:

- `query` (or `anchor`): A synthetically generated search query in Italian (minimum 5 words). It alternates between a specific question and a broad semantic search intent.
- `positive`: The ground-truth paragraph from Italian Wikipedia containing the exact context for the query.
- `hard_negatives`: A list of text passages that are semantically similar to the query but *do not* answer it (mined via Dense Retrieval).

## Dataset Creation Pipeline

The creation of this dataset follows a modern, highly optimized pipeline to prevent common contrastive learning issues (like vector space anisotropy).

1. **Source Data:** ~25,000 paragraphs were sampled from the Italian Wikipedia dump (`wikimedia/wikipedia`, `20231101.it`).
2. **Synthetic Query Generation:** We used the **Qwen-2.5-7B** LLM locally to act as a data engineer. For each paragraph, the LLM was prompted strictly via structured JSON schema (Pydantic) to generate two types of queries:
   - One specific question (`domanda_specifica`).
   - One broad semantic search intent (`ricerca_semantica`).
   - *Constraint:* Single-word keyword queries were explicitly forbidden to force the generation of long-tail semantic queries.
3. **Dense Hard Negative Mining:** Instead of relying on traditional lexical search (BM25), which often produces weak negatives, we employed a **Dense Self-Mining** approach. We used `nickprock/multi-sentence-BERTino` (V1) to encode the entire corpus and perform a semantic search for each query. The top-ranked documents that were *not* the actual positive passage were selected as Hard Negatives.

## Intended Use

This dataset is plug-and-play for training models using the `sentence-transformers` library.

**Training Recommendation:** When training with `CachedMultipleNegativesRankingLoss` or `MultipleNegativesRankingLoss`, we highly recommend unpacking the `hard_negatives` list into explicit columns (e.g., `negative_1`, `negative_2`) and using `BatchSamplers.NO_DUPLICATES` to avoid in-batch collisions and false negatives.

## Author

Created by [Nicola Procopio](https://huggingface.co/nickprock).
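As a follow-up to the Training Recommendation under Intended Use, the unpacking step can be sketched in plain Python. This is a minimal illustration, not the author's own preprocessing code: the helper name `unpack_hard_negatives`, the `num_negatives` parameter, and the toy row are all assumptions made for the example, and the card does not state how many hard negatives each row contains.

```python
from typing import Dict, List


def unpack_hard_negatives(row: Dict[str, object], num_negatives: int = 2) -> Dict[str, str]:
    """Flatten one row into query / positive / negative_1..negative_k columns.

    Hypothetical helper: keeps only the first `num_negatives` entries of the
    `hard_negatives` list and fails loudly if a row has fewer than that.
    """
    negatives: List[str] = list(row["hard_negatives"])
    if len(negatives) < num_negatives:
        raise ValueError(
            f"row has {len(negatives)} hard negatives, expected at least {num_negatives}"
        )

    # Column order matters downstream: anchor first, then positive, then negatives.
    flat = {"query": row["query"], "positive": row["positive"]}
    for i, negative in enumerate(negatives[:num_negatives], start=1):
        flat[f"negative_{i}"] = negative
    return flat


# Toy row invented for illustration; it is not taken from the dataset.
example_row = {
    "query": "In quale anno fu completata la cupola del duomo?",
    "positive": "La cupola fu completata nel 1436.",
    "hard_negatives": [
        "Il campanile fu iniziato nel 1334.",
        "La facciata fu terminata nel XIX secolo.",
        "Il battistero si trova di fronte al duomo.",
    ],
}

flat_row = unpack_hard_negatives(example_row, num_negatives=2)
print(list(flat_row.keys()))
```

With the Hugging Face `datasets` library, a function like this could be passed to `Dataset.map` together with `remove_columns=["hard_negatives"]`, so that only the flattened text columns remain. Raising on short rows is a deliberate choice in this sketch: every row must expose the same columns, so rows with too few negatives should be filtered out beforehand rather than silently padded.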
