nickprock/it-wiki-retrieval-synthetic-hn
Hugging Face | Updated 2026-04-02 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/nickprock/it-wiki-retrieval-synthetic-hn
Resource description:
---
language:
- it
task_categories:
- sentence-similarity
- feature-extraction
- text-retrieval
pretty_name: Italian Synthetic Retrieval Dataset with Hard Negatives
size_categories:
- 10K<n<100K
license: apache-2.0
dataset_info:
  features:
  - name: query
    dtype: string
  - name: positive
    dtype: string
  - name: hard_negatives
    sequence: string
  splits:
  - name: train
    num_bytes: 75912390
    num_examples: 50001
  download_size: 36239382
  dataset_size: 75912390
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Italian Synthetic Retrieval Dataset with Dense Hard Negatives
## Dataset Summary
This is a synthetic Information Retrieval (IR) dataset for the Italian language. It is designed for training and fine-tuning embedding models and bi-encoders (e.g., using `MultipleNegativesRankingLoss`).
The dataset consists of about **50,000 synthetic search queries** generated from approximately **25,000 unique passages** extracted from the Italian Wikipedia. It is augmented with **Dense Hard Negatives** mined with a dense encoder baseline, which makes the contrastive learning task more challenging.
## Dataset Structure
The `train` split contains 50,001 rows according to the metadata above. Each row represents a training instance with the following fields:
- `query` (or `anchor`): A synthetically generated search query in Italian (minimum 5 words). It alternates between a specific question and a broad semantic search intent.
- `positive`: The ground-truth paragraph from Italian Wikipedia containing the exact context for the query.
- `hard_negatives`: A list of text passages that are semantically similar to the query but *do not* answer it (mined via Dense Retrieval).
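A minimal sketch of how a row could be loaded and sanity-checked, assuming the Hugging Face `datasets` library is installed and the repository id matches the page title. The helper names `load_train_split` and `check_row`, the `RetrievalRow` type, and the example row are illustrative and not part of the dataset.

```python
from typing import List, TypedDict


class RetrievalRow(TypedDict):
    """Shape of one training row, following the field list above."""
    query: str
    positive: str
    hard_negatives: List[str]


def load_train_split():
    """Load the train split (needs network access and the `datasets` package)."""
    from datasets import load_dataset  # lazy import so the checks below run without it
    return load_dataset("nickprock/it-wiki-retrieval-synthetic-hn", split="train")


def check_row(row: RetrievalRow, min_query_words: int = 5) -> List[str]:
    """Return the problems found in a row; an empty list means the row looks fine."""
    problems = []
    if len(row["query"].split()) < min_query_words:
        problems.append("query shorter than the documented minimum")
    if not row["positive"].strip():
        problems.append("empty positive passage")
    if row["positive"] in row["hard_negatives"]:
        problems.append("positive passage also listed as a hard negative")
    return problems


# Hypothetical row, written for illustration only (not taken from the dataset).
example: RetrievalRow = {
    "query": "In quale regione si trova la città di Matera?",
    "positive": "Matera è un comune italiano della Basilicata.",
    "hard_negatives": ["Potenza è il capoluogo della Basilicata."],
}
```

Calling `check_row(example)` returns an empty list for this row. The same check could be run over `load_train_split()` to spot rows that deviate from the documented structure.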
## Dataset Creation Pipeline
The pipeline below is designed to avoid common contrastive learning issues, such as vector space anisotropy.
1. **Source Data:** ~25,000 paragraphs were sampled from the Italian Wikipedia dump (`wikimedia/wikipedia`, `20231101.it`).
2. **Synthetic Query Generation:** We ran the **Qwen-2.5-7B** LLM locally and instructed it to act as a data engineer. For each paragraph, the LLM was prompted with a strict structured JSON schema (defined with Pydantic) to generate two types of queries:
   - One specific question (`domanda_specifica`).
   - One broad semantic search intent (`ricerca_semantica`).
   - *Constraint:* Single-word keyword queries were explicitly forbidden to force the generation of long-tail semantic queries.
3. **Dense Hard Negative Mining:** Instead of relying on traditional lexical search (BM25), which often produces weak negatives, we employed a **Dense Self-Mining** approach. We used `nickprock/multi-sentence-BERTino` (V1) to encode the entire corpus and perform a semantic search for each query. The top-ranked documents that were *not* the actual positive passage were selected as Hard Negatives.
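The mining step above can be sketched roughly as follows. This is not the author's code: toy 2-D vectors stand in for the embeddings produced by `nickprock/multi-sentence-BERTino`, and the function name `mine_hard_negatives`, the document ids, and the `top_k` value are assumptions made for illustration (the number of negatives kept per query is not stated in this card).

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(
    query_vec: Sequence[float],
    corpus_vecs: Dict[str, Sequence[float]],
    positive_id: str,
    top_k: int = 2,
) -> List[str]:
    """Rank the corpus by similarity to the query and keep the top_k non-positive ids."""
    ranked = sorted(
        corpus_vecs,
        key=lambda doc_id: cosine(query_vec, corpus_vecs[doc_id]),
        reverse=True,
    )
    return [doc_id for doc_id in ranked if doc_id != positive_id][:top_k]


# Toy 2-D vectors standing in for real sentence embeddings (illustrative values).
corpus = {
    "p_matera": [0.9, 0.1],
    "p_potenza": [0.8, 0.3],
    "p_bari": [0.6, 0.6],
    "p_calcio": [0.0, 1.0],
}
query = [1.0, 0.0]
negatives = mine_hard_negatives(query, corpus, positive_id="p_matera", top_k=2)
```

Here the two passages closest to the query, other than the positive, are kept as hard negatives. On a corpus of this size a real run would typically use batched encoding and a vector index rather than a full sort per query.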
## Intended Use
This dataset is plug-and-play for training models using the `sentence-transformers` library.
**Training Recommendation:**
When training with `CachedMultipleNegativesRankingLoss` or `MultipleNegativesRankingLoss`, we highly recommend unpacking the `hard_negatives` list into explicit columns (e.g., `negative_1`, `negative_2`) and using `BatchSamplers.NO_DUPLICATES` to avoid in-batch collisions and false negatives.
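One possible way to unpack the `hard_negatives` list into explicit columns is sketched below. The helper name `unpack_negatives`, the `num_negatives` parameter, and the example row are assumptions for illustration; the sketch also assumes every row holds at least `num_negatives` entries, which this card does not guarantee.

```python
from typing import Any, Dict


def unpack_negatives(row: Dict[str, Any], num_negatives: int) -> Dict[str, str]:
    """Flatten `hard_negatives` into negative_1..negative_N columns.

    Raises ValueError if the row holds fewer than `num_negatives` hard negatives.
    """
    negatives = list(row["hard_negatives"])
    if len(negatives) < num_negatives:
        raise ValueError(
            f"expected at least {num_negatives} hard negatives, got {len(negatives)}"
        )
    # Insertion order is kept: query, positive, then the negatives.
    flat = {"query": row["query"], "positive": row["positive"]}
    for i in range(num_negatives):
        flat[f"negative_{i + 1}"] = negatives[i]
    return flat


# Hypothetical row, written for illustration only (not taken from the dataset).
row = {
    "query": "In quale regione si trova la città di Matera?",
    "positive": "Matera è un comune italiano della Basilicata.",
    "hard_negatives": [
        "Potenza è il capoluogo della Basilicata.",
        "Bari è il capoluogo della Puglia.",
        "Lecce è una città del Salento.",
    ],
}
flat_row = unpack_negatives(row, num_negatives=2)
```

With the `datasets` library, a helper like this could be applied through `dataset.map(lambda r: unpack_negatives(r, 2), remove_columns=["hard_negatives"])`, and the sampler can be selected by passing `batch_sampler=BatchSamplers.NO_DUPLICATES` to `SentenceTransformerTrainingArguments`.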
## Author
Created by [Nicola Procopio](https://huggingface.co/nickprock).
Provider:
nickprock



