LocalDoc/azerbaijani_books_retriever_corpus-reranked
Source: Hugging Face. Updated 2026-03-14; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked
Dataset description:
---
dataset_info:
- config_name: corpus
  features:
  - name: passage_id
    dtype: large_string
  - name: content
    dtype: large_string
  splits:
  - name: train
    num_bytes: 1214277260
    num_examples: 570573
  download_size: 646910354
  dataset_size: 1214277260
- config_name: hard_negatives
  features:
  - name: query_id
    dtype: string
  - name: passage_id
    dtype: string
  - name: pos_score
    dtype: float64
  - name: neg_1_id
    dtype: string
  - name: neg_1_score
    dtype: float64
  - name: neg_2_id
    dtype: string
  - name: neg_2_score
    dtype: float64
  - name: neg_3_id
    dtype: string
  - name: neg_3_score
    dtype: float64
  - name: neg_4_id
    dtype: string
  - name: neg_4_score
    dtype: float64
  - name: neg_5_id
    dtype: string
  - name: neg_5_score
    dtype: float64
  - name: neg_6_id
    dtype: string
  - name: neg_6_score
    dtype: float64
  - name: neg_7_id
    dtype: string
  - name: neg_7_score
    dtype: float64
  - name: neg_8_id
    dtype: string
  - name: neg_8_score
    dtype: float64
  - name: neg_9_id
    dtype: string
  - name: neg_9_score
    dtype: float64
  - name: neg_10_id
    dtype: string
  - name: neg_10_score
    dtype: float64
  splits:
  - name: train
    num_bytes: 512278813
    num_examples: 1616877
  download_size: 288801845
  dataset_size: 512278813
- config_name: queries
  features:
  - name: query_id
    dtype: string
  - name: passage_id
    dtype: string
  - name: query
    dtype: string
  - name: query_type
    dtype: string
  splits:
  - name: train
    num_bytes: 190703388
    num_examples: 1616877
  download_size: 67508957
  dataset_size: 190703388
configs:
- config_name: corpus
  data_files:
  - split: train
    path: corpus/train-*
- config_name: hard_negatives
  data_files:
  - split: train
    path: hard_negatives/train-*
- config_name: queries
  data_files:
  - split: train
    path: queries/train-*
license: cc-by-4.0
task_categories:
- sentence-similarity
language:
- az
tags:
- retrieval
- books
- azerbaijani
pretty_name: Azerbaijani Books Retrieval Dataset (Reranked)
size_categories:
- 1M<n<10M
---
# Azerbaijani Books Retrieval Dataset (Reranked)
A large-scale retrieval dataset built from [LocalDoc/books_dataset](https://huggingface.co/datasets/LocalDoc/books_dataset), a collection of 2,804 Azerbaijani-language books with 7.8M sentences spanning politics, history, literature, science, and more. It is designed for training and evaluating information retrieval, semantic search, and RAG pipelines in Azerbaijani.
## Dataset Configs
The dataset consists of three configs that can be joined via `passage_id` and `query_id`:
### `corpus`
The passage collection — one row per unique content passage.
| Column | Description |
|---|---|
| `passage_id` | Unique identifier of the passage (SHA-256 prefix) |
| `content` | The text passage (up to ~2000 characters) |
### `queries`
Three queries per passage (question, statement, keyword), each as a separate row.
| Column | Description |
|---|---|
| `query_id` | Unique query identifier (row index) |
| `passage_id` | Links to the relevant passage in `corpus` |
| `query` | The query text in Azerbaijani |
| `query_type` | One of: `question`, `statement`, `keyword` |
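Because each query is stored as its own row, collecting the queries that belong to one passage means grouping on `passage_id`. The following is a minimal sketch on a few hand-written rows that mimic the `queries` schema; the IDs `"0"`, `"1"`, `"2"`, `"p1"` and the helper name `group_queries_by_passage` are hypothetical and not part of the dataset.

```python
from collections import defaultdict


def group_queries_by_passage(rows):
    """Map passage_id -> {query_type: query} for an iterable of query rows."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["passage_id"]][row["query_type"]] = row["query"]
    return dict(grouped)


# Made-up rows in the `queries` layout; real IDs will differ.
sample_rows = [
    {"query_id": "0", "passage_id": "p1", "query_type": "question",
     "query": "Hansı ölkələr enerji sahəsində əməkdaşlıq edir?"},
    {"query_id": "1", "passage_id": "p1", "query_type": "statement",
     "query": "Azərbaycan-Türkiyə enerji əməkdaşlığı"},
    {"query_id": "2", "passage_id": "p1", "query_type": "keyword",
     "query": "enerji əməkdaşlıq TANAP qaz"},
]

grouped = group_queries_by_passage(sample_rows)
print(grouped["p1"]["keyword"])
```

The same function should work on rows iterated from the loaded `queries` config, since each row behaves like a dictionary.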
### `hard_negatives`
BM25-mined hard negatives scored by a cross-encoder reranker ([BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)). Each row contains up to 10 hard negative passage IDs with their reranker scores.
| Column | Description |
|---|---|
| `query_id` | Links to the query in `queries` |
| `passage_id` | Positive passage ID (links to `corpus`) |
| `pos_score` | Reranker score of the positive passage |
| `neg_{k}_id` | passage_id of the k-th hard negative |
| `neg_{k}_score` | Reranker score of the k-th hard negative |
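The negatives are stored in wide form (`neg_1_id` … `neg_10_id`), so training code often wants them as a list. Below is a small sketch that flattens one row into `(passage_id, score)` pairs. It assumes unused slots hold `None` or an empty string, which the card does not state explicitly; the helper name `collect_negatives` and the sample IDs are hypothetical.

```python
def collect_negatives(row, max_k=10):
    """Return [(passage_id, score), ...] for the filled neg_{k} slots of one row."""
    negatives = []
    for k in range(1, max_k + 1):
        neg_id = row.get(f"neg_{k}_id")
        if not neg_id:  # assumption: an unused slot is None or ""
            continue
        negatives.append((neg_id, row.get(f"neg_{k}_score")))
    return negatives


# Hand-made row in the `hard_negatives` layout; IDs and scores are illustrative.
sample_row = {
    "query_id": "0",
    "passage_id": "p_pos",
    "pos_score": 5.3242,
    "neg_1_id": "p_a", "neg_1_score": 4.4609,
    "neg_2_id": "p_b", "neg_2_score": 4.25,
    "neg_3_id": None, "neg_3_score": None,
}

negatives = collect_negatives(sample_row)
print(negatives)
```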
## Source Dataset
Based on [LocalDoc/books_dataset](https://huggingface.co/datasets/LocalDoc/books_dataset) which contains 7.8M sentences from 2,804 Azerbaijani-language books. The original dataset provides sentence-level text with metadata (title, author, year, publisher, category). Sentences were reassembled by book ID and chunked into passages of up to ~2000 characters.
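To make the chunking step concrete, here is one possible greedy packing of consecutive sentences into passages with a character cap. This is only a sketch: the card states the ~2000-character limit but not the exact splitting rule, so joining with a single space, the handling of over-long sentences, and the function name `chunk_sentences` are all assumptions.

```python
def chunk_sentences(sentences, max_chars=2000):
    """Greedily pack consecutive sentences into passages of at most max_chars."""
    passages, current = [], ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # Either the sentence fits, or a single over-long sentence
            # starts a passage of its own (assumed behaviour).
            current = candidate
        else:
            passages.append(current)
            current = sentence
    if current:
        passages.append(current)
    return passages


# Tiny demonstration with a small cap so the split is visible.
demo = chunk_sentences(["a" * 30, "b" * 30, "c" * 30], max_chars=70)
print([len(p) for p in demo])
```

In practice the sentences would first be grouped by book ID so that a passage never mixes text from two books.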
## Query Generation
For each passage chunk, three types of search queries were generated using an LLM:
- **question** — a natural question in Azerbaijani (e.g., "Hansı ölkələr enerji sahəsində əməkdaşlıq edir?")
- **statement** — a declarative statement describing the passage topic (e.g., "Azərbaycan-Türkiyə enerji əməkdaşlığı")
- **keyword** — a short keyword-style search query, 2–5 words (e.g., "enerji əməkdaşlıq TANAP qaz")
## Hard Negative Mining Pipeline
1. Book sentences were reassembled by book ID and chunked into passages (~2000 chars max)
2. Passages were deduplicated by content, leaving only unique passages
3. For each query, top-100 candidates were retrieved using BM25
4. The positive passage was excluded from candidates
5. Each candidate was scored with a cross-encoder reranker (BAAI/bge-reranker-v2-m3)
6. Candidates with scores above 95% of the positive score were filtered out as likely false negatives
7. Top-10 remaining negatives were kept, sorted by score (hardest first)
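Steps 4, 6, and 7 can be sketched in plain Python once the BM25 candidates already carry reranker scores. This is an illustration under stated assumptions, not the authors' code: the function name `select_hard_negatives`, its parameters, and the sample IDs are hypothetical, and the 95% rule is applied literally as `score > 0.95 * pos_score`. The card does not say how non-positive positive scores were handled.

```python
def select_hard_negatives(pos_id, pos_score, scored_candidates, ratio=0.95, top_k=10):
    """Drop the positive, drop likely false negatives, keep the hardest top_k."""
    threshold = ratio * pos_score
    kept = [
        (pid, score)
        for pid, score in scored_candidates
        # Step 4: exclude the positive. Step 6: exclude scores above the threshold.
        if pid != pos_id and score <= threshold
    ]
    # Step 7: highest reranker score first, i.e. hardest negatives first.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


# Illustrative candidates as (passage_id, reranker_score) pairs.
candidates = [
    ("p_pos", 5.3242),  # the positive itself, as BM25 may return it
    ("p_a", 5.2),       # above 95% of the positive score, treated as a false negative
    ("p_b", 4.4609),
    ("p_c", 4.25),
    ("p_d", 1.0),
]

selected = select_hard_negatives("p_pos", 5.3242, candidates, top_k=2)
print(selected)
```

The BM25 retrieval (step 3) and cross-encoder scoring (step 5) are assumed to have run upstream and are not shown here.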
## Statistics
| Config | Rows |
|---|---|
| `corpus` | 570,573 passages |
| `queries` | 1,616,877 queries |
| `hard_negatives` | 1,616,877 rows × up to 10 negatives |
## Example
```python
from datasets import load_dataset

corpus = load_dataset("LocalDoc/azerbaijani_books_retriever_corpus-reranked", "corpus")["train"]
queries = load_dataset("LocalDoc/azerbaijani_books_retriever_corpus-reranked", "queries")["train"]
hard_negs = load_dataset("LocalDoc/azerbaijani_books_retriever_corpus-reranked", "hard_negatives")["train"]

# Build lookups keyed by ID (this iterates over every row once)
passage_lookup = {row["passage_id"]: row for row in corpus}
neg_lookup = {row["query_id"]: row for row in hard_negs}

# Pick a query
q = queries[0]
print(f"Query [{q['query_type']}]: {q['query']}")

# Positive passage
pos = passage_lookup[q["passage_id"]]
print(f"Positive: {pos['content'][:200]}...")

# Hard negatives: show the first three filled slots
hn = neg_lookup[q["query_id"]]
print(f"Positive score: {hn['pos_score']:.4f}")
for k in range(1, 4):
    nid = hn[f"neg_{k}_id"]
    nscore = hn[f"neg_{k}_score"]
    if nid:  # skip slots with no negative
        neg = passage_lookup[nid]
        print(f"Neg-{k} [score={nscore:.4f}]: {neg['content'][:200]}...")
```
### Example Output
```
Query [question]: Türkiyə-Yunanıstan qaz kəməri və Şahdəniz-2 layihəsi Avropanın enerji təhlükəsizliyinə necə təsir göstərir?
✅ Positive [score=5.3242]:
Hazırda Yunanıstan Avropa İttifaqının üzvü olaenerji təhlükəsizliyinin təmin edilməsində
önəmli rol oyna- raq, enerji sahəsində Azərbaycanla birbaşa əməkdaşlıq edir və yacaqdır.
onun az miqdarda olsa da ixrac qazını alır...
❌ Neg-1 [score=4.4609]:
Bu iki tarixi layihə bizi bir-birimizə çox sıx şəkildə bağlamaqdadır. Bu mənada bir
məqamı xüsusi vurğulamaq istəyirəm: Biz Xəzər hövzəsi və Orta Asiya təbii qazının
ölkəmizin ərazisindən alternativ marşrutlarla Avropaya nəqlini nəzərdə tutan...
❌ Neg-2 [score=4.2500]:
«Şahdəniz» yatağında qaz ehtiyatları 1 trilyon kubmetrdən çoxdur. Ümumiyyətlə,
Azərbaycanın digər yataqları ilə birlikdə qaz ehtiyatları 2.6 trilyon kubmetr təşkil edir.
İkinci layihə Azərbaycanı Gürcüstanla birləşdirən...
❌ Neg-3 [score=4.2109]:
TANAP Azərbaycan xalqının böyük lideri, mənim dostum İlham Əliyevin rəhbərliyi ilə
Azərbaycanın enerji təhlükəsizliyi, türk xalqının böyük lideri, Ukraynanın dostu və
mənim dostum Türkiyə Prezidenti Rəcəb Tayyib Ərdoğanın rəhbərliyi ilə...
```
## Contact
For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com].
Provider: LocalDoc



