LocalDoc/azerbaijani_retriever_corpus-reranked
Source: Hugging Face. Updated 2026-03-08; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/LocalDoc/azerbaijani_retriever_corpus-reranked
Description:
---
language:
- az
license: cc-by-4.0
tags:
- retrieval
- reranking
- azerbaijani
- legislation
pretty_name: Azerbaijan Legislation Retrieval Corpus (Reranked)
dataset_info:
- config_name: corpus
  features:
  - name: chunk_id
    dtype: string
  - name: passage
    dtype: string
  splits:
  - name: train
    num_bytes: 67138014
    num_examples: 65188
  download_size: 37981328
  dataset_size: 67138014
- config_name: hard_negatives
  features:
  - name: query_id
    dtype: string
  - name: chunk_id
    dtype: string
  - name: pos_score
    dtype: float64
  - name: neg_1_id
    dtype: string
  - name: neg_1_score
    dtype: float64
  - name: neg_2_id
    dtype: string
  - name: neg_2_score
    dtype: float64
  - name: neg_3_id
    dtype: string
  - name: neg_3_score
    dtype: float64
  - name: neg_4_id
    dtype: string
  - name: neg_4_score
    dtype: float64
  - name: neg_5_id
    dtype: string
  - name: neg_5_score
    dtype: float64
  - name: neg_6_id
    dtype: string
  - name: neg_6_score
    dtype: float64
  - name: neg_7_id
    dtype: string
  - name: neg_7_score
    dtype: float64
  - name: neg_8_id
    dtype: string
  - name: neg_8_score
    dtype: float64
  - name: neg_9_id
    dtype: string
  - name: neg_9_score
    dtype: float64
  - name: neg_10_id
    dtype: string
  - name: neg_10_score
    dtype: float64
  splits:
  - name: train
    num_bytes: 63959900
    num_examples: 188941
  download_size: 34604048
  dataset_size: 63959900
- config_name: queries
  features:
  - name: query_id
    dtype: string
  - name: chunk_id
    dtype: string
  - name: query
    dtype: string
  splits:
  - name: train
    num_bytes: 20180731
    num_examples: 188941
  download_size: 9262163
  dataset_size: 20180731
task_categories:
- sentence-similarity
size_categories:
- 10K<n<100K
configs:
- config_name: corpus
  data_files:
  - split: train
    path: corpus/train-*
- config_name: hard_negatives
  data_files:
  - split: train
    path: hard_negatives/train-*
- config_name: queries
  data_files:
  - split: train
    path: queries/train-*
---
# Azerbaijan Legislation Retrieval Corpus — Reranked
Reranked version of [LocalDoc/azerbaijani_retriever_corpus](https://huggingface.co/datasets/LocalDoc/azerbaijani_retriever_corpus).
Hard negatives were re-scored with the **BAAI/bge-reranker-v2-m3** cross-encoder. Candidates scoring above 95% of the positive passage's score were treated as likely false negatives and filtered out. The remaining negatives are sorted by score in descending order (hardest first).
## Configs
| Config | Rows | Description |
|---|---|---|
| `corpus` | 65,188 | Passage chunks: `chunk_id`, `passage` |
| `queries` | 188,941 | Queries: `query_id`, `chunk_id`, `query` |
| `hard_negatives` | 188,941 | Reranked negatives: `query_id`, `chunk_id`, `pos_score`, `neg_{1..10}_id`, `neg_{1..10}_score` |
`query_id` links rows in `queries` and `hard_negatives`. In both configs, `chunk_id` identifies the query's positive passage in `corpus`; each `neg_{k}_id` in `hard_negatives` is likewise a `chunk_id` in `corpus`.
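The linkage can be sketched with plain dictionaries. The rows below are made-up stand-ins for the real configs (all IDs and texts are hypothetical); only the column names come from the table above, and `resolve` is an illustrative helper, not part of the dataset.

```python
# Toy rows mirroring the column layout of the three configs.
# All IDs and texts here are invented for illustration.
corpus_rows = [
    {"chunk_id": "c1", "passage": "Passage one."},
    {"chunk_id": "c2", "passage": "Passage two."},
    {"chunk_id": "c3", "passage": "Passage three."},
]
query_rows = [
    {"query_id": "q1", "chunk_id": "c1", "query": "Example query?"},
]
hard_negative_rows = [
    {"query_id": "q1", "chunk_id": "c1", "pos_score": 0.98,
     "neg_1_id": "c3", "neg_1_score": 0.41,
     "neg_2_id": "c2", "neg_2_score": 0.12},
]

# Lookup tables keyed by the linking columns.
passage_by_chunk = {r["chunk_id"]: r["passage"] for r in corpus_rows}
query_by_id = {r["query_id"]: r for r in query_rows}


def resolve(hn_row):
    """Return (query text, positive passage, hardest negative passage)."""
    query_row = query_by_id[hn_row["query_id"]]            # query_id -> queries
    positive = passage_by_chunk[hn_row["chunk_id"]]        # chunk_id -> corpus
    hardest_negative = passage_by_chunk[hn_row["neg_1_id"]]  # neg_k_id -> corpus
    return query_row["query"], positive, hardest_negative


resolved = resolve(hard_negative_rows[0])
print(resolved)
```

Building the two dictionaries once keeps each lookup constant-time, which matters when iterating over all 188,941 rows.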
## Usage
```python
from datasets import load_dataset

corpus = load_dataset("LocalDoc/azerbaijani_retriever_corpus-reranked", "corpus")["train"]
queries = load_dataset("LocalDoc/azerbaijani_retriever_corpus-reranked", "queries")["train"]
hard_negs = load_dataset("LocalDoc/azerbaijani_retriever_corpus-reranked", "hard_negatives")["train"]

# Positive passage for a query
q = queries[0]
# Map chunk_id -> passage text for fast lookup
chunk2passage = {r["chunk_id"]: r["passage"] for r in corpus}
print(q["query"])
print(chunk2passage[q["chunk_id"]])

# Three hardest negatives for the first row of hard_negatives
hn = hard_negs[0]
for k in range(1, 4):
    nid = hn[f"neg_{k}_id"]
    print(f"neg_{k} (score={hn[f'neg_{k}_score']:.4f}): {chunk2passage[nid][:100]}")
```
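For training loops, the wide `neg_{k}_id` / `neg_{k}_score` columns are often easier to handle as a list of pairs. The following is a minimal sketch, assuming (the card does not state this) that a slot may be `None` when fewer than 10 negatives survived filtering. `collect_negatives` is a hypothetical helper name and the row is invented.

```python
def collect_negatives(hn_row, max_k=10):
    """Gather (chunk_id, score) pairs from the neg_{k}_* columns of one row.

    Slots that are absent or None are skipped; whether the real data
    contains such slots is an assumption, not something the card confirms.
    """
    pairs = []
    for k in range(1, max_k + 1):
        nid = hn_row.get(f"neg_{k}_id")
        score = hn_row.get(f"neg_{k}_score")
        if nid is None or score is None:
            continue
        pairs.append((nid, score))
    return pairs


# Invented row with two filled slots and one empty slot.
toy_row = {
    "query_id": "q1", "chunk_id": "c1", "pos_score": 0.95,
    "neg_1_id": "c7", "neg_1_score": 0.62,
    "neg_2_id": "c4", "neg_2_score": 0.30,
    "neg_3_id": None, "neg_3_score": None,
}

negatives = collect_negatives(toy_row)
scores = [score for _, score in negatives]
print(negatives)
```

Because the card says negatives are stored hardest first, slicing the returned list (for example `negatives[:3]`) would select the hardest ones.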
## Reranking details
- **Model**: `BAAI/bge-reranker-v2-m3`
- **Source negatives**: 100 per query (BM25 mined from original dataset)
- **False negative filter**: negatives with score > 95% of positive score removed
- **Output**: top 10 hardest negatives per query, sorted by descending score
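The filter-and-select step described above can be sketched as follows. This is an illustration of the stated rule, not the authors' code: `select_hard_negatives` is a hypothetical name, the candidate scores are invented, and scores are assumed to be non-negative (for example sigmoid-normalized), which the card does not specify.

```python
def select_hard_negatives(pos_score, candidates, ratio=0.95, top_k=10):
    """Drop likely false negatives, then keep the top_k hardest.

    candidates: list of (chunk_id, reranker_score) pairs.
    A candidate is dropped when its score exceeds ratio * pos_score.
    """
    threshold = ratio * pos_score
    kept = [(cid, score) for cid, score in candidates if score <= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # hardest first
    return kept[:top_k]


# Invented scores; with pos_score = 0.90 the threshold is about 0.855.
candidates = [
    ("a", 0.97), ("b", 0.50), ("c", 0.88),
    ("d", 0.10), ("e", 0.80), ("f", 0.30),
]
selected = select_hard_negatives(0.90, candidates, top_k=3)
print(selected)
```

In this toy run, `a` and `c` exceed the threshold and are removed, and the three highest-scoring survivors are returned in descending order.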
Provider:
LocalDoc



