
LocalDoc/ldquad_v2_retrieval-reranked

Source: Hugging Face | Updated: 2026-03-12 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/LocalDoc/ldquad_v2_retrieval-reranked
Description:
---
dataset_info:
- config_name: corpus
  features:
  - name: passage_id
    dtype: string
  - name: title
    dtype: string
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 36755279
    num_examples: 38741
  download_size: 20357329
  dataset_size: 36755279
- config_name: hard_negatives
  features:
  - name: passage_id
    dtype: string
  - name: question
    dtype: string
  - name: pos_score
    dtype: float64
  - name: neg_1_id
    dtype: string
  - name: neg_1_score
    dtype: float64
  - name: neg_2_id
    dtype: string
  - name: neg_2_score
    dtype: float64
  - name: neg_3_id
    dtype: string
  - name: neg_3_score
    dtype: float64
  - name: neg_4_id
    dtype: string
  - name: neg_4_score
    dtype: float64
  - name: neg_5_id
    dtype: string
  - name: neg_5_score
    dtype: float64
  - name: neg_6_id
    dtype: string
  - name: neg_6_score
    dtype: float64
  - name: neg_7_id
    dtype: string
  - name: neg_7_score
    dtype: float64
  - name: neg_8_id
    dtype: string
  - name: neg_8_score
    dtype: float64
  - name: neg_9_id
    dtype: string
  - name: neg_9_score
    dtype: float64
  - name: neg_10_id
    dtype: string
  - name: neg_10_score
    dtype: float64
  splits:
  - name: train
    num_bytes: 123482169
    num_examples: 329990
  download_size: 77478214
  dataset_size: 123482169
- config_name: queries
  features:
  - name: passage_id
    dtype: string
  - name: question
    dtype: string
  - name: title
    dtype: string
  splits:
  - name: train
    num_bytes: 36635279
    num_examples: 329990
  download_size: 12596688
  dataset_size: 36635279
configs:
- config_name: corpus
  data_files:
  - split: train
    path: corpus/train-*
- config_name: hard_negatives
  data_files:
  - split: train
    path: hard_negatives/train-*
- config_name: queries
  data_files:
  - split: train
    path: queries/train-*
license: cc-by-4.0
task_categories:
- sentence-similarity
language:
- az
tags:
- retrieval
- lquad
- azerbaijani
pretty_name: LDQuAd v2 Retrieval Dataset
size_categories:
- 100K<n<1M
---

# LDQuAd v2 Retrieval Dataset

A retrieval dataset built from [LocalDoc/LDQuAd_v2](https://huggingface.co/datasets/LocalDoc/LDQuAd_v2), a question-answer dataset over Azerbaijani-language Wikipedia content. It is designed for training and evaluating information retrieval, semantic search, and RAG pipelines in Azerbaijani.

## Dataset Configs

The dataset consists of three configs that can be joined via `passage_id`:

### `corpus`

The passage collection: one row per unique content passage (38,741 rows).

| Column | Description |
|---|---|
| `passage_id` | Unique identifier of the passage (SHA-256 prefix) |
| `title` | Wikipedia article title |
| `content` | The text passage |

### `queries`

One question per row (329,990 rows). Each question links to exactly one passage, and a passage can be the target of several questions.

| Column | Description |
|---|---|
| `passage_id` | Links to the relevant passage in `corpus` |
| `title` | Wikipedia article title |
| `question` | The question in Azerbaijani |

### `hard_negatives`

BM25-mined hard negatives scored by a cross-encoder reranker ([BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)). Each row corresponds to one question (329,990 rows) and contains up to 10 hard negative passage IDs with their reranker scores.

| Column | Description |
|---|---|
| `passage_id` | Positive passage ID (links to `corpus`) |
| `question` | The question text in Azerbaijani |
| `pos_score` | Reranker score of the positive passage |
| `neg_{k}_id` | `passage_id` of the k-th hard negative (k = 1 to 10) |
| `neg_{k}_score` | Reranker score of the k-th hard negative |

## Source Dataset

Based on [LocalDoc/LDQuAd_v2](https://huggingface.co/datasets/LocalDoc/LDQuAd_v2), which contains 351,000 question-answer pairs derived from Azerbaijani-language content. Passages were filtered by content length (200–10,000 characters) and deduplicated before building the retrieval corpus.

## Hard Negative Mining Pipeline

1. Unique passages were extracted and deduplicated by content
2. For each question, top-100 candidates were retrieved using BM25
3. The positive passage was excluded from candidates
4. Each candidate was scored with a cross-encoder reranker (BAAI/bge-reranker-v2-m3)
5. Candidates with scores above 95% of the positive score were filtered out as likely false negatives
6. Top-10 remaining negatives were kept, sorted by score (hardest first)

## Example

```python
from datasets import load_dataset

# Repository ID of this dataset page
REPO_ID = "LocalDoc/ldquad_v2_retrieval-reranked"

corpus = load_dataset(REPO_ID, "corpus")["train"]
queries = load_dataset(REPO_ID, "queries")["train"]
hard_negs = load_dataset(REPO_ID, "hard_negatives")["train"]

# Build lookups. A passage can have several questions, so hard negatives
# are keyed by (passage_id, question) rather than by passage_id alone.
passage_lookup = {row["passage_id"]: row for row in corpus}
neg_lookup = {(row["passage_id"], row["question"]): row for row in hard_negs}

# Pick a query
q = queries[0]
print(f"Question: {q['question']}")

# Positive passage
pos = passage_lookup[q["passage_id"]]
print(f"Positive: {pos['content'][:200]}...")

# Hard negatives (the three hardest; slots may be empty)
hn = neg_lookup[(q["passage_id"], q["question"])]
print(f"Positive score: {hn['pos_score']:.4f}")
for k in range(1, 4):
    nid = hn[f"neg_{k}_id"]
    nscore = hn[f"neg_{k}_score"]
    if nid:
        neg = passage_lookup[nid]
        print(f"Neg-{k} [score={nscore:.4f}]: {neg['content'][:200]}...")
```

### Example Output

```
Question: 2006/2007-ci il Azərbaycan kubokunda "Xəzər Lənkəran" hansı mərhələdə yarışa qoşuldu?
✅ Positive [score=6.3750]: 2006/2007-ci il Azərbaycan kubokuna "Xəzər Lənkəran" 1/8 final mərhələsində qoşuldu. Lənkəran təmsilçisi "Bakılı" klubunu 4:0 və 3:0 məğlub edərək növbəti mərhələyə keçdi. 1/4 final mərhələsində Lənkəran təmsilçisinin rəqibi "Bakı FK" oldu...
❌ Neg-1 [score=5.9414]: Daha dəqiq olan Lənkəran təmsilçisi 3:5 hesablı qələbə qazandı və növbəti mərhələyə keçdi. 1/4 final mərhələsində rəqib Bakının "Rəvan" klubu oldu. "Xəzər Lənkəran" hər iki oyunda qalib gəldi (1:2 və 4:1) və növbəti mərhələyə keçdi...
❌ Neg-2 [score=3.2168]: Rəqib Gəncənin "Kəpəz" klubu oldu. Reqlamentə əsasən cütlüyün taleyi 1 oyunda həll olundu. 1:0 hesablı qələbə qazanan "Xəzər Lənkəran" növbəti mərhələyə keçdi...
❌ Neg-3 [score=2.6895]: Ölkə birinciliyində Yakuba Bamba və Edmond Ntiamoah 5, Rəşad Abdullayev və Mario Serjio Souza 4, Emin Quliyev, Nadir Nəbiyev və Junior Osvaldo 3, Elmar Baxşıyev 2...
```

## Contact

For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com].
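
## Sketch: Hard Negative Selection Rule

Steps 3, 5, and 6 of the hard negative mining pipeline reduce to a small filtering rule, and the snippet below is one possible reading of it, not the authors' actual code. The function names `select_hard_negatives` and `negatives_from_row`, the `margin` and `top_k` parameters, and the candidate IDs in the usage example are all hypothetical. The card does not say how negative positive scores were handled, so the sketch simply applies the literal "95% of the positive score" threshold.

```python
from typing import Dict, List, Tuple

ScoredPassage = Tuple[str, float]  # (passage_id, reranker score)


def select_hard_negatives(
    positive_id: str,
    pos_score: float,
    scored_candidates: List[ScoredPassage],
    margin: float = 0.95,  # assumed reading of "95% of the positive score"
    top_k: int = 10,
) -> List[ScoredPassage]:
    """Filter reranker-scored BM25 candidates into hard negatives."""
    threshold = margin * pos_score
    kept = [
        (pid, score)
        for pid, score in scored_candidates
        if pid != positive_id      # step 3: drop the positive passage
        and score <= threshold     # step 5: drop likely false negatives
    ]
    # Step 6: hardest (highest-scoring) negatives first, keep at most top_k.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


def negatives_from_row(row: Dict, max_k: int = 10) -> List[ScoredPassage]:
    """Collect the filled neg_{k}_id / neg_{k}_score slots of a hard_negatives row."""
    negatives = []
    for k in range(1, max_k + 1):
        neg_id = row.get(f"neg_{k}_id")
        if neg_id:  # rows hold "up to 10" negatives, so slots may be empty
            negatives.append((neg_id, row[f"neg_{k}_score"]))
    return negatives


# Usage with the scores from the example output; the IDs are made up.
selected = select_hard_negatives(
    positive_id="pos",
    pos_score=6.3750,
    scored_candidates=[
        ("pos", 6.3750),        # removed: it is the positive itself
        ("too_close", 6.2000),  # removed: above 0.95 * 6.3750 = 6.05625
        ("neg_c", 2.6895),
        ("neg_a", 5.9414),
        ("neg_b", 3.2168),
    ],
)
print(selected)
```

`negatives_from_row` goes the other way: it turns one row of the `hard_negatives` config back into a list of `(passage_id, score)` pairs, which is a convenient shape for building training triplets.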
Provider: LocalDoc