
LocalDoc/azerbaijani_books_retriever_corpus-reranked

Source: Hugging Face. Updated 2026-03-14; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked
Resource description:
---
dataset_info:
- config_name: corpus
  features:
  - name: passage_id
    dtype: large_string
  - name: content
    dtype: large_string
  splits:
  - name: train
    num_bytes: 1214277260
    num_examples: 570573
  download_size: 646910354
  dataset_size: 1214277260
- config_name: hard_negatives
  features:
  - name: query_id
    dtype: string
  - name: passage_id
    dtype: string
  - name: pos_score
    dtype: float64
  - name: neg_1_id
    dtype: string
  - name: neg_1_score
    dtype: float64
  - name: neg_2_id
    dtype: string
  - name: neg_2_score
    dtype: float64
  - name: neg_3_id
    dtype: string
  - name: neg_3_score
    dtype: float64
  - name: neg_4_id
    dtype: string
  - name: neg_4_score
    dtype: float64
  - name: neg_5_id
    dtype: string
  - name: neg_5_score
    dtype: float64
  - name: neg_6_id
    dtype: string
  - name: neg_6_score
    dtype: float64
  - name: neg_7_id
    dtype: string
  - name: neg_7_score
    dtype: float64
  - name: neg_8_id
    dtype: string
  - name: neg_8_score
    dtype: float64
  - name: neg_9_id
    dtype: string
  - name: neg_9_score
    dtype: float64
  - name: neg_10_id
    dtype: string
  - name: neg_10_score
    dtype: float64
  splits:
  - name: train
    num_bytes: 512278813
    num_examples: 1616877
  download_size: 288801845
  dataset_size: 512278813
- config_name: queries
  features:
  - name: query_id
    dtype: string
  - name: passage_id
    dtype: string
  - name: query
    dtype: string
  - name: query_type
    dtype: string
  splits:
  - name: train
    num_bytes: 190703388
    num_examples: 1616877
  download_size: 67508957
  dataset_size: 190703388
configs:
- config_name: corpus
  data_files:
  - split: train
    path: corpus/train-*
- config_name: hard_negatives
  data_files:
  - split: train
    path: hard_negatives/train-*
- config_name: queries
  data_files:
  - split: train
    path: queries/train-*
license: cc-by-4.0
task_categories:
- sentence-similarity
language:
- az
tags:
- retrieval
- books
- azerbaijani
pretty_name: Azerbaijani Books Retrieval Dataset (Reranked)
size_categories:
- 1M<n<10M
---

# Azerbaijani Books Retrieval Dataset (Reranked)

A large-scale retrieval dataset built from [LocalDoc/books_dataset](https://huggingface.co/datasets/LocalDoc/books_dataset), a collection of 2,804 Azerbaijani-language books with 7.8M sentences spanning politics, history, literature, science, and more. Designed for training and evaluating information retrieval, semantic search, and RAG pipelines in Azerbaijani.

## Dataset Configs

The dataset consists of three configs that can be joined via `passage_id` and `query_id`:

### `corpus`

The passage collection: one row per unique content passage.

| Column | Description |
|---|---|
| `passage_id` | Unique identifier of the passage (SHA-256 prefix) |
| `content` | The text passage (up to ~2000 characters) |

### `queries`

Three queries per passage (question, statement, keyword), each as a separate row.

| Column | Description |
|---|---|
| `query_id` | Unique query identifier (row index) |
| `passage_id` | Links to the relevant passage in `corpus` |
| `query` | The query text in Azerbaijani |
| `query_type` | One of: `question`, `statement`, `keyword` |

### `hard_negatives`

BM25-mined hard negatives scored by a cross-encoder reranker ([BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)). Each row contains up to 10 hard negative passage IDs with their reranker scores.

| Column | Description |
|---|---|
| `query_id` | Links to the query in `queries` |
| `passage_id` | Positive passage ID (links to `corpus`) |
| `pos_score` | Reranker score of the positive passage |
| `neg_{k}_id` | passage_id of the k-th hard negative |
| `neg_{k}_score` | Reranker score of the k-th hard negative |

## Source Dataset

Based on [LocalDoc/books_dataset](https://huggingface.co/datasets/LocalDoc/books_dataset), which contains 7.8M sentences from 2,804 Azerbaijani-language books. The original dataset provides sentence-level text with metadata (title, author, year, publisher, category). Sentences were reassembled by book ID and chunked into passages of up to ~2000 characters.

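The reassembly and chunking step described above could look roughly like the sketch below. It is an illustration only: the keys `book_id` and `sentence`, the helper name `chunk_books`, and the greedy packing strategy are assumptions, since the card does not say how sentences were actually packed into passages.

```python
from collections import defaultdict

MAX_CHARS = 2000  # approximate passage limit stated in the dataset card


def chunk_books(rows, max_chars=MAX_CHARS):
    """Group sentences by book and greedily pack them into passages.

    `rows` is an iterable of dicts with hypothetical keys `book_id` and
    `sentence`; the real column names in LocalDoc/books_dataset may differ.
    """
    # Reassemble sentences per book, preserving input order.
    by_book = defaultdict(list)
    for row in rows:
        by_book[row["book_id"]].append(row["sentence"])

    passages = []
    for book_id, sentences in by_book.items():
        current = ""
        for sentence in sentences:
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                # Adding this sentence would exceed the limit: close the passage.
                passages.append({"book_id": book_id, "content": current})
                current = sentence
            else:
                current = candidate
        if current:
            passages.append({"book_id": book_id, "content": current})
    return passages
```

In this sketch a single sentence longer than `max_chars` is kept whole as its own oversized passage rather than being split mid-sentence.
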
## Query Generation

For each passage chunk, three types of search queries were generated using an LLM:

- **question**: a natural question in Azerbaijani (e.g., "Hansı ölkələr enerji sahəsində əməkdaşlıq edir?")
- **statement**: a declarative statement describing the passage topic (e.g., "Azərbaycan-Türkiyə enerji əməkdaşlığı")
- **keyword**: a short keyword-style search query, 2–5 words (e.g., "enerji əməkdaşlıq TANAP qaz")

## Hard Negative Mining Pipeline

1. Book sentences were reassembled by ID and chunked into passages (~2000 chars max)
2. Unique passages were deduplicated by content
3. For each query, top-100 candidates were retrieved using BM25
4. The positive passage was excluded from candidates
5. Each candidate was scored with a cross-encoder reranker (BAAI/bge-reranker-v2-m3)
6. Candidates with scores above 95% of the positive score were filtered out as likely false negatives
7. Top-10 remaining negatives were kept, sorted by score (hardest first)

## Statistics

| Config | Rows |
|---|---|
| `corpus` | 570,573 passages |
| `queries` | 1,616,877 queries |
| `hard_negatives` | 1,616,877 rows × 10 negatives |

## Example

```python
from datasets import load_dataset

corpus = load_dataset("LocalDoc/azerbaijani_books_retriever_corpus-reranked", "corpus")["train"]
queries = load_dataset("LocalDoc/azerbaijani_books_retriever_corpus-reranked", "queries")["train"]
hard_negs = load_dataset("LocalDoc/azerbaijani_books_retriever_corpus-reranked", "hard_negatives")["train"]

# Build lookups
passage_lookup = {row["passage_id"]: row for row in corpus}
neg_lookup = {row["query_id"]: row for row in hard_negs}

# Pick a query
q = queries[0]
print(f"Query [{q['query_type']}]: {q['query']}")

# Positive passage
pos = passage_lookup[q["passage_id"]]
print(f"Positive: {pos['content'][:200]}...")

# Hard negatives
hn = neg_lookup[q["query_id"]]
print(f"Positive score: {hn['pos_score']:.4f}")
for k in range(1, 4):
    nid = hn[f"neg_{k}_id"]
    nscore = hn[f"neg_{k}_score"]
    if nid:
        neg = passage_lookup[nid]
        print(f"Neg-{k} [score={nscore:.4f}]: {neg['content'][:200]}...")
```

### Example Output

```
Query [question]: Türkiyə-Yunanıstan qaz kəməri və Şahdəniz-2 layihəsi Avropanın enerji təhlükəsizliyinə necə təsir göstərir?

✅ Positive [score=5.3242]: Hazırda Yunanıstan Avropa İttifaqının üzvü olaenerji təhlükəsizliyinin təmin edilməsində önəmli rol oyna- raq, enerji sahəsində Azərbaycanla birbaşa əməkdaşlıq edir və yacaqdır. onun az miqdarda olsa da ixrac qazını alır...

❌ Neg-1 [score=4.4609]: Bu iki tarixi layihə bizi bir-birimizə çox sıx şəkildə bağlamaqdadır. Bu mənada bir məqamı xüsusi vurğulamaq istəyirəm: Biz Xəzər hövzəsi və Orta Asiya təbii qazının ölkəmizin ərazisindən alternativ marşrutlarla Avropaya nəqlini nəzərdə tutan...

❌ Neg-2 [score=4.2500]: «Şahdəniz» yatağında qaz ehtiyatları 1 trilyon kubmetrdən çoxdur. Ümumiyyətlə, Azərbaycanın digər yataqları ilə birlikdə qaz ehtiyatları 2.6 trilyon kubmetr təşkil edir. İkinci layihə Azərbaycanı Gürcüstanla birləşdirən...

❌ Neg-3 [score=4.2109]: TANAP Azərbaycan xalqının böyük lideri, mənim dostum İlham Əliyevin rəhbərliyi ilə Azərbaycanın enerji təhlükəsizliyi, türk xalqının böyük lideri, Ukraynanın dostu və mənim dostum Türkiyə Prezidenti Rəcəb Tayyib Ərdoğanın rəhbərliyi ilə...
```

## Contact

For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com].
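
Steps 4 to 7 of the hard negative mining pipeline (exclude the positive, filter likely false negatives, keep the ten hardest) can be sketched as below. This is a minimal illustration under assumptions, not the authors' code: the function name `select_hard_negatives`, its parameters, and the reading of "above 95% of the positive score" as `score > 0.95 * pos_score` are all hypothetical, and that reading only behaves sensibly when the positive score is positive.

```python
def select_hard_negatives(pos_id, pos_score, candidates, ratio=0.95, top_k=10):
    """Pick hard negatives from reranker-scored BM25 candidates.

    `candidates` is a list of (passage_id, reranker_score) pairs.
    All names and the exact threshold rule are assumptions for illustration.
    """
    # Assumed interpretation of "above 95% of the positive score".
    threshold = ratio * pos_score
    kept = [
        (pid, score)
        for pid, score in candidates
        if pid != pos_id          # step 4: drop the positive passage itself
        and score <= threshold    # step 6: drop likely false negatives
    ]
    # Step 7: hardest (highest-scoring) negatives first, capped at top_k.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

With the example output above, the positive score 5.3242 gives a threshold of about 5.058, so the listed negatives (4.4609, 4.2500, 4.2109) would all pass this filter.
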
Provider:
LocalDoc