hotchpotch/wikipedia-multilingual-ir-pairs

Source: Hugging Face. Updated 2026-02-10; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/hotchpotch/wikipedia-multilingual-ir-pairs
Resource description:
---
dataset_info:
- config_name: arwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 2745141765
    num_examples: 2985795
  download_size: 1223732465
  dataset_size: 2745141765
- config_name: dewiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 8578299483
    num_examples: 12740626
  download_size: 5055301850
  dataset_size: 8578299483
- config_name: enwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 19836199920
    num_examples: 29568452
  download_size: 11364138347
  dataset_size: 19836199920
- config_name: eswiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 5869299882
    num_examples: 8514623
  download_size: 3353597885
  dataset_size: 5869299882
- config_name: frwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 7423992880
    num_examples: 11282018
  download_size: 4194422772
  dataset_size: 7423992880
- config_name: itwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 4458299921
    num_examples: 6941259
  download_size: 2635970944
  dataset_size: 4458299921
- config_name: jawiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 5475467951
    num_examples: 5864346
  download_size: 3069517233
  dataset_size: 5475467951
- config_name: kowiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 1315741194
    num_examples: 1777129
  download_size: 742568231
  dataset_size: 1315741194
- config_name: ptwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 2652376872
    num_examples: 3937511
  download_size: 1526404781
  dataset_size: 2652376872
- config_name: ruwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 9417472444
    num_examples: 8299364
  download_size: 4441122770
  dataset_size: 9417472444
- config_name: zhwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 2393618967
    num_examples: 3846300
  download_size: 1525516522
  dataset_size: 2393618967
configs:
- config_name: arwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: arwiki-20251202-v1.1.0/train-*
- config_name: dewiki-20251202-v1.1.0
  data_files:
  - split: train
    path: dewiki-20251202-v1.1.0/train-*
- config_name: enwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: enwiki-20251202-v1.1.0/train-*
- config_name: eswiki-20251202-v1.1.0
  data_files:
  - split: train
    path: eswiki-20251202-v1.1.0/train-*
- config_name: frwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: frwiki-20251202-v1.1.0/train-*
- config_name: itwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: itwiki-20251202-v1.1.0/train-*
- config_name: jawiki-20251202-v1.1.0
  data_files:
  - split: train
    path: jawiki-20251202-v1.1.0/train-*
- config_name: kowiki-20251202-v1.1.0
  data_files:
  - split: train
    path: kowiki-20251202-v1.1.0/train-*
- config_name: ptwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: ptwiki-20251202-v1.1.0/train-*
- config_name: ruwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: ruwiki-20251202-v1.1.0/train-*
- config_name: zhwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: zhwiki-20251202-v1.1.0/train-*
---

# wikipedia-multilingual-ir-pairs

🚧 This dataset is under active development and may change.

This dataset is designed for multilingual information retrieval training. Compared with raw Wikipedia paragraph dumps, it provides cleaner and more practical supervision by pairing title/section-driven queries with relevant paragraph-level documents and applying rule-based filtering to remove low-value sections and noisy fragments. It is intended for IR model training, including contrastive learning and retrieval/reranking objectives.

## Dataset at a glance

- Task: IR training (retrieval and reranking)
- Fields: `query` (Wikipedia title, or title + section header), `document` (paragraph text from the corresponding article section)
- Example query construction: if title is `Kyoto` and section header is `History`, the query becomes `Kyoto History` and is paired with the corresponding section text.
- Total rows: **95,757,423**
- Language subsets: **11**
- Source dataset: [singletongue/wikipedia-paragraphs](https://huggingface.co/datasets/singletongue/wikipedia-paragraphs)

## Language Subsets

| Subset | Language | Rows |
| --- | --- | ---: |
| arwiki-20251202-v1.1.0 | Arabic | 2,985,795 |
| dewiki-20251202-v1.1.0 | German | 12,740,626 |
| enwiki-20251202-v1.1.0 | English | 29,568,452 |
| eswiki-20251202-v1.1.0 | Spanish | 8,514,623 |
| frwiki-20251202-v1.1.0 | French | 11,282,018 |
| itwiki-20251202-v1.1.0 | Italian | 6,941,259 |
| jawiki-20251202-v1.1.0 | Japanese | 5,864,346 |
| kowiki-20251202-v1.1.0 | Korean | 1,777,129 |
| ptwiki-20251202-v1.1.0 | Portuguese | 3,937,511 |
| ruwiki-20251202-v1.1.0 | Russian | 8,299,364 |
| zhwiki-20251202-v1.1.0 | Chinese | 3,846,300 |

## Dataset Creation Process (Rough)

The pipeline is implemented in [`scripts/build_subheader_filtered_hf_ds.py`](https://huggingface.co/datasets/hotchpotch/wikipedia-multilingual-ir-pairs/blob/main/scripts/build_subheader_filtered_hf_ds.py).

1. Load each latest language subset from `singletongue/wikipedia-paragraphs`.
2. Filter out non-article pages and low-value sections (for example, "See also", references, external links; language-specific rules).
3. Apply additional quality filters (short/noisy list fragment removal, disambiguation lead gates, query-length constraints, and dropping pairs where `len(query) > len(document)`).
4. Split and merge paragraphs into practical chunks (character-length targets: 1000 for non-CJK, 600 for Japanese/Korean, 500 for Chinese; newline-aware splitting with an optional split cap; balanced merge for consecutive paragraphs in the same section; and a final hard cap on document length).
5. Save each language subset with `query` / `document` columns.

## Acknowledgements

Thank you for publishing the very useful dataset [singletongue/wikipedia-paragraphs](https://huggingface.co/datasets/singletongue/wikipedia-paragraphs). This dataset made it possible to build this IR-focused dataset quickly.

## License

This dataset follows Wikipedia licensing:

- CC-BY-SA 4.0
- GFDL
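
As a rough illustration of the query construction and the `len(query) > len(document)` filter described in the creation process above, the following minimal Python sketch may help. It is not taken from `scripts/build_subheader_filtered_hf_ds.py`; the function names (`build_query`, `keep_pair`, `target_chars`), the short language keys, and the choice to join title and section header with a single space for every language are assumptions made for illustration only.

```python
from typing import Optional

# Character-length targets quoted in the dataset card.
# The short language keys ("ja", "ko", "zh") are a hypothetical convention.
TARGET_CHARS = {"ja": 600, "ko": 600, "zh": 500}
DEFAULT_TARGET_CHARS = 1000  # non-CJK languages


def build_query(title: str, section: Optional[str] = None) -> str:
    """Return the title alone, or 'title section' when a section header is given.

    Assumption: a single space is used as the separator, as in the
    card's `Kyoto History` example.
    """
    title = title.strip()
    section = (section or "").strip()
    return f"{title} {section}" if section else title


def keep_pair(query: str, document: str) -> bool:
    """Keep a pair unless the query is longer than the document."""
    return len(query) <= len(document)


def target_chars(lang: str) -> int:
    """Look up the chunk-length target for a language key."""
    return TARGET_CHARS.get(lang, DEFAULT_TARGET_CHARS)


if __name__ == "__main__":
    query = build_query("Kyoto", "History")
    print(query)
    print(keep_pair(query, "Kyoto"))  # query is longer than the document
    print(target_chars("ja"), target_chars("en"))
```

To work with the published pairs directly, a single subset can be loaded with the Hugging Face `datasets` library, for example `load_dataset("hotchpotch/wikipedia-multilingual-ir-pairs", "jawiki-20251202-v1.1.0", split="train", streaming=True)`; streaming avoids downloading a multi-gigabyte subset up front.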
Provider:
hotchpotch