hotchpotch/wikipedia-multilingual-ir-pairs
Hugging Face · Updated 2026-02-10 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/hotchpotch/wikipedia-multilingual-ir-pairs
Resource description:
---
dataset_info:
- config_name: arwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 2745141765
    num_examples: 2985795
  download_size: 1223732465
  dataset_size: 2745141765
- config_name: dewiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 8578299483
    num_examples: 12740626
  download_size: 5055301850
  dataset_size: 8578299483
- config_name: enwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 19836199920
    num_examples: 29568452
  download_size: 11364138347
  dataset_size: 19836199920
- config_name: eswiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 5869299882
    num_examples: 8514623
  download_size: 3353597885
  dataset_size: 5869299882
- config_name: frwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 7423992880
    num_examples: 11282018
  download_size: 4194422772
  dataset_size: 7423992880
- config_name: itwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 4458299921
    num_examples: 6941259
  download_size: 2635970944
  dataset_size: 4458299921
- config_name: jawiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 5475467951
    num_examples: 5864346
  download_size: 3069517233
  dataset_size: 5475467951
- config_name: kowiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 1315741194
    num_examples: 1777129
  download_size: 742568231
  dataset_size: 1315741194
- config_name: ptwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 2652376872
    num_examples: 3937511
  download_size: 1526404781
  dataset_size: 2652376872
- config_name: ruwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 9417472444
    num_examples: 8299364
  download_size: 4441122770
  dataset_size: 9417472444
- config_name: zhwiki-20251202-v1.1.0
  features:
  - name: query
    dtype: string
  - name: document
    dtype: string
  splits:
  - name: train
    num_bytes: 2393618967
    num_examples: 3846300
  download_size: 1525516522
  dataset_size: 2393618967
configs:
- config_name: arwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: arwiki-20251202-v1.1.0/train-*
- config_name: dewiki-20251202-v1.1.0
  data_files:
  - split: train
    path: dewiki-20251202-v1.1.0/train-*
- config_name: enwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: enwiki-20251202-v1.1.0/train-*
- config_name: eswiki-20251202-v1.1.0
  data_files:
  - split: train
    path: eswiki-20251202-v1.1.0/train-*
- config_name: frwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: frwiki-20251202-v1.1.0/train-*
- config_name: itwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: itwiki-20251202-v1.1.0/train-*
- config_name: jawiki-20251202-v1.1.0
  data_files:
  - split: train
    path: jawiki-20251202-v1.1.0/train-*
- config_name: kowiki-20251202-v1.1.0
  data_files:
  - split: train
    path: kowiki-20251202-v1.1.0/train-*
- config_name: ptwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: ptwiki-20251202-v1.1.0/train-*
- config_name: ruwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: ruwiki-20251202-v1.1.0/train-*
- config_name: zhwiki-20251202-v1.1.0
  data_files:
  - split: train
    path: zhwiki-20251202-v1.1.0/train-*
---
# wikipedia-multilingual-ir-pairs
🚧 This dataset is under active development and may change.
This dataset is designed for multilingual information retrieval training.
Compared with raw Wikipedia paragraph dumps, it provides cleaner and more practical supervision: title/section-driven queries are paired with relevant paragraph-level documents, and rule-based filtering removes low-value sections and noisy fragments.
It is intended for IR model training, including contrastive learning and retrieval/reranking objectives.
## Dataset at a glance
- Task: IR training (retrieval and reranking)
- Fields: `query` (Wikipedia title, or title + section header), `document` (paragraph text from the corresponding article section)
- Example query construction: if title is `Kyoto` and section header is `History`, the query becomes `Kyoto History` and is paired with the corresponding section text.
- Total rows: **95,757,423**
- Language subsets: **11**
- Source dataset: [singletongue/wikipedia-paragraphs](https://huggingface.co/datasets/singletongue/wikipedia-paragraphs)
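
The query construction described above can be sketched as follows. The helper name `build_query` is hypothetical, and the single-space join is inferred from the `Kyoto History` example; the actual build script may normalize text differently.

```python
from typing import Optional


def build_query(title: str, section_header: Optional[str] = None) -> str:
    """Join an article title with an optional section header (hypothetical helper)."""
    title = title.strip()
    header = (section_header or "").strip()
    # Without a section header, the query is the title alone.
    return f"{title} {header}" if header else title


print(build_query("Kyoto", "History"))  # Kyoto History
print(build_query("Kyoto"))             # Kyoto
```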
## Language Subsets
| Subset | Language | Rows |
| --- | --- | ---: |
| arwiki-20251202-v1.1.0 | Arabic | 2,985,795 |
| dewiki-20251202-v1.1.0 | German | 12,740,626 |
| enwiki-20251202-v1.1.0 | English | 29,568,452 |
| eswiki-20251202-v1.1.0 | Spanish | 8,514,623 |
| frwiki-20251202-v1.1.0 | French | 11,282,018 |
| itwiki-20251202-v1.1.0 | Italian | 6,941,259 |
| jawiki-20251202-v1.1.0 | Japanese | 5,864,346 |
| kowiki-20251202-v1.1.0 | Korean | 1,777,129 |
| ptwiki-20251202-v1.1.0 | Portuguese | 3,937,511 |
| ruwiki-20251202-v1.1.0 | Russian | 8,299,364 |
| zhwiki-20251202-v1.1.0 | Chinese | 3,846,300 |
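
As a minimal sketch, a subset from the table above could be streamed with the Hugging Face `datasets` library. The helper names `subset_name` and `stream_pairs` are hypothetical, and the snapshot and version strings are copied from the subset names listed here, so they may change while the dataset is under development.

```python
from itertools import islice
from typing import Iterator, Tuple

DATASET_ID = "hotchpotch/wikipedia-multilingual-ir-pairs"
SNAPSHOT = "20251202"  # taken from the subset names in the table
VERSION = "v1.1.0"     # taken from the subset names in the table


def subset_name(lang: str) -> str:
    """Build a config name such as 'jawiki-20251202-v1.1.0' (hypothetical helper)."""
    return f"{lang}wiki-{SNAPSHOT}-{VERSION}"


def stream_pairs(lang: str, limit: int = 3) -> Iterator[Tuple[str, str]]:
    """Yield up to `limit` (query, document) pairs without downloading the full subset."""
    # Imported lazily so the name helper works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(DATASET_ID, subset_name(lang), split="train", streaming=True)
    for row in islice(ds, limit):
        yield row["query"], row["document"]
```

Iterating over `stream_pairs("ja")` would then yield the first few Japanese pairs; streaming is used here because the larger subsets are several gigabytes.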
## Dataset Creation Process (Rough)
The pipeline is implemented in [`scripts/build_subheader_filtered_hf_ds.py`](https://huggingface.co/datasets/hotchpotch/wikipedia-multilingual-ir-pairs/blob/main/scripts/build_subheader_filtered_hf_ds.py).
1. Load the latest snapshot of each language subset from `singletongue/wikipedia-paragraphs`.
2. Filter out non-article pages and low-value sections (for example, "See also", references, external links; language-specific rules).
3. Apply additional quality filters (short/noisy list fragment removal, disambiguation lead gates, query-length constraints, and dropping pairs where `len(query) > len(document)`).
4. Split and merge paragraphs into practical chunks (character-length targets: 1000 for non-CJK, 600 for Japanese/Korean, 500 for Chinese; newline-aware splitting with an optional split cap; balanced merge for consecutive paragraphs in the same section; and a final hard cap on document length).
5. Save each language subset with `query` / `document` columns.
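
Parts of steps 2 to 4 can be illustrated with a simplified sketch. This is not the code from `scripts/build_subheader_filtered_hf_ds.py`: all function names, the language-code keys, and the English-only header list are assumptions for illustration, and the balanced merge, split cap, and final hard cap are omitted.

```python
from typing import List

# Character-length targets from step 4; the language codes used as keys are an assumption.
TARGET_CHARS = {"ja": 600, "ko": 600, "zh": 500}
DEFAULT_TARGET_CHARS = 1000

# Illustrative English-only list; the real rules are language-specific.
LOW_VALUE_HEADERS = {"see also", "references", "external links"}


def target_chars(lang: str) -> int:
    """Return the chunk-size target for a language code."""
    return TARGET_CHARS.get(lang, DEFAULT_TARGET_CHARS)


def is_low_value_section(header: str) -> bool:
    """Step 2 (simplified): flag sections that carry little retrieval value."""
    return header.strip().lower() in LOW_VALUE_HEADERS


def keep_pair(query: str, document: str) -> bool:
    """Step 3 (one rule only): drop pairs where the query is longer than the document."""
    return len(query) <= len(document)


def split_on_newlines(text: str, target: int) -> List[str]:
    """Step 4 (simplified): greedily pack newline-separated lines into chunks.

    Chunks stay within `target` characters, except that a single line longer
    than `target` is kept whole; a hard cap would be applied separately.
    """
    chunks: List[str] = []
    current = ""
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip empty lines
        candidate = f"{current}\n{line}" if current else line
        if current and len(candidate) > target:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at newlines keeps sentences intact, at the cost of uneven chunk sizes; this is presumably part of why the described pipeline also merges consecutive paragraphs and applies a hard cap.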
## Acknowledgements
Thanks to the authors of [singletongue/wikipedia-paragraphs](https://huggingface.co/datasets/singletongue/wikipedia-paragraphs) for publishing such a useful dataset.
It made it possible to build this IR-focused dataset quickly.
## License
This dataset follows Wikipedia licensing:
- CC-BY-SA 4.0
- GFDL
Provider:
hotchpotch



