hotchpotch/bge-m3-data-finetune-unified
Source: Hugging Face. Updated 2025-12-18. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/hotchpotch/bge-m3-data-finetune-unified
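The config names in the dataset metadata follow the pattern `<source>_len-<lower>-<upper>`, where the upper bound may be `inf`. As a minimal sketch, the helper below splits such a name into its parts so configs can be filtered by source or length bucket. The names `ConfigName`, `parse_config_name`, and `load_config` are hypothetical and not part of the dataset. The unit of the length range is not stated in the metadata, so the code treats the bounds as plain integers. `load_config` assumes the Hugging Face `datasets` library is installed and that the repository is reachable.

```python
import re
from typing import NamedTuple, Optional


class ConfigName(NamedTuple):
    """Parts of a config name such as 'miracl_ar_len-500-1000' (hypothetical helper type)."""
    source: str
    min_len: int
    max_len: Optional[int]  # None stands for the open-ended 'inf' bucket


# Greedy source prefix, then the literal '_len-', then the two bounds.
_CONFIG_RE = re.compile(r"^(?P<source>.+)_len-(?P<lo>\d+)-(?P<hi>\d+|inf)$")


def parse_config_name(name: str) -> ConfigName:
    """Split a config name into source and length bounds; raise ValueError if it does not match."""
    match = _CONFIG_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized config name: {name!r}")
    hi = match.group("hi")
    return ConfigName(
        source=match.group("source"),
        min_len=int(match.group("lo")),
        max_len=None if hi == "inf" else int(hi),
    )


def load_config(name: str):
    """Load the train split of one config (requires the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily so parsing works without the library

    parse_config_name(name)  # fail early on a malformed name
    return load_dataset("hotchpotch/bge-m3-data-finetune-unified", name, split="train")
```

For example, `parse_config_name("dureader_len-7000-inf")` yields the source `dureader`, a lower bound of 7000, and no upper bound. Every config listed in the metadata exposes a single `train` split.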
Resource description:
---
dataset_info:
- config_name: ATEC_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 5970006
num_examples: 11325
download_size: 2923950
dataset_size: 5970006
- config_name: BQ_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 6220822
num_examples: 12599
download_size: 1239675
dataset_size: 6220822
- config_name: LCQMC_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 4484444
num_examples: 10000
download_size: 2878253
dataset_size: 4484444
- config_name: PAWSX_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 11027815
num_examples: 10000
download_size: 7932138
dataset_size: 11027815
- config_name: QBQTC_v2_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 7874062
num_examples: 10000
download_size: 5521390
dataset_size: 7874062
- config_name: STS-B_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 173179
num_examples: 249
download_size: 106741
dataset_size: 173179
- config_name: afqmc_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 5575132
num_examples: 10534
download_size: 2486919
dataset_size: 5575132
- config_name: cMedQAv2_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 1506313240
num_examples: 50000
download_size: 526296863
dataset_size: 1506313240
- config_name: colliee_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 5782070
num_examples: 463
download_size: 267643
dataset_size: 5782070
- config_name: dureader_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 433283844
num_examples: 35172
download_size: 286918744
dataset_size: 433283844
- config_name: dureader_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 275880594
num_examples: 13545
download_size: 176219623
dataset_size: 275880594
- config_name: dureader_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 60711238
num_examples: 2344
download_size: 38473060
dataset_size: 60711238
- config_name: dureader_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 19261817
num_examples: 643
download_size: 11863758
dataset_size: 19261817
- config_name: dureader_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 7228271
num_examples: 198
download_size: 4153618
dataset_size: 7228271
- config_name: dureader_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 447803437
num_examples: 28124
download_size: 291713807
dataset_size: 447803437
- config_name: dureader_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 5375932
num_examples: 140
download_size: 3009507
dataset_size: 5375932
- config_name: dureader_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 3174783
num_examples: 68
download_size: 1601924
dataset_size: 3174783
- config_name: dureader_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 11348713
num_examples: 182
download_size: 6203652
dataset_size: 11348713
- config_name: hotpotqa_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 747412404
num_examples: 84228
download_size: 443806134
dataset_size: 747412404
- config_name: hotpotqa_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 3454175
num_examples: 288
download_size: 1945372
dataset_size: 3454175
- config_name: law_gpt_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 17139477
num_examples: 500
download_size: 5342913
dataset_size: 17139477
- config_name: lecardv2_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 181165743
num_examples: 591
download_size: 83120086
dataset_size: 181165743
- config_name: miracl_ar_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 33660840
num_examples: 422
download_size: 16063554
dataset_size: 33660840
- config_name: miracl_ar_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 96911593
num_examples: 885
download_size: 47441929
dataset_size: 96911593
- config_name: miracl_ar_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 13308541
num_examples: 111
download_size: 6591267
dataset_size: 13308541
- config_name: miracl_ar_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 3925681
num_examples: 33
download_size: 1947250
dataset_size: 3925681
- config_name: miracl_ar_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1588053
num_examples: 13
download_size: 683473
dataset_size: 1588053
- config_name: miracl_ar_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 194519986
num_examples: 2012
download_size: 94534205
dataset_size: 194519986
- config_name: miracl_ar_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1101307
num_examples: 8
download_size: 479455
dataset_size: 1101307
- config_name: miracl_ar_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 384386
num_examples: 3
download_size: 150461
dataset_size: 384386
- config_name: miracl_ar_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1356440
num_examples: 8
download_size: 603263
dataset_size: 1356440
- config_name: miracl_bn_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 13751152
num_examples: 98
download_size: 4991617
dataset_size: 13751152
- config_name: miracl_bn_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 81634211
num_examples: 451
download_size: 30221169
dataset_size: 81634211
- config_name: miracl_bn_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 6226270
num_examples: 34
download_size: 2363654
dataset_size: 6226270
- config_name: miracl_bn_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 4077505
num_examples: 20
download_size: 1493164
dataset_size: 4077505
- config_name: miracl_bn_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 3006630
num_examples: 13
download_size: 1009933
dataset_size: 3006630
- config_name: miracl_bn_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 167716118
num_examples: 1008
download_size: 61905762
dataset_size: 167716118
- config_name: miracl_bn_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1391958
num_examples: 7
download_size: 415031
dataset_size: 1391958
- config_name: miracl_en_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 73666474
num_examples: 1193
download_size: 41337112
dataset_size: 73666474
- config_name: miracl_en_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 10297963
num_examples: 128
download_size: 5831367
dataset_size: 10297963
- config_name: miracl_en_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 270694
num_examples: 3
download_size: 164323
dataset_size: 270694
- config_name: miracl_en_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 212426
num_examples: 2
download_size: 128411
dataset_size: 212426
- config_name: miracl_en_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 111593378
num_examples: 1537
download_size: 63131660
dataset_size: 111593378
- config_name: miracl_es_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 53995069
num_examples: 856
download_size: 30753052
dataset_size: 53995069
- config_name: miracl_es_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 9690352
num_examples: 120
download_size: 5667781
dataset_size: 9690352
- config_name: miracl_es_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1063917
num_examples: 12
download_size: 536712
dataset_size: 1063917
- config_name: miracl_es_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 189936
num_examples: 2
download_size: 127782
dataset_size: 189936
- config_name: miracl_es_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 237177
num_examples: 3
download_size: 134722
dataset_size: 237177
- config_name: miracl_es_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 85709592
num_examples: 1169
download_size: 49811576
dataset_size: 85709592
- config_name: miracl_fa_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 45973561
num_examples: 602
download_size: 20800550
dataset_size: 45973561
- config_name: miracl_fa_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 25076173
num_examples: 228
download_size: 11695842
dataset_size: 25076173
- config_name: miracl_fa_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1877900
num_examples: 16
download_size: 866171
dataset_size: 1877900
- config_name: miracl_fa_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 670773
num_examples: 5
download_size: 290935
dataset_size: 670773
- config_name: miracl_fa_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 123323855
num_examples: 1255
download_size: 57085616
dataset_size: 123323855
- config_name: miracl_fa_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 200342
num_examples: 1
download_size: 114077
dataset_size: 200342
- config_name: miracl_fi_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 113069955
num_examples: 2098
download_size: 67054506
dataset_size: 113069955
- config_name: miracl_fi_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 2258878
num_examples: 34
download_size: 1307244
dataset_size: 2258878
- config_name: miracl_fi_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 337147
num_examples: 5
download_size: 206650
dataset_size: 337147
- config_name: miracl_fi_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 47004520
num_examples: 760
download_size: 28059086
dataset_size: 47004520
- config_name: miracl_fr_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 22609769
num_examples: 404
download_size: 12682212
dataset_size: 22609769
- config_name: miracl_fr_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 4843877
num_examples: 68
download_size: 2750528
dataset_size: 4843877
- config_name: miracl_fr_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 315099
num_examples: 4
download_size: 193193
dataset_size: 315099
- config_name: miracl_fr_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 44881167
num_examples: 667
download_size: 25679983
dataset_size: 44881167
- config_name: miracl_hi_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 11923549
num_examples: 89
download_size: 4441449
dataset_size: 11923549
- config_name: miracl_hi_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 48056784
num_examples: 259
download_size: 18256392
dataset_size: 48056784
- config_name: miracl_hi_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 5337208
num_examples: 28
download_size: 2018607
dataset_size: 5337208
- config_name: miracl_hi_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1736752
num_examples: 8
download_size: 628984
dataset_size: 1736752
- config_name: miracl_hi_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1166966
num_examples: 6
download_size: 410643
dataset_size: 1166966
- config_name: miracl_hi_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 126642415
num_examples: 775
download_size: 47580319
dataset_size: 126642415
- config_name: miracl_hi_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 196066
num_examples: 1
download_size: 91078
dataset_size: 196066
- config_name: miracl_hi_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 784493
num_examples: 3
download_size: 264962
dataset_size: 784493
- config_name: miracl_id_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 129680104
num_examples: 2055
download_size: 70809101
dataset_size: 129680104
- config_name: miracl_id_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 21065645
num_examples: 275
download_size: 11603564
dataset_size: 21065645
- config_name: miracl_id_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1756259
num_examples: 20
download_size: 896099
dataset_size: 1756259
- config_name: miracl_id_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 87990
num_examples: 1
download_size: 59630
dataset_size: 87990
- config_name: miracl_id_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 122212431
num_examples: 1720
download_size: 67569192
dataset_size: 122212431
- config_name: miracl_ja_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 89519591
num_examples: 1478
download_size: 50298380
dataset_size: 89519591
- config_name: miracl_ja_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 15071822
num_examples: 191
download_size: 8401966
dataset_size: 15071822
- config_name: miracl_ja_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 411359
num_examples: 5
download_size: 241606
dataset_size: 411359
- config_name: miracl_ja_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 390260
num_examples: 5
download_size: 216049
dataset_size: 390260
- config_name: miracl_ja_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 433001
num_examples: 5
download_size: 228832
dataset_size: 433001
- config_name: miracl_ja_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 126912129
num_examples: 1790
download_size: 71428154
dataset_size: 126912129
- config_name: miracl_ja_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 105225
num_examples: 1
download_size: 72144
dataset_size: 105225
- config_name: miracl_ja_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 245928
num_examples: 2
download_size: 116153
dataset_size: 245928
- config_name: miracl_ko_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 12370241
num_examples: 211
download_size: 7011751
dataset_size: 12370241
- config_name: miracl_ko_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 8210966
num_examples: 106
download_size: 4682078
dataset_size: 8210966
- config_name: miracl_ko_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 569626
num_examples: 7
download_size: 328405
dataset_size: 569626
- config_name: miracl_ko_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 112087
num_examples: 1
download_size: 73220
dataset_size: 112087
- config_name: miracl_ko_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 38044978
num_examples: 541
download_size: 21827044
dataset_size: 38044978
- config_name: miracl_ko_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 82212
num_examples: 1
download_size: 48315
dataset_size: 82212
- config_name: miracl_ko_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 148931
num_examples: 1
download_size: 100090
dataset_size: 148931
- config_name: miracl_ru_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 121766274
num_examples: 1255
download_size: 59038249
dataset_size: 121766274
- config_name: miracl_ru_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 54376767
num_examples: 416
download_size: 26830875
dataset_size: 54376767
- config_name: miracl_ru_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 3231796
num_examples: 23
download_size: 1604726
dataset_size: 3231796
- config_name: miracl_ru_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 432793
num_examples: 3
download_size: 208938
dataset_size: 432793
- config_name: miracl_ru_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 415602
num_examples: 3
download_size: 212040
dataset_size: 415602
- config_name: miracl_ru_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 347133903
num_examples: 2982
download_size: 171233519
dataset_size: 347133903
- config_name: miracl_ru_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 143313
num_examples: 1
download_size: 87030
dataset_size: 143313
- config_name: miracl_sw_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 38111232
num_examples: 1129
download_size: 20999606
dataset_size: 38111232
- config_name: miracl_sw_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 7478214
num_examples: 132
download_size: 4261592
dataset_size: 7478214
- config_name: miracl_sw_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1554004
num_examples: 35
download_size: 444883
dataset_size: 1554004
- config_name: miracl_sw_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 28309175
num_examples: 605
download_size: 16278061
dataset_size: 28309175
- config_name: miracl_te_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 2318352
num_examples: 18
download_size: 831887
dataset_size: 2318352
- config_name: miracl_te_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 512875325
num_examples: 2349
download_size: 158941642
dataset_size: 512875325
- config_name: miracl_te_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 39530039
num_examples: 166
download_size: 13176546
dataset_size: 39530039
- config_name: miracl_te_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 7128577
num_examples: 30
download_size: 2550785
dataset_size: 7128577
- config_name: miracl_te_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 3843812
num_examples: 16
download_size: 1514525
dataset_size: 3843812
- config_name: miracl_te_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 159402673
num_examples: 862
download_size: 61311898
dataset_size: 159402673
- config_name: miracl_te_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 2974515
num_examples: 11
download_size: 1083282
dataset_size: 2974515
- config_name: miracl_th_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 128795901
num_examples: 933
download_size: 47523326
dataset_size: 128795901
- config_name: miracl_th_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 50999684
num_examples: 302
download_size: 19011281
dataset_size: 50999684
- config_name: miracl_th_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 12976578
num_examples: 68
download_size: 4802677
dataset_size: 12976578
- config_name: miracl_th_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 3559821
num_examples: 15
download_size: 1304327
dataset_size: 3559821
- config_name: miracl_th_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 3826991
num_examples: 20
download_size: 1315481
dataset_size: 3826991
- config_name: miracl_th_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 261969342
num_examples: 1634
download_size: 97305393
dataset_size: 261969342
- config_name: miracl_zh_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 32330558
num_examples: 642
download_size: 20739700
dataset_size: 32330558
- config_name: miracl_zh_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 4329136
num_examples: 67
download_size: 2779207
dataset_size: 4329136
- config_name: miracl_zh_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 189128
num_examples: 3
download_size: 137739
dataset_size: 189128
- config_name: miracl_zh_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 35265023
num_examples: 600
download_size: 23012711
dataset_size: 35265023
- config_name: mldr_ar_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 893721
num_examples: 4
download_size: 407342
dataset_size: 893721
- config_name: mldr_ar_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 373320
num_examples: 2
download_size: 152026
dataset_size: 373320
- config_name: mldr_ar_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 27924175
num_examples: 91
download_size: 13240216
dataset_size: 27924175
- config_name: mldr_ar_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 882182624
num_examples: 1720
download_size: 421130520
dataset_size: 882182624
- config_name: mldr_de_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 109333
num_examples: 1
download_size: 74254
dataset_size: 109333
- config_name: mldr_de_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 9534182
num_examples: 65
download_size: 5463891
dataset_size: 9534182
- config_name: mldr_de_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 513339675
num_examples: 1781
download_size: 291925417
dataset_size: 513339675
- config_name: mldr_en_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 834714
num_examples: 6
download_size: 450540
dataset_size: 834714
- config_name: mldr_en_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 6545614
num_examples: 38
download_size: 3650194
dataset_size: 6545614
- config_name: mldr_en_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 25274901
num_examples: 130
download_size: 14419449
dataset_size: 25274901
- config_name: mldr_en_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 61163372
num_examples: 280
download_size: 34796244
dataset_size: 61163372
- config_name: mldr_en_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 169378465
num_examples: 695
download_size: 96133520
dataset_size: 169378465
- config_name: mldr_en_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 3232103418
num_examples: 8851
download_size: 1799713395
dataset_size: 3232103418
- config_name: mldr_es_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 527480
num_examples: 3
download_size: 308917
dataset_size: 527480
- config_name: mldr_es_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 379021
num_examples: 2
download_size: 233224
dataset_size: 379021
- config_name: mldr_es_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 31986844
num_examples: 123
download_size: 18895987
dataset_size: 31986844
- config_name: mldr_es_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 770022188
num_examples: 2126
download_size: 449536327
dataset_size: 770022188
- config_name: mldr_fr_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 163375
num_examples: 1
download_size: 98688
dataset_size: 163375
- config_name: mldr_fr_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 322229
num_examples: 2
download_size: 176449
dataset_size: 322229
- config_name: mldr_fr_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 13065037
num_examples: 59
download_size: 7565693
dataset_size: 13065037
- config_name: mldr_fr_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 519513278
num_examples: 1546
download_size: 298756787
dataset_size: 519513278
- config_name: mldr_hi_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 4262782
num_examples: 10
download_size: 1587532
dataset_size: 4262782
- config_name: mldr_hi_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 53851863
num_examples: 102
download_size: 20033055
dataset_size: 53851863
- config_name: mldr_hi_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 1144693246
num_examples: 1506
download_size: 422841377
dataset_size: 1144693246
- config_name: mldr_it_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 427280
num_examples: 2
download_size: 247914
dataset_size: 427280
- config_name: mldr_it_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 10368973
num_examples: 40
download_size: 6087589
dataset_size: 10368973
- config_name: mldr_it_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 809724805
num_examples: 2109
download_size: 480868853
dataset_size: 809724805
- config_name: mldr_ja_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 148370
num_examples: 1
download_size: 93088
dataset_size: 148370
- config_name: mldr_ja_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 239548
num_examples: 1
download_size: 133215
dataset_size: 239548
- config_name: mldr_ja_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 1459218
num_examples: 6
download_size: 731686
dataset_size: 1459218
- config_name: mldr_ja_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 28534447
num_examples: 105
download_size: 16169058
dataset_size: 28534447
- config_name: mldr_ja_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 873780430
num_examples: 2149
download_size: 485339363
dataset_size: 873780430
- config_name: mldr_ko_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 630831
num_examples: 4
download_size: 358470
dataset_size: 630831
- config_name: mldr_ko_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 1277357
num_examples: 7
download_size: 712850
dataset_size: 1277357
- config_name: mldr_ko_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 201507
num_examples: 1
download_size: 118101
dataset_size: 201507
- config_name: mldr_ko_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 17287246
num_examples: 77
download_size: 9805654
dataset_size: 17287246
- config_name: mldr_ko_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 710658200
num_examples: 2109
download_size: 401885314
dataset_size: 710658200
- config_name: mldr_pt_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 688801
num_examples: 3
download_size: 382542
dataset_size: 688801
- config_name: mldr_pt_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 7710579
num_examples: 30
download_size: 4486709
dataset_size: 7710579
- config_name: mldr_pt_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 726184130
num_examples: 1812
download_size: 424739418
dataset_size: 726184130
- config_name: mldr_ru_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 269667
num_examples: 1
download_size: 139605
dataset_size: 269667
- config_name: mldr_ru_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 6549245
num_examples: 17
download_size: 3154413
dataset_size: 6549245
- config_name: mldr_ru_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 54668077
num_examples: 124
download_size: 26453691
dataset_size: 54668077
- config_name: mldr_ru_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 1068869887
num_examples: 1722
download_size: 513909995
dataset_size: 1068869887
- config_name: mldr_th_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 345041
num_examples: 4
download_size: 134956
dataset_size: 345041
- config_name: mldr_th_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 542343
num_examples: 5
download_size: 199138
dataset_size: 542343
- config_name: mldr_th_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 918047
num_examples: 8
download_size: 355232
dataset_size: 918047
- config_name: mldr_th_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 444431
num_examples: 4
download_size: 173604
dataset_size: 444431
- config_name: mldr_th_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 153711
num_examples: 1
download_size: 67370
dataset_size: 153711
- config_name: mldr_th_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 31638355
num_examples: 137
download_size: 11728640
dataset_size: 31638355
- config_name: mldr_th_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 823617348
num_examples: 1811
download_size: 313841308
dataset_size: 823617348
- config_name: mldr_zh_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 90373
num_examples: 1
download_size: 54737
dataset_size: 90373
- config_name: mldr_zh_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 105050
num_examples: 1
download_size: 47309
dataset_size: 105050
- config_name: mldr_zh_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 1221668
num_examples: 7
download_size: 604043
dataset_size: 1221668
- config_name: mldr_zh_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 8970561
num_examples: 44
download_size: 5562759
dataset_size: 8970561
- config_name: mldr_zh_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 34281928
num_examples: 137
download_size: 21377471
dataset_size: 34281928
- config_name: mldr_zh_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 155153682
num_examples: 550
download_size: 95006381
dataset_size: 155153682
- config_name: mldr_zh_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 3814797659
num_examples: 9260
download_size: 2356033496
dataset_size: 3814797659
- config_name: mmarco_chinese_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 514566583
num_examples: 100000
download_size: 325460355
dataset_size: 514566583
- config_name: mr-tydi_arabic_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 109831291
num_examples: 1400
download_size: 52934924
dataset_size: 109831291
- config_name: mr-tydi_arabic_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 364612496
num_examples: 3334
download_size: 179663040
dataset_size: 364612496
- config_name: mr-tydi_arabic_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 54296933
num_examples: 437
download_size: 26595161
dataset_size: 54296933
- config_name: mr-tydi_arabic_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 17810570
num_examples: 139
download_size: 8855937
dataset_size: 17810570
- config_name: mr-tydi_arabic_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 4293037
num_examples: 36
download_size: 2010238
dataset_size: 4293037
- config_name: mr-tydi_arabic_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 668941248
num_examples: 6994
download_size: 326870902
dataset_size: 668941248
- config_name: mr-tydi_arabic_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1200109
num_examples: 9
download_size: 470739
dataset_size: 1200109
- config_name: mr-tydi_arabic_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1663811
num_examples: 13
download_size: 428516
dataset_size: 1663811
- config_name: mr-tydi_arabic_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 2143387
num_examples: 15
download_size: 895954
dataset_size: 2143387
- config_name: mr-tydi_bengali_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 15178521
num_examples: 112
download_size: 5546971
dataset_size: 15178521
- config_name: mr-tydi_bengali_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 71166136
num_examples: 396
download_size: 26453230
dataset_size: 71166136
- config_name: mr-tydi_bengali_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 7469245
num_examples: 41
download_size: 2803707
dataset_size: 7469245
- config_name: mr-tydi_bengali_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 2375946
num_examples: 12
download_size: 863301
dataset_size: 2375946
- config_name: mr-tydi_bengali_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1981432
num_examples: 9
download_size: 680271
dataset_size: 1981432
- config_name: mr-tydi_bengali_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 182866914
num_examples: 1131
download_size: 67762515
dataset_size: 182866914
- config_name: mr-tydi_bengali_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1502452
num_examples: 8
download_size: 420489
dataset_size: 1502452
- config_name: mr-tydi_bengali_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 925581
num_examples: 4
download_size: 287677
dataset_size: 925581
- config_name: mr-tydi_english_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 88083413
num_examples: 1452
download_size: 49537360
dataset_size: 88083413
- config_name: mr-tydi_english_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 15769495
num_examples: 206
download_size: 8959983
dataset_size: 15769495
- config_name: mr-tydi_english_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1166656
num_examples: 15
download_size: 628836
dataset_size: 1166656
- config_name: mr-tydi_english_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 526285
num_examples: 5
download_size: 302625
dataset_size: 526285
- config_name: mr-tydi_english_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 341130
num_examples: 4
download_size: 193671
dataset_size: 341130
- config_name: mr-tydi_english_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 131725671
num_examples: 1864
download_size: 74886308
dataset_size: 131725671
- config_name: mr-tydi_english_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 118571
num_examples: 1
download_size: 78091
dataset_size: 118571
- config_name: mr-tydi_finnish_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 223439402
num_examples: 4152
download_size: 132585080
dataset_size: 223439402
- config_name: mr-tydi_finnish_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 16396022
num_examples: 264
download_size: 9741476
dataset_size: 16396022
- config_name: mr-tydi_finnish_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 5242522
num_examples: 84
download_size: 3052996
dataset_size: 5242522
- config_name: mr-tydi_finnish_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 139342
num_examples: 2
download_size: 86591
dataset_size: 139342
- config_name: mr-tydi_finnish_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 116902
num_examples: 2
download_size: 69754
dataset_size: 116902
- config_name: mr-tydi_finnish_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 124364870
num_examples: 2043
download_size: 74472384
dataset_size: 124364870
- config_name: mr-tydi_finnish_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 905421
num_examples: 10
download_size: 350451
dataset_size: 905421
- config_name: mr-tydi_finnish_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 310536
num_examples: 4
download_size: 145261
dataset_size: 310536
- config_name: mr-tydi_indonesian_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 141798318
num_examples: 2310
download_size: 77676550
dataset_size: 141798318
- config_name: mr-tydi_indonesian_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 30582758
num_examples: 415
download_size: 17064542
dataset_size: 30582758
- config_name: mr-tydi_indonesian_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 4960304
num_examples: 60
download_size: 2668392
dataset_size: 4960304
- config_name: mr-tydi_indonesian_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1183880
num_examples: 11
download_size: 418275
dataset_size: 1183880
- config_name: mr-tydi_indonesian_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 488368
num_examples: 5
download_size: 239425
dataset_size: 488368
- config_name: mr-tydi_indonesian_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 144492491
num_examples: 2098
download_size: 80151689
dataset_size: 144492491
- config_name: mr-tydi_indonesian_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 465861
num_examples: 3
download_size: 243140
dataset_size: 465861
- config_name: mr-tydi_japanese_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 66287105
num_examples: 1104
download_size: 37312208
dataset_size: 66287105
- config_name: mr-tydi_japanese_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 37895545
num_examples: 507
download_size: 21353355
dataset_size: 37895545
- config_name: mr-tydi_japanese_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 11177970
num_examples: 135
download_size: 6191742
dataset_size: 11177970
- config_name: mr-tydi_japanese_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 3898518
num_examples: 44
download_size: 2107551
dataset_size: 3898518
- config_name: mr-tydi_japanese_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 3385346
num_examples: 34
download_size: 1744962
dataset_size: 3385346
- config_name: mr-tydi_japanese_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 129362291
num_examples: 1857
download_size: 72929065
dataset_size: 129362291
- config_name: mr-tydi_japanese_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1031689
num_examples: 9
download_size: 531611
dataset_size: 1031689
- config_name: mr-tydi_japanese_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 780374
num_examples: 7
download_size: 300014
dataset_size: 780374
- config_name: mr-tydi_korean_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 17252978
num_examples: 293
download_size: 9847774
dataset_size: 17252978
- config_name: mr-tydi_korean_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 13308245
num_examples: 173
download_size: 7635761
dataset_size: 13308245
- config_name: mr-tydi_korean_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1874814
num_examples: 22
download_size: 983812
dataset_size: 1874814
- config_name: mr-tydi_korean_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 123949
num_examples: 1
download_size: 79337
dataset_size: 123949
- config_name: mr-tydi_korean_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 55548388
num_examples: 801
download_size: 32020512
dataset_size: 55548388
- config_name: mr-tydi_korean_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 506148
num_examples: 5
download_size: 204625
dataset_size: 506148
- config_name: mr-tydi_russian_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 107148401
num_examples: 1139
download_size: 52008602
dataset_size: 107148401
- config_name: mr-tydi_russian_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 96921758
num_examples: 787
download_size: 47648140
dataset_size: 96921758
- config_name: mr-tydi_russian_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 13253241
num_examples: 102
download_size: 6496787
dataset_size: 13253241
- config_name: mr-tydi_russian_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 5145699
num_examples: 41
download_size: 2500744
dataset_size: 5145699
- config_name: mr-tydi_russian_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 2115118
num_examples: 17
download_size: 941263
dataset_size: 2115118
- config_name: mr-tydi_russian_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 371247497
num_examples: 3264
download_size: 183091309
dataset_size: 371247497
- config_name: mr-tydi_russian_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1691246
num_examples: 11
download_size: 799287
dataset_size: 1691246
- config_name: mr-tydi_russian_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 486505
num_examples: 3
download_size: 239819
dataset_size: 486505
- config_name: mr-tydi_russian_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 336319
num_examples: 2
download_size: 169746
dataset_size: 336319
- config_name: mr-tydi_swahili_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 32623622
num_examples: 891
download_size: 18315585
dataset_size: 32623622
- config_name: mr-tydi_swahili_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 11660403
num_examples: 235
download_size: 6526721
dataset_size: 11660403
- config_name: mr-tydi_swahili_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 878141
num_examples: 20
download_size: 332091
dataset_size: 878141
- config_name: mr-tydi_swahili_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1549496
num_examples: 35
download_size: 446721
dataset_size: 1549496
- config_name: mr-tydi_swahili_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 37994264
num_examples: 891
download_size: 21852760
dataset_size: 37994264
- config_name: mr-tydi_telugu_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 3810906
num_examples: 30
download_size: 1342241
dataset_size: 3810906
- config_name: mr-tydi_telugu_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 506945921
num_examples: 2574
download_size: 155659518
dataset_size: 506945921
- config_name: mr-tydi_telugu_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 39824999
num_examples: 187
download_size: 13700530
dataset_size: 39824999
- config_name: mr-tydi_telugu_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 7929121
num_examples: 35
download_size: 2926610
dataset_size: 7929121
- config_name: mr-tydi_telugu_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1595960
num_examples: 7
download_size: 634904
dataset_size: 1595960
- config_name: mr-tydi_telugu_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 184695730
num_examples: 1039
download_size: 71256267
dataset_size: 184695730
- config_name: mr-tydi_telugu_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 2089040
num_examples: 8
download_size: 740666
dataset_size: 2089040
- config_name: mr-tydi_thai_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 114296856
num_examples: 843
download_size: 42487524
dataset_size: 114296856
- config_name: mr-tydi_thai_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 89508170
num_examples: 532
download_size: 33086505
dataset_size: 89508170
- config_name: mr-tydi_thai_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 27441702
num_examples: 151
download_size: 10173779
dataset_size: 27441702
- config_name: mr-tydi_thai_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 7604146
num_examples: 38
download_size: 2728986
dataset_size: 7604146
- config_name: mr-tydi_thai_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1261330
num_examples: 6
download_size: 438155
dataset_size: 1261330
- config_name: mr-tydi_thai_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 267519554
num_examples: 1740
download_size: 99804944
dataset_size: 267519554
- config_name: mr-tydi_thai_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 2323538
num_examples: 9
download_size: 554048
dataset_size: 2323538
- config_name: msmarco_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 36846421571
num_examples: 476968
download_size: 17203505423
dataset_size: 36846421571
- config_name: msmarco_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 221880847
num_examples: 2655
download_size: 110073583
dataset_size: 221880847
- config_name: msmarco_len-2000-3000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 76254058
num_examples: 859
download_size: 37992903
dataset_size: 76254058
- config_name: msmarco_len-3000-4000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 46382653
num_examples: 430
download_size: 21559992
dataset_size: 46382653
- config_name: msmarco_len-4000-5000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 13365567
num_examples: 120
download_size: 6017121
dataset_size: 13365567
- config_name: msmarco_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 335779279
num_examples: 4245
download_size: 167544770
dataset_size: 335779279
- config_name: msmarco_len-5000-6000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 7235495
num_examples: 72
download_size: 3556526
dataset_size: 7235495
- config_name: msmarco_len-6000-7000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 14032531
num_examples: 110
download_size: 6482430
dataset_size: 14032531
- config_name: msmarco_len-7000-inf
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 79268072
num_examples: 446
download_size: 42191834
dataset_size: 79268072
- config_name: nli_for_simcse_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 131931702
num_examples: 274951
download_size: 82321217
dataset_size: 131931702
- config_name: nq_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 3739882427
num_examples: 58554
download_size: 2078742331
dataset_size: 3739882427
- config_name: nq_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 905220
num_examples: 14
download_size: 449370
dataset_size: 905220
- config_name: pubmed_qa_labeled_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 4107058
num_examples: 500
download_size: 2303044
dataset_size: 4107058
- config_name: squad_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 512895039
num_examples: 85710
download_size: 295437582
dataset_size: 512895039
- config_name: squad_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 14482542
num_examples: 1889
download_size: 8514097
dataset_size: 14482542
- config_name: t2ranking_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 45761264
num_examples: 3837
download_size: 30374711
dataset_size: 45761264
- config_name: t2ranking_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
- name: pos_scores
list: float64
- name: neg_scores
list: float64
splits:
- name: train
num_bytes: 1645794162
num_examples: 86630
download_size: 1075778601
dataset_size: 1645794162
- config_name: trivia_len-0-500
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 3840129530
num_examples: 60283
download_size: 2039630774
dataset_size: 3840129530
- config_name: trivia_len-1000-2000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 461799
num_examples: 7
download_size: 121400
dataset_size: 461799
- config_name: trivia_len-500-1000
features:
- name: query
dtype: string
- name: pos
list: string
- name: neg
list: string
splits:
- name: train
num_bytes: 1645921
num_examples: 25
download_size: 836676
dataset_size: 1645921
configs:
- config_name: ATEC_len-0-500
data_files:
- split: train
path: ATEC_len-0-500/train-*
- config_name: BQ_len-0-500
data_files:
- split: train
path: BQ_len-0-500/train-*
- config_name: LCQMC_len-0-500
data_files:
- split: train
path: LCQMC_len-0-500/train-*
- config_name: PAWSX_len-0-500
data_files:
- split: train
path: PAWSX_len-0-500/train-*
- config_name: QBQTC_v2_len-0-500
data_files:
- split: train
path: QBQTC_v2_len-0-500/train-*
- config_name: STS-B_len-0-500
data_files:
- split: train
path: STS-B_len-0-500/train-*
- config_name: afqmc_len-0-500
data_files:
- split: train
path: afqmc_len-0-500/train-*
- config_name: cMedQAv2_len-0-500
data_files:
- split: train
path: cMedQAv2_len-0-500/train-*
- config_name: colliee_len-0-500
data_files:
- split: train
path: colliee_len-0-500/train-*
- config_name: dureader_len-0-500
data_files:
- split: train
path: dureader_len-0-500/train-*
- config_name: dureader_len-1000-2000
data_files:
- split: train
path: dureader_len-1000-2000/train-*
- config_name: dureader_len-2000-3000
data_files:
- split: train
path: dureader_len-2000-3000/train-*
- config_name: dureader_len-3000-4000
data_files:
- split: train
path: dureader_len-3000-4000/train-*
- config_name: dureader_len-4000-5000
data_files:
- split: train
path: dureader_len-4000-5000/train-*
- config_name: dureader_len-500-1000
data_files:
- split: train
path: dureader_len-500-1000/train-*
- config_name: dureader_len-5000-6000
data_files:
- split: train
path: dureader_len-5000-6000/train-*
- config_name: dureader_len-6000-7000
data_files:
- split: train
path: dureader_len-6000-7000/train-*
- config_name: dureader_len-7000-inf
data_files:
- split: train
path: dureader_len-7000-inf/train-*
- config_name: hotpotqa_len-0-500
data_files:
- split: train
path: hotpotqa_len-0-500/train-*
- config_name: hotpotqa_len-500-1000
data_files:
- split: train
path: hotpotqa_len-500-1000/train-*
- config_name: law_gpt_len-0-500
data_files:
- split: train
path: law_gpt_len-0-500/train-*
- config_name: lecardv2_len-7000-inf
data_files:
- split: train
path: lecardv2_len-7000-inf/train-*
- config_name: miracl_ar_len-0-500
data_files:
- split: train
path: miracl_ar_len-0-500/train-*
- config_name: miracl_ar_len-1000-2000
data_files:
- split: train
path: miracl_ar_len-1000-2000/train-*
- config_name: miracl_ar_len-2000-3000
data_files:
- split: train
path: miracl_ar_len-2000-3000/train-*
- config_name: miracl_ar_len-3000-4000
data_files:
- split: train
path: miracl_ar_len-3000-4000/train-*
- config_name: miracl_ar_len-4000-5000
data_files:
- split: train
path: miracl_ar_len-4000-5000/train-*
- config_name: miracl_ar_len-500-1000
data_files:
- split: train
path: miracl_ar_len-500-1000/train-*
- config_name: miracl_ar_len-5000-6000
data_files:
- split: train
path: miracl_ar_len-5000-6000/train-*
- config_name: miracl_ar_len-6000-7000
data_files:
- split: train
path: miracl_ar_len-6000-7000/train-*
- config_name: miracl_ar_len-7000-inf
data_files:
- split: train
path: miracl_ar_len-7000-inf/train-*
- config_name: miracl_bn_len-0-500
data_files:
- split: train
path: miracl_bn_len-0-500/train-*
- config_name: miracl_bn_len-1000-2000
data_files:
- split: train
path: miracl_bn_len-1000-2000/train-*
- config_name: miracl_bn_len-2000-3000
data_files:
- split: train
path: miracl_bn_len-2000-3000/train-*
- config_name: miracl_bn_len-3000-4000
data_files:
- split: train
path: miracl_bn_len-3000-4000/train-*
- config_name: miracl_bn_len-4000-5000
data_files:
- split: train
path: miracl_bn_len-4000-5000/train-*
- config_name: miracl_bn_len-500-1000
data_files:
- split: train
path: miracl_bn_len-500-1000/train-*
- config_name: miracl_bn_len-5000-6000
data_files:
- split: train
path: miracl_bn_len-5000-6000/train-*
- config_name: miracl_en_len-0-500
data_files:
- split: train
path: miracl_en_len-0-500/train-*
- config_name: miracl_en_len-1000-2000
data_files:
- split: train
path: miracl_en_len-1000-2000/train-*
- config_name: miracl_en_len-2000-3000
data_files:
- split: train
path: miracl_en_len-2000-3000/train-*
- config_name: miracl_en_len-3000-4000
data_files:
- split: train
path: miracl_en_len-3000-4000/train-*
- config_name: miracl_en_len-500-1000
data_files:
- split: train
path: miracl_en_len-500-1000/train-*
- config_name: miracl_es_len-0-500
data_files:
- split: train
path: miracl_es_len-0-500/train-*
- config_name: miracl_es_len-1000-2000
data_files:
- split: train
path: miracl_es_len-1000-2000/train-*
- config_name: miracl_es_len-2000-3000
data_files:
- split: train
path: miracl_es_len-2000-3000/train-*
- config_name: miracl_es_len-3000-4000
data_files:
- split: train
path: miracl_es_len-3000-4000/train-*
- config_name: miracl_es_len-4000-5000
data_files:
- split: train
path: miracl_es_len-4000-5000/train-*
- config_name: miracl_es_len-500-1000
data_files:
- split: train
path: miracl_es_len-500-1000/train-*
- config_name: miracl_fa_len-0-500
data_files:
- split: train
path: miracl_fa_len-0-500/train-*
- config_name: miracl_fa_len-1000-2000
data_files:
- split: train
path: miracl_fa_len-1000-2000/train-*
- config_name: miracl_fa_len-2000-3000
data_files:
- split: train
path: miracl_fa_len-2000-3000/train-*
- config_name: miracl_fa_len-3000-4000
data_files:
- split: train
path: miracl_fa_len-3000-4000/train-*
- config_name: miracl_fa_len-500-1000
data_files:
- split: train
path: miracl_fa_len-500-1000/train-*
- config_name: miracl_fa_len-7000-inf
data_files:
- split: train
path: miracl_fa_len-7000-inf/train-*
- config_name: miracl_fi_len-0-500
data_files:
- split: train
path: miracl_fi_len-0-500/train-*
- config_name: miracl_fi_len-1000-2000
data_files:
- split: train
path: miracl_fi_len-1000-2000/train-*
- config_name: miracl_fi_len-2000-3000
data_files:
- split: train
path: miracl_fi_len-2000-3000/train-*
- config_name: miracl_fi_len-500-1000
data_files:
- split: train
path: miracl_fi_len-500-1000/train-*
- config_name: miracl_fr_len-0-500
data_files:
- split: train
path: miracl_fr_len-0-500/train-*
- config_name: miracl_fr_len-1000-2000
data_files:
- split: train
path: miracl_fr_len-1000-2000/train-*
- config_name: miracl_fr_len-2000-3000
data_files:
- split: train
path: miracl_fr_len-2000-3000/train-*
- config_name: miracl_fr_len-500-1000
data_files:
- split: train
path: miracl_fr_len-500-1000/train-*
- config_name: miracl_hi_len-0-500
data_files:
- split: train
path: miracl_hi_len-0-500/train-*
- config_name: miracl_hi_len-1000-2000
data_files:
- split: train
path: miracl_hi_len-1000-2000/train-*
- config_name: miracl_hi_len-2000-3000
data_files:
- split: train
path: miracl_hi_len-2000-3000/train-*
- config_name: miracl_hi_len-3000-4000
data_files:
- split: train
path: miracl_hi_len-3000-4000/train-*
- config_name: miracl_hi_len-4000-5000
data_files:
- split: train
path: miracl_hi_len-4000-5000/train-*
- config_name: miracl_hi_len-500-1000
data_files:
- split: train
path: miracl_hi_len-500-1000/train-*
- config_name: miracl_hi_len-5000-6000
data_files:
- split: train
path: miracl_hi_len-5000-6000/train-*
- config_name: miracl_hi_len-7000-inf
data_files:
- split: train
path: miracl_hi_len-7000-inf/train-*
- config_name: miracl_id_len-0-500
data_files:
- split: train
path: miracl_id_len-0-500/train-*
- config_name: miracl_id_len-1000-2000
data_files:
- split: train
path: miracl_id_len-1000-2000/train-*
- config_name: miracl_id_len-2000-3000
data_files:
- split: train
path: miracl_id_len-2000-3000/train-*
- config_name: miracl_id_len-3000-4000
data_files:
- split: train
path: miracl_id_len-3000-4000/train-*
- config_name: miracl_id_len-500-1000
data_files:
- split: train
path: miracl_id_len-500-1000/train-*
- config_name: miracl_ja_len-0-500
data_files:
- split: train
path: miracl_ja_len-0-500/train-*
- config_name: miracl_ja_len-1000-2000
data_files:
- split: train
path: miracl_ja_len-1000-2000/train-*
- config_name: miracl_ja_len-2000-3000
data_files:
- split: train
path: miracl_ja_len-2000-3000/train-*
- config_name: miracl_ja_len-3000-4000
data_files:
- split: train
path: miracl_ja_len-3000-4000/train-*
- config_name: miracl_ja_len-4000-5000
data_files:
- split: train
path: miracl_ja_len-4000-5000/train-*
- config_name: miracl_ja_len-500-1000
data_files:
- split: train
path: miracl_ja_len-500-1000/train-*
- config_name: miracl_ja_len-6000-7000
data_files:
- split: train
path: miracl_ja_len-6000-7000/train-*
- config_name: miracl_ja_len-7000-inf
data_files:
- split: train
path: miracl_ja_len-7000-inf/train-*
- config_name: miracl_ko_len-0-500
data_files:
- split: train
path: miracl_ko_len-0-500/train-*
- config_name: miracl_ko_len-1000-2000
data_files:
- split: train
path: miracl_ko_len-1000-2000/train-*
- config_name: miracl_ko_len-2000-3000
data_files:
- split: train
path: miracl_ko_len-2000-3000/train-*
- config_name: miracl_ko_len-3000-4000
data_files:
- split: train
path: miracl_ko_len-3000-4000/train-*
- config_name: miracl_ko_len-500-1000
data_files:
- split: train
path: miracl_ko_len-500-1000/train-*
- config_name: miracl_ko_len-5000-6000
data_files:
- split: train
path: miracl_ko_len-5000-6000/train-*
- config_name: miracl_ko_len-7000-inf
data_files:
- split: train
path: miracl_ko_len-7000-inf/train-*
- config_name: miracl_ru_len-0-500
data_files:
- split: train
path: miracl_ru_len-0-500/train-*
- config_name: miracl_ru_len-1000-2000
data_files:
- split: train
path: miracl_ru_len-1000-2000/train-*
- config_name: miracl_ru_len-2000-3000
data_files:
- split: train
path: miracl_ru_len-2000-3000/train-*
- config_name: miracl_ru_len-3000-4000
data_files:
- split: train
path: miracl_ru_len-3000-4000/train-*
- config_name: miracl_ru_len-4000-5000
data_files:
- split: train
path: miracl_ru_len-4000-5000/train-*
- config_name: miracl_ru_len-500-1000
data_files:
- split: train
path: miracl_ru_len-500-1000/train-*
- config_name: miracl_ru_len-7000-inf
data_files:
- split: train
path: miracl_ru_len-7000-inf/train-*
- config_name: miracl_sw_len-0-500
data_files:
- split: train
path: miracl_sw_len-0-500/train-*
- config_name: miracl_sw_len-1000-2000
data_files:
- split: train
path: miracl_sw_len-1000-2000/train-*
- config_name: miracl_sw_len-3000-4000
data_files:
- split: train
path: miracl_sw_len-3000-4000/train-*
- config_name: miracl_sw_len-500-1000
data_files:
- split: train
path: miracl_sw_len-500-1000/train-*
- config_name: miracl_te_len-0-500
data_files:
- split: train
path: miracl_te_len-0-500/train-*
- config_name: miracl_te_len-1000-2000
data_files:
- split: train
path: miracl_te_len-1000-2000/train-*
- config_name: miracl_te_len-2000-3000
data_files:
- split: train
path: miracl_te_len-2000-3000/train-*
- config_name: miracl_te_len-3000-4000
data_files:
- split: train
path: miracl_te_len-3000-4000/train-*
- config_name: miracl_te_len-4000-5000
data_files:
- split: train
path: miracl_te_len-4000-5000/train-*
- config_name: miracl_te_len-500-1000
data_files:
- split: train
path: miracl_te_len-500-1000/train-*
- config_name: miracl_te_len-5000-6000
data_files:
- split: train
path: miracl_te_len-5000-6000/train-*
- config_name: miracl_th_len-0-500
data_files:
- split: train
path: miracl_th_len-0-500/train-*
- config_name: miracl_th_len-1000-2000
data_files:
- split: train
path: miracl_th_len-1000-2000/train-*
- config_name: miracl_th_len-2000-3000
data_files:
- split: train
path: miracl_th_len-2000-3000/train-*
- config_name: miracl_th_len-3000-4000
data_files:
- split: train
path: miracl_th_len-3000-4000/train-*
- config_name: miracl_th_len-4000-5000
data_files:
- split: train
path: miracl_th_len-4000-5000/train-*
- config_name: miracl_th_len-500-1000
data_files:
- split: train
path: miracl_th_len-500-1000/train-*
- config_name: miracl_zh_len-0-500
data_files:
- split: train
path: miracl_zh_len-0-500/train-*
- config_name: miracl_zh_len-1000-2000
data_files:
- split: train
path: miracl_zh_len-1000-2000/train-*
- config_name: miracl_zh_len-2000-3000
data_files:
- split: train
path: miracl_zh_len-2000-3000/train-*
- config_name: miracl_zh_len-500-1000
data_files:
- split: train
path: miracl_zh_len-500-1000/train-*
- config_name: mldr_ar_len-4000-5000
data_files:
- split: train
path: mldr_ar_len-4000-5000/train-*
- config_name: mldr_ar_len-5000-6000
data_files:
- split: train
path: mldr_ar_len-5000-6000/train-*
- config_name: mldr_ar_len-6000-7000
data_files:
- split: train
path: mldr_ar_len-6000-7000/train-*
- config_name: mldr_ar_len-7000-inf
data_files:
- split: train
path: mldr_ar_len-7000-inf/train-*
- config_name: mldr_de_len-5000-6000
data_files:
- split: train
path: mldr_de_len-5000-6000/train-*
- config_name: mldr_de_len-6000-7000
data_files:
- split: train
path: mldr_de_len-6000-7000/train-*
- config_name: mldr_de_len-7000-inf
data_files:
- split: train
path: mldr_de_len-7000-inf/train-*
- config_name: mldr_en_len-2000-3000
data_files:
- split: train
path: mldr_en_len-2000-3000/train-*
- config_name: mldr_en_len-3000-4000
data_files:
- split: train
path: mldr_en_len-3000-4000/train-*
- config_name: mldr_en_len-4000-5000
data_files:
- split: train
path: mldr_en_len-4000-5000/train-*
- config_name: mldr_en_len-5000-6000
data_files:
- split: train
path: mldr_en_len-5000-6000/train-*
- config_name: mldr_en_len-6000-7000
data_files:
- split: train
path: mldr_en_len-6000-7000/train-*
- config_name: mldr_en_len-7000-inf
data_files:
- split: train
path: mldr_en_len-7000-inf/train-*
- config_name: mldr_es_len-4000-5000
data_files:
- split: train
path: mldr_es_len-4000-5000/train-*
- config_name: mldr_es_len-5000-6000
data_files:
- split: train
path: mldr_es_len-5000-6000/train-*
- config_name: mldr_es_len-6000-7000
data_files:
- split: train
path: mldr_es_len-6000-7000/train-*
- config_name: mldr_es_len-7000-inf
data_files:
- split: train
path: mldr_es_len-7000-inf/train-*
- config_name: mldr_fr_len-4000-5000
data_files:
- split: train
path: mldr_fr_len-4000-5000/train-*
- config_name: mldr_fr_len-5000-6000
data_files:
- split: train
path: mldr_fr_len-5000-6000/train-*
- config_name: mldr_fr_len-6000-7000
data_files:
- split: train
path: mldr_fr_len-6000-7000/train-*
- config_name: mldr_fr_len-7000-inf
data_files:
- split: train
path: mldr_fr_len-7000-inf/train-*
- config_name: mldr_hi_len-5000-6000
data_files:
- split: train
path: mldr_hi_len-5000-6000/train-*
- config_name: mldr_hi_len-6000-7000
data_files:
- split: train
path: mldr_hi_len-6000-7000/train-*
- config_name: mldr_hi_len-7000-inf
data_files:
- split: train
path: mldr_hi_len-7000-inf/train-*
- config_name: mldr_it_len-5000-6000
data_files:
- split: train
path: mldr_it_len-5000-6000/train-*
- config_name: mldr_it_len-6000-7000
data_files:
- split: train
path: mldr_it_len-6000-7000/train-*
- config_name: mldr_it_len-7000-inf
data_files:
- split: train
path: mldr_it_len-7000-inf/train-*
- config_name: mldr_ja_len-2000-3000
data_files:
- split: train
path: mldr_ja_len-2000-3000/train-*
- config_name: mldr_ja_len-4000-5000
data_files:
- split: train
path: mldr_ja_len-4000-5000/train-*
- config_name: mldr_ja_len-5000-6000
data_files:
- split: train
path: mldr_ja_len-5000-6000/train-*
- config_name: mldr_ja_len-6000-7000
data_files:
- split: train
path: mldr_ja_len-6000-7000/train-*
- config_name: mldr_ja_len-7000-inf
data_files:
- split: train
path: mldr_ja_len-7000-inf/train-*
- config_name: mldr_ko_len-3000-4000
data_files:
- split: train
path: mldr_ko_len-3000-4000/train-*
- config_name: mldr_ko_len-4000-5000
data_files:
- split: train
path: mldr_ko_len-4000-5000/train-*
- config_name: mldr_ko_len-5000-6000
data_files:
- split: train
path: mldr_ko_len-5000-6000/train-*
- config_name: mldr_ko_len-6000-7000
data_files:
- split: train
path: mldr_ko_len-6000-7000/train-*
- config_name: mldr_ko_len-7000-inf
data_files:
- split: train
path: mldr_ko_len-7000-inf/train-*
- config_name: mldr_pt_len-5000-6000
data_files:
- split: train
path: mldr_pt_len-5000-6000/train-*
- config_name: mldr_pt_len-6000-7000
data_files:
- split: train
path: mldr_pt_len-6000-7000/train-*
- config_name: mldr_pt_len-7000-inf
data_files:
- split: train
path: mldr_pt_len-7000-inf/train-*
- config_name: mldr_ru_len-3000-4000
data_files:
- split: train
path: mldr_ru_len-3000-4000/train-*
- config_name: mldr_ru_len-5000-6000
data_files:
- split: train
path: mldr_ru_len-5000-6000/train-*
- config_name: mldr_ru_len-6000-7000
data_files:
- split: train
path: mldr_ru_len-6000-7000/train-*
- config_name: mldr_ru_len-7000-inf
data_files:
- split: train
path: mldr_ru_len-7000-inf/train-*
- config_name: mldr_th_len-1000-2000
data_files:
- split: train
path: mldr_th_len-1000-2000/train-*
- config_name: mldr_th_len-2000-3000
data_files:
- split: train
path: mldr_th_len-2000-3000/train-*
- config_name: mldr_th_len-3000-4000
data_files:
- split: train
path: mldr_th_len-3000-4000/train-*
- config_name: mldr_th_len-4000-5000
data_files:
- split: train
path: mldr_th_len-4000-5000/train-*
- config_name: mldr_th_len-5000-6000
data_files:
- split: train
path: mldr_th_len-5000-6000/train-*
- config_name: mldr_th_len-6000-7000
data_files:
- split: train
path: mldr_th_len-6000-7000/train-*
- config_name: mldr_th_len-7000-inf
data_files:
- split: train
path: mldr_th_len-7000-inf/train-*
- config_name: mldr_zh_len-1000-2000
data_files:
- split: train
path: mldr_zh_len-1000-2000/train-*
- config_name: mldr_zh_len-2000-3000
data_files:
- split: train
path: mldr_zh_len-2000-3000/train-*
- config_name: mldr_zh_len-3000-4000
data_files:
- split: train
path: mldr_zh_len-3000-4000/train-*
- config_name: mldr_zh_len-4000-5000
data_files:
- split: train
path: mldr_zh_len-4000-5000/train-*
- config_name: mldr_zh_len-5000-6000
data_files:
- split: train
path: mldr_zh_len-5000-6000/train-*
- config_name: mldr_zh_len-6000-7000
data_files:
- split: train
path: mldr_zh_len-6000-7000/train-*
- config_name: mldr_zh_len-7000-inf
data_files:
- split: train
path: mldr_zh_len-7000-inf/train-*
- config_name: mmarco_chinese_len-0-500
data_files:
- split: train
path: mmarco_chinese_len-0-500/train-*
- config_name: mr-tydi_arabic_len-0-500
data_files:
- split: train
path: mr-tydi_arabic_len-0-500/train-*
- config_name: mr-tydi_arabic_len-1000-2000
data_files:
- split: train
path: mr-tydi_arabic_len-1000-2000/train-*
- config_name: mr-tydi_arabic_len-2000-3000
data_files:
- split: train
path: mr-tydi_arabic_len-2000-3000/train-*
- config_name: mr-tydi_arabic_len-3000-4000
data_files:
- split: train
path: mr-tydi_arabic_len-3000-4000/train-*
- config_name: mr-tydi_arabic_len-4000-5000
data_files:
- split: train
path: mr-tydi_arabic_len-4000-5000/train-*
- config_name: mr-tydi_arabic_len-500-1000
data_files:
- split: train
path: mr-tydi_arabic_len-500-1000/train-*
- config_name: mr-tydi_arabic_len-5000-6000
data_files:
- split: train
path: mr-tydi_arabic_len-5000-6000/train-*
- config_name: mr-tydi_arabic_len-6000-7000
data_files:
- split: train
path: mr-tydi_arabic_len-6000-7000/train-*
- config_name: mr-tydi_arabic_len-7000-inf
data_files:
- split: train
path: mr-tydi_arabic_len-7000-inf/train-*
- config_name: mr-tydi_bengali_len-0-500
data_files:
- split: train
path: mr-tydi_bengali_len-0-500/train-*
- config_name: mr-tydi_bengali_len-1000-2000
data_files:
- split: train
path: mr-tydi_bengali_len-1000-2000/train-*
- config_name: mr-tydi_bengali_len-2000-3000
data_files:
- split: train
path: mr-tydi_bengali_len-2000-3000/train-*
- config_name: mr-tydi_bengali_len-3000-4000
data_files:
- split: train
path: mr-tydi_bengali_len-3000-4000/train-*
- config_name: mr-tydi_bengali_len-4000-5000
data_files:
- split: train
path: mr-tydi_bengali_len-4000-5000/train-*
- config_name: mr-tydi_bengali_len-500-1000
data_files:
- split: train
path: mr-tydi_bengali_len-500-1000/train-*
- config_name: mr-tydi_bengali_len-5000-6000
data_files:
- split: train
path: mr-tydi_bengali_len-5000-6000/train-*
- config_name: mr-tydi_bengali_len-6000-7000
data_files:
- split: train
path: mr-tydi_bengali_len-6000-7000/train-*
- config_name: mr-tydi_english_len-0-500
data_files:
- split: train
path: mr-tydi_english_len-0-500/train-*
- config_name: mr-tydi_english_len-1000-2000
data_files:
- split: train
path: mr-tydi_english_len-1000-2000/train-*
- config_name: mr-tydi_english_len-2000-3000
data_files:
- split: train
path: mr-tydi_english_len-2000-3000/train-*
- config_name: mr-tydi_english_len-3000-4000
data_files:
- split: train
path: mr-tydi_english_len-3000-4000/train-*
- config_name: mr-tydi_english_len-4000-5000
data_files:
- split: train
path: mr-tydi_english_len-4000-5000/train-*
- config_name: mr-tydi_english_len-500-1000
data_files:
- split: train
path: mr-tydi_english_len-500-1000/train-*
- config_name: mr-tydi_english_len-5000-6000
data_files:
- split: train
path: mr-tydi_english_len-5000-6000/train-*
- config_name: mr-tydi_finnish_len-0-500
data_files:
- split: train
path: mr-tydi_finnish_len-0-500/train-*
- config_name: mr-tydi_finnish_len-1000-2000
data_files:
- split: train
path: mr-tydi_finnish_len-1000-2000/train-*
- config_name: mr-tydi_finnish_len-2000-3000
data_files:
- split: train
path: mr-tydi_finnish_len-2000-3000/train-*
- config_name: mr-tydi_finnish_len-3000-4000
data_files:
- split: train
path: mr-tydi_finnish_len-3000-4000/train-*
- config_name: mr-tydi_finnish_len-4000-5000
data_files:
- split: train
path: mr-tydi_finnish_len-4000-5000/train-*
- config_name: mr-tydi_finnish_len-500-1000
data_files:
- split: train
path: mr-tydi_finnish_len-500-1000/train-*
- config_name: mr-tydi_finnish_len-6000-7000
data_files:
- split: train
path: mr-tydi_finnish_len-6000-7000/train-*
- config_name: mr-tydi_finnish_len-7000-inf
data_files:
- split: train
path: mr-tydi_finnish_len-7000-inf/train-*
- config_name: mr-tydi_indonesian_len-0-500
data_files:
- split: train
path: mr-tydi_indonesian_len-0-500/train-*
- config_name: mr-tydi_indonesian_len-1000-2000
data_files:
- split: train
path: mr-tydi_indonesian_len-1000-2000/train-*
- config_name: mr-tydi_indonesian_len-2000-3000
data_files:
- split: train
path: mr-tydi_indonesian_len-2000-3000/train-*
- config_name: mr-tydi_indonesian_len-3000-4000
data_files:
- split: train
path: mr-tydi_indonesian_len-3000-4000/train-*
- config_name: mr-tydi_indonesian_len-4000-5000
data_files:
- split: train
path: mr-tydi_indonesian_len-4000-5000/train-*
- config_name: mr-tydi_indonesian_len-500-1000
data_files:
- split: train
path: mr-tydi_indonesian_len-500-1000/train-*
- config_name: mr-tydi_indonesian_len-5000-6000
data_files:
- split: train
path: mr-tydi_indonesian_len-5000-6000/train-*
- config_name: mr-tydi_japanese_len-0-500
data_files:
- split: train
path: mr-tydi_japanese_len-0-500/train-*
- config_name: mr-tydi_japanese_len-1000-2000
data_files:
- split: train
path: mr-tydi_japanese_len-1000-2000/train-*
- config_name: mr-tydi_japanese_len-2000-3000
data_files:
- split: train
path: mr-tydi_japanese_len-2000-3000/train-*
- config_name: mr-tydi_japanese_len-3000-4000
data_files:
- split: train
path: mr-tydi_japanese_len-3000-4000/train-*
- config_name: mr-tydi_japanese_len-4000-5000
data_files:
- split: train
path: mr-tydi_japanese_len-4000-5000/train-*
- config_name: mr-tydi_japanese_len-500-1000
data_files:
- split: train
path: mr-tydi_japanese_len-500-1000/train-*
- config_name: mr-tydi_japanese_len-5000-6000
data_files:
- split: train
path: mr-tydi_japanese_len-5000-6000/train-*
- config_name: mr-tydi_japanese_len-6000-7000
data_files:
- split: train
path: mr-tydi_japanese_len-6000-7000/train-*
- config_name: mr-tydi_korean_len-0-500
data_files:
- split: train
path: mr-tydi_korean_len-0-500/train-*
- config_name: mr-tydi_korean_len-1000-2000
data_files:
- split: train
path: mr-tydi_korean_len-1000-2000/train-*
- config_name: mr-tydi_korean_len-2000-3000
data_files:
- split: train
path: mr-tydi_korean_len-2000-3000/train-*
- config_name: mr-tydi_korean_len-3000-4000
data_files:
- split: train
path: mr-tydi_korean_len-3000-4000/train-*
- config_name: mr-tydi_korean_len-500-1000
data_files:
- split: train
path: mr-tydi_korean_len-500-1000/train-*
- config_name: mr-tydi_korean_len-7000-inf
data_files:
- split: train
path: mr-tydi_korean_len-7000-inf/train-*
- config_name: mr-tydi_russian_len-0-500
data_files:
- split: train
path: mr-tydi_russian_len-0-500/train-*
- config_name: mr-tydi_russian_len-1000-2000
data_files:
- split: train
path: mr-tydi_russian_len-1000-2000/train-*
- config_name: mr-tydi_russian_len-2000-3000
data_files:
- split: train
path: mr-tydi_russian_len-2000-3000/train-*
- config_name: mr-tydi_russian_len-3000-4000
data_files:
- split: train
path: mr-tydi_russian_len-3000-4000/train-*
- config_name: mr-tydi_russian_len-4000-5000
data_files:
- split: train
path: mr-tydi_russian_len-4000-5000/train-*
- config_name: mr-tydi_russian_len-500-1000
data_files:
- split: train
path: mr-tydi_russian_len-500-1000/train-*
- config_name: mr-tydi_russian_len-5000-6000
data_files:
- split: train
path: mr-tydi_russian_len-5000-6000/train-*
- config_name: mr-tydi_russian_len-6000-7000
data_files:
- split: train
path: mr-tydi_russian_len-6000-7000/train-*
- config_name: mr-tydi_russian_len-7000-inf
data_files:
- split: train
path: mr-tydi_russian_len-7000-inf/train-*
- config_name: mr-tydi_swahili_len-0-500
data_files:
- split: train
path: mr-tydi_swahili_len-0-500/train-*
- config_name: mr-tydi_swahili_len-1000-2000
data_files:
- split: train
path: mr-tydi_swahili_len-1000-2000/train-*
- config_name: mr-tydi_swahili_len-2000-3000
data_files:
- split: train
path: mr-tydi_swahili_len-2000-3000/train-*
- config_name: mr-tydi_swahili_len-3000-4000
data_files:
- split: train
path: mr-tydi_swahili_len-3000-4000/train-*
- config_name: mr-tydi_swahili_len-500-1000
data_files:
- split: train
path: mr-tydi_swahili_len-500-1000/train-*
- config_name: mr-tydi_telugu_len-0-500
data_files:
- split: train
path: mr-tydi_telugu_len-0-500/train-*
- config_name: mr-tydi_telugu_len-1000-2000
data_files:
- split: train
path: mr-tydi_telugu_len-1000-2000/train-*
- config_name: mr-tydi_telugu_len-2000-3000
data_files:
- split: train
path: mr-tydi_telugu_len-2000-3000/train-*
- config_name: mr-tydi_telugu_len-3000-4000
data_files:
- split: train
path: mr-tydi_telugu_len-3000-4000/train-*
- config_name: mr-tydi_telugu_len-4000-5000
data_files:
- split: train
path: mr-tydi_telugu_len-4000-5000/train-*
- config_name: mr-tydi_telugu_len-500-1000
data_files:
- split: train
path: mr-tydi_telugu_len-500-1000/train-*
- config_name: mr-tydi_telugu_len-5000-6000
data_files:
- split: train
path: mr-tydi_telugu_len-5000-6000/train-*
- config_name: mr-tydi_thai_len-0-500
data_files:
- split: train
path: mr-tydi_thai_len-0-500/train-*
- config_name: mr-tydi_thai_len-1000-2000
data_files:
- split: train
path: mr-tydi_thai_len-1000-2000/train-*
- config_name: mr-tydi_thai_len-2000-3000
data_files:
- split: train
path: mr-tydi_thai_len-2000-3000/train-*
- config_name: mr-tydi_thai_len-3000-4000
data_files:
- split: train
path: mr-tydi_thai_len-3000-4000/train-*
- config_name: mr-tydi_thai_len-4000-5000
data_files:
- split: train
path: mr-tydi_thai_len-4000-5000/train-*
- config_name: mr-tydi_thai_len-500-1000
data_files:
- split: train
path: mr-tydi_thai_len-500-1000/train-*
- config_name: mr-tydi_thai_len-5000-6000
data_files:
- split: train
path: mr-tydi_thai_len-5000-6000/train-*
- config_name: msmarco_len-0-500
data_files:
- split: train
path: msmarco_len-0-500/train-*
- config_name: msmarco_len-1000-2000
data_files:
- split: train
path: msmarco_len-1000-2000/train-*
- config_name: msmarco_len-2000-3000
data_files:
- split: train
path: msmarco_len-2000-3000/train-*
- config_name: msmarco_len-3000-4000
data_files:
- split: train
path: msmarco_len-3000-4000/train-*
- config_name: msmarco_len-4000-5000
data_files:
- split: train
path: msmarco_len-4000-5000/train-*
- config_name: msmarco_len-500-1000
data_files:
- split: train
path: msmarco_len-500-1000/train-*
- config_name: msmarco_len-5000-6000
data_files:
- split: train
path: msmarco_len-5000-6000/train-*
- config_name: msmarco_len-6000-7000
data_files:
- split: train
path: msmarco_len-6000-7000/train-*
- config_name: msmarco_len-7000-inf
data_files:
- split: train
path: msmarco_len-7000-inf/train-*
- config_name: nli_for_simcse_len-0-500
data_files:
- split: train
path: nli_for_simcse_len-0-500/train-*
- config_name: nq_len-0-500
data_files:
- split: train
path: nq_len-0-500/train-*
- config_name: nq_len-500-1000
data_files:
- split: train
path: nq_len-500-1000/train-*
- config_name: pubmed_qa_labeled_len-0-500
data_files:
- split: train
path: pubmed_qa_labeled_len-0-500/train-*
- config_name: squad_len-0-500
data_files:
- split: train
path: squad_len-0-500/train-*
- config_name: squad_len-500-1000
data_files:
- split: train
path: squad_len-500-1000/train-*
- config_name: t2ranking_len-0-500
data_files:
- split: train
path: t2ranking_len-0-500/train-*
- config_name: t2ranking_len-500-1000
data_files:
- split: train
path: t2ranking_len-500-1000/train-*
- config_name: trivia_len-0-500
data_files:
- split: train
path: trivia_len-0-500/train-*
- config_name: trivia_len-1000-2000
data_files:
- split: train
path: trivia_len-1000-2000/train-*
- config_name: trivia_len-500-1000
data_files:
- split: train
path: trivia_len-500-1000/train-*
---
# hotchpotch/bge-m3-data-finetune-unified
Mirror of the original `Shitao/bge-m3-data` corpus, repackaged from gzip-compressed JSONL shards into Hugging Face Dataset configs stored as Parquet. Content is unchanged; this repo only standardizes the storage format and centralizes all length buckets under one dataset namespace.
Each query comes with up to seven hard negative texts, as in the original bge-m3 training data, so the dataset can be used directly to fine-tune retrieval models or other search systems.
## Provenance and conversion
- Source: [Shitao/bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data), which hosts the official fine-tuning data for the BGE-M3 model.
- Formatting changes: Only the container format changed (JSONL → HF Dataset → Parquet on the Hub). Field names and values are identical to the source.
## Schema
Each config keeps the original triplet fields:
- `query`: string
- `pos`: list of positive passages (strings)
- `neg`: list of negative passages (strings)
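Many contrastive training loops consume flat `(query, positive, negative)` tuples rather than list-valued fields. The following is a minimal sketch of one way to expand a row of this schema into such tuples. The helper name `iter_triplets`, its `max_negs` parameter, and the sample row are hypothetical and not part of this dataset or of any library.

```python
from itertools import product
from typing import Dict, Iterator, List, Optional, Tuple, Union

# A row as described by the schema: one query string plus two lists of strings.
Row = Dict[str, Union[str, List[str]]]


def iter_triplets(row: Row, max_negs: Optional[int] = None) -> Iterator[Tuple[str, str, str]]:
    """Yield one (query, positive, negative) tuple per pos/neg pair in a row.

    `max_negs` (hypothetical parameter) keeps only the first N negatives;
    None keeps all of them.
    """
    negatives = row["neg"] if max_negs is None else row["neg"][:max_negs]
    for pos, neg in product(row["pos"], negatives):
        yield row["query"], pos, neg


# Made-up row that only mimics the schema; it is not taken from the dataset.
example_row: Row = {
    "query": "example query",
    "pos": ["relevant passage"],
    "neg": ["unrelated passage A", "unrelated passage B", "unrelated passage C"],
}

triplets = list(iter_triplets(example_row, max_negs=2))
for triplet in triplets:
    print(triplet)
```

A row with an empty `pos` or `neg` list yields no tuples, so whether such rows need separate handling depends on the training setup.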
## Using the dataset
```python
from datasets import load_dataset

# Load one config (one source dataset plus one length bucket).
ds = load_dataset("hotchpotch/bge-m3-data-finetune-unified", "miracl_en_len-0-500")
row = ds["train"][0]

def preview(text, limit=50):
    """Collapse newlines and truncate to `limit` characters for display."""
    flat = text.replace("\n", " ")
    return flat[:limit] + ("..." if len(flat) > limit else "")

print("query:", row["query"])
print("pos[0]:", preview(row["pos"][0]))
for i, n in enumerate(row["neg"][:3]):  # show first 3 negatives
    print(f"neg[{i}]:", preview(n))
```
Example output:
```
query: When was quantum field theory developed?
pos[0]: History of quantum field theory The third thread i...
neg[0]: AdS/CFT correspondence In quantum field theory, on...
neg[1]: Condensed matter physics The Sommerfeld model and ...
neg[2]: Quantum configuration space In quantum field theor...
```
# Original Dataset Summary
This repository contains all the fine-tuning data for the [bge-m3](https://huggingface.co/BAAI/bge-m3) model, including:
| Dataset | Language |
| --------------- | :----------: |
| MS MARCO | English |
| NQ | English |
| HotpotQA | English |
| TriviaQA | English |
| SQuAD | English |
| COLIEE | English |
| PubMedQA | English |
| NLI from SimCSE | English |
| DuReader | Chinese |
| mMARCO-zh | Chinese |
| T2Ranking | Chinese |
| Law-GPT | Chinese |
| cMedQAv2 | Chinese |
| NLI-zh | Chinese |
| LeCaRDv2 | Chinese |
| Mr.TyDi | 11 languages |
| MIRACL | 16 languages |
| MLDR | 13 languages |
Note: The MLDR dataset here is the processed `train` set of the [MLDR dataset](https://huggingface.co/datasets/Shitao/MLDR).
For more details, please refer to our [paper](https://arxiv.org/pdf/2402.03216.pdf).
# Dataset Structure
In the original repository, each dataset has been split into multiple files according to the tokenized length of the text, measured with the bge-m3 tokenizer (i.e. the tokenizer of [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)). For example, the MS MARCO dataset has been split into 8 files: `msmarco_len-0-500.jsonl`, `msmarco_len-500-1000.jsonl`, ..., `msmarco_len-6000-7000.jsonl`, `msmarco_len-7000-inf.jsonl`. All the files are in the `jsonl` format, and each line of a file is a JSON object with the following shape:
```python
{"query": str, "pos": List[str], "neg":List[str]}
```
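Because the length bucket is encoded in each file or config name, it can be handy to parse names such as `msmarco_len-500-1000` into a source and numeric bounds, for instance to pick only buckets below a maximum sequence length. The sketch below assumes the `<source>_len-<low>-<high>` pattern seen in the names above; `Bucket`, `parse_config_name`, and `select_configs` are hypothetical helpers, and whether the bounds are inclusive is not stated in the source.

```python
import math
import re
from typing import Iterable, List, NamedTuple


class Bucket(NamedTuple):
    source: str   # e.g. "msmarco" or "mr-tydi_arabic"
    low: int      # lower token-length bound
    high: float   # upper bound; math.inf for the open-ended "inf" bucket


# Assumed naming pattern: <source>_len-<low>-<high>, where <high> may be "inf".
_NAME_RE = re.compile(r"^(?P<source>.+)_len-(?P<low>\d+)-(?P<high>\d+|inf)$")


def parse_config_name(name: str) -> Bucket:
    """Split a config name into its source and numeric length bounds."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected config name: {name!r}")
    high = math.inf if match["high"] == "inf" else int(match["high"])
    return Bucket(match["source"], int(match["low"]), high)


def select_configs(names: Iterable[str], source: str, max_len: float) -> List[str]:
    """Return configs of `source` whose upper bound is at most `max_len`,
    sorted by lower bound (numerically, not alphabetically)."""
    picked = []
    for name in names:
        bucket = parse_config_name(name)
        if bucket.source == source and bucket.high <= max_len:
            picked.append((bucket.low, name))
    return [name for _, name in sorted(picked)]


names = [
    "msmarco_len-0-500",
    "msmarco_len-1000-2000",
    "msmarco_len-500-1000",
    "msmarco_len-7000-inf",
    "nq_len-0-500",
]
print(select_configs(names, "msmarco", max_len=2000))
```

Sorting on the parsed lower bound avoids the alphabetical ordering of the raw names, where `500-1000` lands after `4000-5000`.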
# Citation Information
```
@misc{bge-m3,
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
year={2024},
eprint={2402.03216},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Provider: hotchpotch



