
hotchpotch/bge-m3-data-finetune-unified

Source: Hugging Face | Updated: 2025-12-18 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/hotchpotch/bge-m3-data-finetune-unified
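The config names in this dataset's metadata appear to follow the pattern `<source>_len-<min>-<max>`, where `<max>` may be the literal `inf` (for example `dureader_len-7000-inf`). The sketch below splits such a name into its parts and shows one way to load a single config. The names `parse_config_name`, `LengthBucket`, and `load_bucket` are hypothetical helpers made up for this illustration. The metadata does not say what unit the length range uses, so the code treats the bounds as plain integers. `load_bucket` assumes the `datasets` library is installed and the Hub (or a mirror) is reachable.

```python
import re
from typing import NamedTuple, Optional

REPO_ID = "hotchpotch/bge-m3-data-finetune-unified"

# Assumed naming pattern, inferred from the config list in the metadata:
# "<source>_len-<min>-<max>", with <max> either an integer or "inf".
_CONFIG_RE = re.compile(r"^(?P<source>.+)_len-(?P<lo>\d+)-(?P<hi>\d+|inf)$")


class LengthBucket(NamedTuple):
    source: str             # e.g. "miracl_ar", "dureader", "QBQTC_v2"
    min_len: int            # lower bound of the length range
    max_len: Optional[int]  # upper bound; None stands for "inf"


def parse_config_name(name: str) -> LengthBucket:
    """Split a config name such as 'miracl_ar_len-1000-2000' into its parts."""
    match = _CONFIG_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected config name: {name!r}")
    hi = match.group("hi")
    return LengthBucket(
        source=match.group("source"),
        min_len=int(match.group("lo")),
        max_len=None if hi == "inf" else int(hi),
    )


def load_bucket(config_name: str):
    """Load the train split of one config (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so parsing works offline

    return load_dataset(REPO_ID, config_name, split="train")


if __name__ == "__main__":
    for name in ("ATEC_len-0-500", "dureader_len-7000-inf", "miracl_ar_len-1000-2000"):
        print(name, "->", parse_config_name(name))
```

According to the metadata, every config has the fields `query` (string), `pos`, and `neg` (lists of strings). The `dureader_*` and `miracl_*` configs also carry `pos_scores` and `neg_scores` (lists of float64).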
Resource description:
--- dataset_info: - config_name: ATEC_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 5970006 num_examples: 11325 download_size: 2923950 dataset_size: 5970006 - config_name: BQ_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 6220822 num_examples: 12599 download_size: 1239675 dataset_size: 6220822 - config_name: LCQMC_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 4484444 num_examples: 10000 download_size: 2878253 dataset_size: 4484444 - config_name: PAWSX_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 11027815 num_examples: 10000 download_size: 7932138 dataset_size: 11027815 - config_name: QBQTC_v2_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 7874062 num_examples: 10000 download_size: 5521390 dataset_size: 7874062 - config_name: STS-B_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 173179 num_examples: 249 download_size: 106741 dataset_size: 173179 - config_name: afqmc_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 5575132 num_examples: 10534 download_size: 2486919 dataset_size: 5575132 - config_name: cMedQAv2_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 1506313240 num_examples: 50000 download_size: 526296863 dataset_size: 1506313240 - config_name: colliee_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 5782070 
num_examples: 463 download_size: 267643 dataset_size: 5782070 - config_name: dureader_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 433283844 num_examples: 35172 download_size: 286918744 dataset_size: 433283844 - config_name: dureader_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 275880594 num_examples: 13545 download_size: 176219623 dataset_size: 275880594 - config_name: dureader_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 60711238 num_examples: 2344 download_size: 38473060 dataset_size: 60711238 - config_name: dureader_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 19261817 num_examples: 643 download_size: 11863758 dataset_size: 19261817 - config_name: dureader_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 7228271 num_examples: 198 download_size: 4153618 dataset_size: 7228271 - config_name: dureader_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 447803437 num_examples: 28124 download_size: 291713807 dataset_size: 447803437 - config_name: dureader_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: 
pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 5375932 num_examples: 140 download_size: 3009507 dataset_size: 5375932 - config_name: dureader_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 3174783 num_examples: 68 download_size: 1601924 dataset_size: 3174783 - config_name: dureader_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 11348713 num_examples: 182 download_size: 6203652 dataset_size: 11348713 - config_name: hotpotqa_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 747412404 num_examples: 84228 download_size: 443806134 dataset_size: 747412404 - config_name: hotpotqa_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 3454175 num_examples: 288 download_size: 1945372 dataset_size: 3454175 - config_name: law_gpt_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 17139477 num_examples: 500 download_size: 5342913 dataset_size: 17139477 - config_name: lecardv2_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 181165743 num_examples: 591 download_size: 83120086 dataset_size: 181165743 - config_name: miracl_ar_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 33660840 num_examples: 422 download_size: 16063554 dataset_size: 33660840 - config_name: 
miracl_ar_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 96911593 num_examples: 885 download_size: 47441929 dataset_size: 96911593 - config_name: miracl_ar_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 13308541 num_examples: 111 download_size: 6591267 dataset_size: 13308541 - config_name: miracl_ar_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 3925681 num_examples: 33 download_size: 1947250 dataset_size: 3925681 - config_name: miracl_ar_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1588053 num_examples: 13 download_size: 683473 dataset_size: 1588053 - config_name: miracl_ar_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 194519986 num_examples: 2012 download_size: 94534205 dataset_size: 194519986 - config_name: miracl_ar_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1101307 num_examples: 8 download_size: 479455 dataset_size: 1101307 - config_name: miracl_ar_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 
384386 num_examples: 3 download_size: 150461 dataset_size: 384386 - config_name: miracl_ar_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1356440 num_examples: 8 download_size: 603263 dataset_size: 1356440 - config_name: miracl_bn_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 13751152 num_examples: 98 download_size: 4991617 dataset_size: 13751152 - config_name: miracl_bn_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 81634211 num_examples: 451 download_size: 30221169 dataset_size: 81634211 - config_name: miracl_bn_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 6226270 num_examples: 34 download_size: 2363654 dataset_size: 6226270 - config_name: miracl_bn_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 4077505 num_examples: 20 download_size: 1493164 dataset_size: 4077505 - config_name: miracl_bn_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 3006630 num_examples: 13 download_size: 1009933 dataset_size: 3006630 - config_name: miracl_bn_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 
- name: neg_scores list: float64 splits: - name: train num_bytes: 167716118 num_examples: 1008 download_size: 61905762 dataset_size: 167716118 - config_name: miracl_bn_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1391958 num_examples: 7 download_size: 415031 dataset_size: 1391958 - config_name: miracl_en_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 73666474 num_examples: 1193 download_size: 41337112 dataset_size: 73666474 - config_name: miracl_en_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 10297963 num_examples: 128 download_size: 5831367 dataset_size: 10297963 - config_name: miracl_en_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 270694 num_examples: 3 download_size: 164323 dataset_size: 270694 - config_name: miracl_en_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 212426 num_examples: 2 download_size: 128411 dataset_size: 212426 - config_name: miracl_en_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 111593378 num_examples: 1537 download_size: 63131660 dataset_size: 111593378 - config_name: miracl_es_len-0-500 features: - name: query dtype: string - name: 
pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 53995069 num_examples: 856 download_size: 30753052 dataset_size: 53995069 - config_name: miracl_es_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 9690352 num_examples: 120 download_size: 5667781 dataset_size: 9690352 - config_name: miracl_es_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1063917 num_examples: 12 download_size: 536712 dataset_size: 1063917 - config_name: miracl_es_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 189936 num_examples: 2 download_size: 127782 dataset_size: 189936 - config_name: miracl_es_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 237177 num_examples: 3 download_size: 134722 dataset_size: 237177 - config_name: miracl_es_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 85709592 num_examples: 1169 download_size: 49811576 dataset_size: 85709592 - config_name: miracl_fa_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 45973561 num_examples: 602 download_size: 20800550 dataset_size: 45973561 - 
config_name: miracl_fa_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 25076173 num_examples: 228 download_size: 11695842 dataset_size: 25076173 - config_name: miracl_fa_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1877900 num_examples: 16 download_size: 866171 dataset_size: 1877900 - config_name: miracl_fa_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 670773 num_examples: 5 download_size: 290935 dataset_size: 670773 - config_name: miracl_fa_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 123323855 num_examples: 1255 download_size: 57085616 dataset_size: 123323855 - config_name: miracl_fa_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 200342 num_examples: 1 download_size: 114077 dataset_size: 200342 - config_name: miracl_fi_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 113069955 num_examples: 2098 download_size: 67054506 dataset_size: 113069955 - config_name: miracl_fi_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train 
num_bytes: 2258878 num_examples: 34 download_size: 1307244 dataset_size: 2258878 - config_name: miracl_fi_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 337147 num_examples: 5 download_size: 206650 dataset_size: 337147 - config_name: miracl_fi_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 47004520 num_examples: 760 download_size: 28059086 dataset_size: 47004520 - config_name: miracl_fr_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 22609769 num_examples: 404 download_size: 12682212 dataset_size: 22609769 - config_name: miracl_fr_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 4843877 num_examples: 68 download_size: 2750528 dataset_size: 4843877 - config_name: miracl_fr_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 315099 num_examples: 4 download_size: 193193 dataset_size: 315099 - config_name: miracl_fr_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 44881167 num_examples: 667 download_size: 25679983 dataset_size: 44881167 - config_name: miracl_hi_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores 
list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 11923549 num_examples: 89 download_size: 4441449 dataset_size: 11923549 - config_name: miracl_hi_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 48056784 num_examples: 259 download_size: 18256392 dataset_size: 48056784 - config_name: miracl_hi_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 5337208 num_examples: 28 download_size: 2018607 dataset_size: 5337208 - config_name: miracl_hi_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1736752 num_examples: 8 download_size: 628984 dataset_size: 1736752 - config_name: miracl_hi_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1166966 num_examples: 6 download_size: 410643 dataset_size: 1166966 - config_name: miracl_hi_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 126642415 num_examples: 775 download_size: 47580319 dataset_size: 126642415 - config_name: miracl_hi_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 196066 num_examples: 1 download_size: 91078 dataset_size: 196066 - config_name: miracl_hi_len-7000-inf features: - name: query dtype: 
string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 784493 num_examples: 3 download_size: 264962 dataset_size: 784493 - config_name: miracl_id_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 129680104 num_examples: 2055 download_size: 70809101 dataset_size: 129680104 - config_name: miracl_id_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 21065645 num_examples: 275 download_size: 11603564 dataset_size: 21065645 - config_name: miracl_id_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1756259 num_examples: 20 download_size: 896099 dataset_size: 1756259 - config_name: miracl_id_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 87990 num_examples: 1 download_size: 59630 dataset_size: 87990 - config_name: miracl_id_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 122212431 num_examples: 1720 download_size: 67569192 dataset_size: 122212431 - config_name: miracl_ja_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 89519591 num_examples: 1478 download_size: 50298380 dataset_size: 
89519591 - config_name: miracl_ja_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 15071822 num_examples: 191 download_size: 8401966 dataset_size: 15071822 - config_name: miracl_ja_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 411359 num_examples: 5 download_size: 241606 dataset_size: 411359 - config_name: miracl_ja_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 390260 num_examples: 5 download_size: 216049 dataset_size: 390260 - config_name: miracl_ja_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 433001 num_examples: 5 download_size: 228832 dataset_size: 433001 - config_name: miracl_ja_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 126912129 num_examples: 1790 download_size: 71428154 dataset_size: 126912129 - config_name: miracl_ja_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 105225 num_examples: 1 download_size: 72144 dataset_size: 105225 - config_name: miracl_ja_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train 
num_bytes: 245928 num_examples: 2 download_size: 116153 dataset_size: 245928 - config_name: miracl_ko_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 12370241 num_examples: 211 download_size: 7011751 dataset_size: 12370241 - config_name: miracl_ko_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 8210966 num_examples: 106 download_size: 4682078 dataset_size: 8210966 - config_name: miracl_ko_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 569626 num_examples: 7 download_size: 328405 dataset_size: 569626 - config_name: miracl_ko_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 112087 num_examples: 1 download_size: 73220 dataset_size: 112087 - config_name: miracl_ko_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 38044978 num_examples: 541 download_size: 21827044 dataset_size: 38044978 - config_name: miracl_ko_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 82212 num_examples: 1 download_size: 48315 dataset_size: 82212 - config_name: miracl_ko_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 
- name: neg_scores list: float64 splits: - name: train num_bytes: 148931 num_examples: 1 download_size: 100090 dataset_size: 148931 - config_name: miracl_ru_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 121766274 num_examples: 1255 download_size: 59038249 dataset_size: 121766274 - config_name: miracl_ru_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 54376767 num_examples: 416 download_size: 26830875 dataset_size: 54376767 - config_name: miracl_ru_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 3231796 num_examples: 23 download_size: 1604726 dataset_size: 3231796 - config_name: miracl_ru_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 432793 num_examples: 3 download_size: 208938 dataset_size: 432793 - config_name: miracl_ru_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 415602 num_examples: 3 download_size: 212040 dataset_size: 415602 - config_name: miracl_ru_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 347133903 num_examples: 2982 download_size: 171233519 dataset_size: 347133903 - config_name: miracl_ru_len-7000-inf features: - name: query dtype: string - name: 
pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 143313 num_examples: 1 download_size: 87030 dataset_size: 143313 - config_name: miracl_sw_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 38111232 num_examples: 1129 download_size: 20999606 dataset_size: 38111232 - config_name: miracl_sw_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 7478214 num_examples: 132 download_size: 4261592 dataset_size: 7478214 - config_name: miracl_sw_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1554004 num_examples: 35 download_size: 444883 dataset_size: 1554004 - config_name: miracl_sw_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 28309175 num_examples: 605 download_size: 16278061 dataset_size: 28309175 - config_name: miracl_te_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 2318352 num_examples: 18 download_size: 831887 dataset_size: 2318352 - config_name: miracl_te_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 512875325 num_examples: 2349 download_size: 158941642 dataset_size: 512875325 - 
config_name: miracl_te_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 39530039 num_examples: 166 download_size: 13176546 dataset_size: 39530039 - config_name: miracl_te_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 7128577 num_examples: 30 download_size: 2550785 dataset_size: 7128577 - config_name: miracl_te_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 3843812 num_examples: 16 download_size: 1514525 dataset_size: 3843812 - config_name: miracl_te_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 159402673 num_examples: 862 download_size: 61311898 dataset_size: 159402673 - config_name: miracl_te_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 2974515 num_examples: 11 download_size: 1083282 dataset_size: 2974515 - config_name: miracl_th_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 128795901 num_examples: 933 download_size: 47523326 dataset_size: 128795901 - config_name: miracl_th_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train 
num_bytes: 50999684 num_examples: 302 download_size: 19011281 dataset_size: 50999684 - config_name: miracl_th_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 12976578 num_examples: 68 download_size: 4802677 dataset_size: 12976578 - config_name: miracl_th_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 3559821 num_examples: 15 download_size: 1304327 dataset_size: 3559821 - config_name: miracl_th_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 3826991 num_examples: 20 download_size: 1315481 dataset_size: 3826991 - config_name: miracl_th_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 261969342 num_examples: 1634 download_size: 97305393 dataset_size: 261969342 - config_name: miracl_zh_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 32330558 num_examples: 642 download_size: 20739700 dataset_size: 32330558 - config_name: miracl_zh_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 4329136 num_examples: 67 download_size: 2779207 dataset_size: 4329136 - config_name: miracl_zh_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - 
name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 189128 num_examples: 3 download_size: 137739 dataset_size: 189128 - config_name: miracl_zh_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 35265023 num_examples: 600 download_size: 23012711 dataset_size: 35265023 - config_name: mldr_ar_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 893721 num_examples: 4 download_size: 407342 dataset_size: 893721 - config_name: mldr_ar_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 373320 num_examples: 2 download_size: 152026 dataset_size: 373320 - config_name: mldr_ar_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 27924175 num_examples: 91 download_size: 13240216 dataset_size: 27924175 - config_name: mldr_ar_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 882182624 num_examples: 1720 download_size: 421130520 dataset_size: 882182624 - config_name: mldr_de_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 109333 num_examples: 1 download_size: 74254 dataset_size: 109333 - config_name: mldr_de_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 9534182 num_examples: 65 download_size: 5463891 dataset_size: 9534182 - config_name: mldr_de_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 513339675 
num_examples: 1781 download_size: 291925417 dataset_size: 513339675 - config_name: mldr_en_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 834714 num_examples: 6 download_size: 450540 dataset_size: 834714 - config_name: mldr_en_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 6545614 num_examples: 38 download_size: 3650194 dataset_size: 6545614 - config_name: mldr_en_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 25274901 num_examples: 130 download_size: 14419449 dataset_size: 25274901 - config_name: mldr_en_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 61163372 num_examples: 280 download_size: 34796244 dataset_size: 61163372 - config_name: mldr_en_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 169378465 num_examples: 695 download_size: 96133520 dataset_size: 169378465 - config_name: mldr_en_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 3232103418 num_examples: 8851 download_size: 1799713395 dataset_size: 3232103418 - config_name: mldr_es_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 527480 num_examples: 3 download_size: 308917 dataset_size: 527480 - config_name: mldr_es_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 379021 num_examples: 2 download_size: 233224 dataset_size: 379021 - config_name: mldr_es_len-6000-7000 features: - name: query dtype: string - name: pos list: 
string - name: neg list: string splits: - name: train num_bytes: 31986844 num_examples: 123 download_size: 18895987 dataset_size: 31986844 - config_name: mldr_es_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 770022188 num_examples: 2126 download_size: 449536327 dataset_size: 770022188 - config_name: mldr_fr_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 163375 num_examples: 1 download_size: 98688 dataset_size: 163375 - config_name: mldr_fr_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 322229 num_examples: 2 download_size: 176449 dataset_size: 322229 - config_name: mldr_fr_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 13065037 num_examples: 59 download_size: 7565693 dataset_size: 13065037 - config_name: mldr_fr_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 519513278 num_examples: 1546 download_size: 298756787 dataset_size: 519513278 - config_name: mldr_hi_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 4262782 num_examples: 10 download_size: 1587532 dataset_size: 4262782 - config_name: mldr_hi_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 53851863 num_examples: 102 download_size: 20033055 dataset_size: 53851863 - config_name: mldr_hi_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 1144693246 num_examples: 1506 download_size: 422841377 dataset_size: 1144693246 - config_name: 
mldr_it_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 427280 num_examples: 2 download_size: 247914 dataset_size: 427280 - config_name: mldr_it_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 10368973 num_examples: 40 download_size: 6087589 dataset_size: 10368973 - config_name: mldr_it_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 809724805 num_examples: 2109 download_size: 480868853 dataset_size: 809724805 - config_name: mldr_ja_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 148370 num_examples: 1 download_size: 93088 dataset_size: 148370 - config_name: mldr_ja_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 239548 num_examples: 1 download_size: 133215 dataset_size: 239548 - config_name: mldr_ja_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 1459218 num_examples: 6 download_size: 731686 dataset_size: 1459218 - config_name: mldr_ja_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 28534447 num_examples: 105 download_size: 16169058 dataset_size: 28534447 - config_name: mldr_ja_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 873780430 num_examples: 2149 download_size: 485339363 dataset_size: 873780430 - config_name: mldr_ko_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 630831 num_examples: 4 
download_size: 358470 dataset_size: 630831 - config_name: mldr_ko_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 1277357 num_examples: 7 download_size: 712850 dataset_size: 1277357 - config_name: mldr_ko_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 201507 num_examples: 1 download_size: 118101 dataset_size: 201507 - config_name: mldr_ko_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 17287246 num_examples: 77 download_size: 9805654 dataset_size: 17287246 - config_name: mldr_ko_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 710658200 num_examples: 2109 download_size: 401885314 dataset_size: 710658200 - config_name: mldr_pt_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 688801 num_examples: 3 download_size: 382542 dataset_size: 688801 - config_name: mldr_pt_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 7710579 num_examples: 30 download_size: 4486709 dataset_size: 7710579 - config_name: mldr_pt_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 726184130 num_examples: 1812 download_size: 424739418 dataset_size: 726184130 - config_name: mldr_ru_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 269667 num_examples: 1 download_size: 139605 dataset_size: 269667 - config_name: mldr_ru_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string 
splits: - name: train num_bytes: 6549245 num_examples: 17 download_size: 3154413 dataset_size: 6549245 - config_name: mldr_ru_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 54668077 num_examples: 124 download_size: 26453691 dataset_size: 54668077 - config_name: mldr_ru_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 1068869887 num_examples: 1722 download_size: 513909995 dataset_size: 1068869887 - config_name: mldr_th_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 345041 num_examples: 4 download_size: 134956 dataset_size: 345041 - config_name: mldr_th_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 542343 num_examples: 5 download_size: 199138 dataset_size: 542343 - config_name: mldr_th_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 918047 num_examples: 8 download_size: 355232 dataset_size: 918047 - config_name: mldr_th_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 444431 num_examples: 4 download_size: 173604 dataset_size: 444431 - config_name: mldr_th_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 153711 num_examples: 1 download_size: 67370 dataset_size: 153711 - config_name: mldr_th_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 31638355 num_examples: 137 download_size: 11728640 dataset_size: 31638355 - config_name: mldr_th_len-7000-inf features: - name: query dtype: string - 
name: pos list: string - name: neg list: string splits: - name: train num_bytes: 823617348 num_examples: 1811 download_size: 313841308 dataset_size: 823617348 - config_name: mldr_zh_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 90373 num_examples: 1 download_size: 54737 dataset_size: 90373 - config_name: mldr_zh_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 105050 num_examples: 1 download_size: 47309 dataset_size: 105050 - config_name: mldr_zh_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 1221668 num_examples: 7 download_size: 604043 dataset_size: 1221668 - config_name: mldr_zh_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 8970561 num_examples: 44 download_size: 5562759 dataset_size: 8970561 - config_name: mldr_zh_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 34281928 num_examples: 137 download_size: 21377471 dataset_size: 34281928 - config_name: mldr_zh_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 155153682 num_examples: 550 download_size: 95006381 dataset_size: 155153682 - config_name: mldr_zh_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 3814797659 num_examples: 9260 download_size: 2356033496 dataset_size: 3814797659 - config_name: mmarco_chinese_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 514566583 
num_examples: 100000 download_size: 325460355 dataset_size: 514566583 - config_name: mr-tydi_arabic_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 109831291 num_examples: 1400 download_size: 52934924 dataset_size: 109831291 - config_name: mr-tydi_arabic_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 364612496 num_examples: 3334 download_size: 179663040 dataset_size: 364612496 - config_name: mr-tydi_arabic_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 54296933 num_examples: 437 download_size: 26595161 dataset_size: 54296933 - config_name: mr-tydi_arabic_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 17810570 num_examples: 139 download_size: 8855937 dataset_size: 17810570 - config_name: mr-tydi_arabic_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 4293037 num_examples: 36 download_size: 2010238 dataset_size: 4293037 - config_name: mr-tydi_arabic_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 668941248 num_examples: 6994 download_size: 326870902 dataset_size: 668941248 - config_name: mr-tydi_arabic_len-5000-6000 features: - name: query dtype: string - name: pos 
list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1200109 num_examples: 9 download_size: 470739 dataset_size: 1200109 - config_name: mr-tydi_arabic_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1663811 num_examples: 13 download_size: 428516 dataset_size: 1663811 - config_name: mr-tydi_arabic_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 2143387 num_examples: 15 download_size: 895954 dataset_size: 2143387 - config_name: mr-tydi_bengali_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 15178521 num_examples: 112 download_size: 5546971 dataset_size: 15178521 - config_name: mr-tydi_bengali_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 71166136 num_examples: 396 download_size: 26453230 dataset_size: 71166136 - config_name: mr-tydi_bengali_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 7469245 num_examples: 41 download_size: 2803707 dataset_size: 7469245 - config_name: mr-tydi_bengali_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 2375946 num_examples: 12 download_size: 863301 
dataset_size: 2375946 - config_name: mr-tydi_bengali_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1981432 num_examples: 9 download_size: 680271 dataset_size: 1981432 - config_name: mr-tydi_bengali_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 182866914 num_examples: 1131 download_size: 67762515 dataset_size: 182866914 - config_name: mr-tydi_bengali_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1502452 num_examples: 8 download_size: 420489 dataset_size: 1502452 - config_name: mr-tydi_bengali_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 925581 num_examples: 4 download_size: 287677 dataset_size: 925581 - config_name: mr-tydi_english_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 88083413 num_examples: 1452 download_size: 49537360 dataset_size: 88083413 - config_name: mr-tydi_english_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 15769495 num_examples: 206 download_size: 8959983 dataset_size: 15769495 - config_name: mr-tydi_english_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: 
float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1166656 num_examples: 15 download_size: 628836 dataset_size: 1166656 - config_name: mr-tydi_english_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 526285 num_examples: 5 download_size: 302625 dataset_size: 526285 - config_name: mr-tydi_english_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 341130 num_examples: 4 download_size: 193671 dataset_size: 341130 - config_name: mr-tydi_english_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 131725671 num_examples: 1864 download_size: 74886308 dataset_size: 131725671 - config_name: mr-tydi_english_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 118571 num_examples: 1 download_size: 78091 dataset_size: 118571 - config_name: mr-tydi_finnish_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 223439402 num_examples: 4152 download_size: 132585080 dataset_size: 223439402 - config_name: mr-tydi_finnish_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 16396022 num_examples: 264 download_size: 9741476 dataset_size: 16396022 - config_name: mr-tydi_finnish_len-2000-3000 
features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 5242522 num_examples: 84 download_size: 3052996 dataset_size: 5242522 - config_name: mr-tydi_finnish_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 139342 num_examples: 2 download_size: 86591 dataset_size: 139342 - config_name: mr-tydi_finnish_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 116902 num_examples: 2 download_size: 69754 dataset_size: 116902 - config_name: mr-tydi_finnish_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 124364870 num_examples: 2043 download_size: 74472384 dataset_size: 124364870 - config_name: mr-tydi_finnish_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 905421 num_examples: 10 download_size: 350451 dataset_size: 905421 - config_name: mr-tydi_finnish_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 310536 num_examples: 4 download_size: 145261 dataset_size: 310536 - config_name: mr-tydi_indonesian_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 141798318 
num_examples: 2310 download_size: 77676550 dataset_size: 141798318 - config_name: mr-tydi_indonesian_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 30582758 num_examples: 415 download_size: 17064542 dataset_size: 30582758 - config_name: mr-tydi_indonesian_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 4960304 num_examples: 60 download_size: 2668392 dataset_size: 4960304 - config_name: mr-tydi_indonesian_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1183880 num_examples: 11 download_size: 418275 dataset_size: 1183880 - config_name: mr-tydi_indonesian_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 488368 num_examples: 5 download_size: 239425 dataset_size: 488368 - config_name: mr-tydi_indonesian_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 144492491 num_examples: 2098 download_size: 80151689 dataset_size: 144492491 - config_name: mr-tydi_indonesian_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 465861 num_examples: 3 download_size: 243140 dataset_size: 465861 - config_name: mr-tydi_japanese_len-0-500 features: - name: query dtype: string - name: pos list: 
string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 66287105 num_examples: 1104 download_size: 37312208 dataset_size: 66287105 - config_name: mr-tydi_japanese_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 37895545 num_examples: 507 download_size: 21353355 dataset_size: 37895545 - config_name: mr-tydi_japanese_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 11177970 num_examples: 135 download_size: 6191742 dataset_size: 11177970 - config_name: mr-tydi_japanese_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 3898518 num_examples: 44 download_size: 2107551 dataset_size: 3898518 - config_name: mr-tydi_japanese_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 3385346 num_examples: 34 download_size: 1744962 dataset_size: 3385346 - config_name: mr-tydi_japanese_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 129362291 num_examples: 1857 download_size: 72929065 dataset_size: 129362291 - config_name: mr-tydi_japanese_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1031689 num_examples: 9 
download_size: 531611 dataset_size: 1031689 - config_name: mr-tydi_japanese_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 780374 num_examples: 7 download_size: 300014 dataset_size: 780374 - config_name: mr-tydi_korean_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 17252978 num_examples: 293 download_size: 9847774 dataset_size: 17252978 - config_name: mr-tydi_korean_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 13308245 num_examples: 173 download_size: 7635761 dataset_size: 13308245 - config_name: mr-tydi_korean_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1874814 num_examples: 22 download_size: 983812 dataset_size: 1874814 - config_name: mr-tydi_korean_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 123949 num_examples: 1 download_size: 79337 dataset_size: 123949 - config_name: mr-tydi_korean_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 55548388 num_examples: 801 download_size: 32020512 dataset_size: 55548388 - config_name: mr-tydi_korean_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores 
list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 506148 num_examples: 5 download_size: 204625 dataset_size: 506148 - config_name: mr-tydi_russian_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 107148401 num_examples: 1139 download_size: 52008602 dataset_size: 107148401 - config_name: mr-tydi_russian_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 96921758 num_examples: 787 download_size: 47648140 dataset_size: 96921758 - config_name: mr-tydi_russian_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 13253241 num_examples: 102 download_size: 6496787 dataset_size: 13253241 - config_name: mr-tydi_russian_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 5145699 num_examples: 41 download_size: 2500744 dataset_size: 5145699 - config_name: mr-tydi_russian_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 2115118 num_examples: 17 download_size: 941263 dataset_size: 2115118 - config_name: mr-tydi_russian_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 371247497 num_examples: 3264 download_size: 183091309 dataset_size: 371247497 - config_name: 
mr-tydi_russian_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1691246 num_examples: 11 download_size: 799287 dataset_size: 1691246 - config_name: mr-tydi_russian_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 486505 num_examples: 3 download_size: 239819 dataset_size: 486505 - config_name: mr-tydi_russian_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 336319 num_examples: 2 download_size: 169746 dataset_size: 336319 - config_name: mr-tydi_swahili_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 32623622 num_examples: 891 download_size: 18315585 dataset_size: 32623622 - config_name: mr-tydi_swahili_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 11660403 num_examples: 235 download_size: 6526721 dataset_size: 11660403 - config_name: mr-tydi_swahili_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 878141 num_examples: 20 download_size: 332091 dataset_size: 878141 - config_name: mr-tydi_swahili_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - 
name: train num_bytes: 1549496 num_examples: 35 download_size: 446721 dataset_size: 1549496 - config_name: mr-tydi_swahili_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 37994264 num_examples: 891 download_size: 21852760 dataset_size: 37994264 - config_name: mr-tydi_telugu_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 3810906 num_examples: 30 download_size: 1342241 dataset_size: 3810906 - config_name: mr-tydi_telugu_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 506945921 num_examples: 2574 download_size: 155659518 dataset_size: 506945921 - config_name: mr-tydi_telugu_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 39824999 num_examples: 187 download_size: 13700530 dataset_size: 39824999 - config_name: mr-tydi_telugu_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 7929121 num_examples: 35 download_size: 2926610 dataset_size: 7929121 - config_name: mr-tydi_telugu_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1595960 num_examples: 7 download_size: 634904 dataset_size: 1595960 - config_name: mr-tydi_telugu_len-500-1000 features: - name: query dtype: string - name: 
pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 184695730 num_examples: 1039 download_size: 71256267 dataset_size: 184695730 - config_name: mr-tydi_telugu_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 2089040 num_examples: 8 download_size: 740666 dataset_size: 2089040 - config_name: mr-tydi_thai_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 114296856 num_examples: 843 download_size: 42487524 dataset_size: 114296856 - config_name: mr-tydi_thai_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 89508170 num_examples: 532 download_size: 33086505 dataset_size: 89508170 - config_name: mr-tydi_thai_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 27441702 num_examples: 151 download_size: 10173779 dataset_size: 27441702 - config_name: mr-tydi_thai_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 7604146 num_examples: 38 download_size: 2728986 dataset_size: 7604146 - config_name: mr-tydi_thai_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1261330 num_examples: 6 download_size: 438155 
dataset_size: 1261330 - config_name: mr-tydi_thai_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 267519554 num_examples: 1740 download_size: 99804944 dataset_size: 267519554 - config_name: mr-tydi_thai_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 2323538 num_examples: 9 download_size: 554048 dataset_size: 2323538 - config_name: msmarco_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 36846421571 num_examples: 476968 download_size: 17203505423 dataset_size: 36846421571 - config_name: msmarco_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 221880847 num_examples: 2655 download_size: 110073583 dataset_size: 221880847 - config_name: msmarco_len-2000-3000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 76254058 num_examples: 859 download_size: 37992903 dataset_size: 76254058 - config_name: msmarco_len-3000-4000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 46382653 num_examples: 430 download_size: 21559992 dataset_size: 46382653 - config_name: msmarco_len-4000-5000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: 
neg_scores list: float64 splits: - name: train num_bytes: 13365567 num_examples: 120 download_size: 6017121 dataset_size: 13365567 - config_name: msmarco_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 335779279 num_examples: 4245 download_size: 167544770 dataset_size: 335779279 - config_name: msmarco_len-5000-6000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 7235495 num_examples: 72 download_size: 3556526 dataset_size: 7235495 - config_name: msmarco_len-6000-7000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 14032531 num_examples: 110 download_size: 6482430 dataset_size: 14032531 - config_name: msmarco_len-7000-inf features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 79268072 num_examples: 446 download_size: 42191834 dataset_size: 79268072 - config_name: nli_for_simcse_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 131931702 num_examples: 274951 download_size: 82321217 dataset_size: 131931702 - config_name: nq_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 3739882427 num_examples: 58554 download_size: 2078742331 dataset_size: 3739882427 - config_name: nq_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores 
list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 905220 num_examples: 14 download_size: 449370 dataset_size: 905220 - config_name: pubmed_qa_labeled_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 4107058 num_examples: 500 download_size: 2303044 dataset_size: 4107058 - config_name: squad_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 512895039 num_examples: 85710 download_size: 295437582 dataset_size: 512895039 - config_name: squad_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 14482542 num_examples: 1889 download_size: 8514097 dataset_size: 14482542 - config_name: t2ranking_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 45761264 num_examples: 3837 download_size: 30374711 dataset_size: 45761264 - config_name: t2ranking_len-500-1000 features: - name: query dtype: string - name: pos list: string - name: neg list: string - name: pos_scores list: float64 - name: neg_scores list: float64 splits: - name: train num_bytes: 1645794162 num_examples: 86630 download_size: 1075778601 dataset_size: 1645794162 - config_name: trivia_len-0-500 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 3840129530 num_examples: 60283 download_size: 2039630774 dataset_size: 3840129530 - config_name: trivia_len-1000-2000 features: - name: query dtype: string - name: pos list: string - name: neg list: string splits: - name: train num_bytes: 461799 num_examples: 7 download_size: 121400 dataset_size: 461799 - config_name: trivia_len-500-1000 features: - name: query dtype: string - name: pos list: 
string - name: neg list: string splits: - name: train num_bytes: 1645921 num_examples: 25 download_size: 836676 dataset_size: 1645921 configs: - config_name: ATEC_len-0-500 data_files: - split: train path: ATEC_len-0-500/train-* - config_name: BQ_len-0-500 data_files: - split: train path: BQ_len-0-500/train-* - config_name: LCQMC_len-0-500 data_files: - split: train path: LCQMC_len-0-500/train-* - config_name: PAWSX_len-0-500 data_files: - split: train path: PAWSX_len-0-500/train-* - config_name: QBQTC_v2_len-0-500 data_files: - split: train path: QBQTC_v2_len-0-500/train-* - config_name: STS-B_len-0-500 data_files: - split: train path: STS-B_len-0-500/train-* - config_name: afqmc_len-0-500 data_files: - split: train path: afqmc_len-0-500/train-* - config_name: cMedQAv2_len-0-500 data_files: - split: train path: cMedQAv2_len-0-500/train-* - config_name: colliee_len-0-500 data_files: - split: train path: colliee_len-0-500/train-* - config_name: dureader_len-0-500 data_files: - split: train path: dureader_len-0-500/train-* - config_name: dureader_len-1000-2000 data_files: - split: train path: dureader_len-1000-2000/train-* - config_name: dureader_len-2000-3000 data_files: - split: train path: dureader_len-2000-3000/train-* - config_name: dureader_len-3000-4000 data_files: - split: train path: dureader_len-3000-4000/train-* - config_name: dureader_len-4000-5000 data_files: - split: train path: dureader_len-4000-5000/train-* - config_name: dureader_len-500-1000 data_files: - split: train path: dureader_len-500-1000/train-* - config_name: dureader_len-5000-6000 data_files: - split: train path: dureader_len-5000-6000/train-* - config_name: dureader_len-6000-7000 data_files: - split: train path: dureader_len-6000-7000/train-* - config_name: dureader_len-7000-inf data_files: - split: train path: dureader_len-7000-inf/train-* - config_name: hotpotqa_len-0-500 data_files: - split: train path: hotpotqa_len-0-500/train-* - config_name: hotpotqa_len-500-1000 data_files: - 
split: train path: hotpotqa_len-500-1000/train-* - config_name: law_gpt_len-0-500 data_files: - split: train path: law_gpt_len-0-500/train-* - config_name: lecardv2_len-7000-inf data_files: - split: train path: lecardv2_len-7000-inf/train-* - config_name: miracl_ar_len-0-500 data_files: - split: train path: miracl_ar_len-0-500/train-* - config_name: miracl_ar_len-1000-2000 data_files: - split: train path: miracl_ar_len-1000-2000/train-* - config_name: miracl_ar_len-2000-3000 data_files: - split: train path: miracl_ar_len-2000-3000/train-* - config_name: miracl_ar_len-3000-4000 data_files: - split: train path: miracl_ar_len-3000-4000/train-* - config_name: miracl_ar_len-4000-5000 data_files: - split: train path: miracl_ar_len-4000-5000/train-* - config_name: miracl_ar_len-500-1000 data_files: - split: train path: miracl_ar_len-500-1000/train-* - config_name: miracl_ar_len-5000-6000 data_files: - split: train path: miracl_ar_len-5000-6000/train-* - config_name: miracl_ar_len-6000-7000 data_files: - split: train path: miracl_ar_len-6000-7000/train-* - config_name: miracl_ar_len-7000-inf data_files: - split: train path: miracl_ar_len-7000-inf/train-* - config_name: miracl_bn_len-0-500 data_files: - split: train path: miracl_bn_len-0-500/train-* - config_name: miracl_bn_len-1000-2000 data_files: - split: train path: miracl_bn_len-1000-2000/train-* - config_name: miracl_bn_len-2000-3000 data_files: - split: train path: miracl_bn_len-2000-3000/train-* - config_name: miracl_bn_len-3000-4000 data_files: - split: train path: miracl_bn_len-3000-4000/train-* - config_name: miracl_bn_len-4000-5000 data_files: - split: train path: miracl_bn_len-4000-5000/train-* - config_name: miracl_bn_len-500-1000 data_files: - split: train path: miracl_bn_len-500-1000/train-* - config_name: miracl_bn_len-5000-6000 data_files: - split: train path: miracl_bn_len-5000-6000/train-* - config_name: miracl_en_len-0-500 data_files: - split: train path: miracl_en_len-0-500/train-* - config_name: 
miracl_en_len-1000-2000 data_files: - split: train path: miracl_en_len-1000-2000/train-* - config_name: miracl_en_len-2000-3000 data_files: - split: train path: miracl_en_len-2000-3000/train-* - config_name: miracl_en_len-3000-4000 data_files: - split: train path: miracl_en_len-3000-4000/train-* - config_name: miracl_en_len-500-1000 data_files: - split: train path: miracl_en_len-500-1000/train-* - config_name: miracl_es_len-0-500 data_files: - split: train path: miracl_es_len-0-500/train-* - config_name: miracl_es_len-1000-2000 data_files: - split: train path: miracl_es_len-1000-2000/train-* - config_name: miracl_es_len-2000-3000 data_files: - split: train path: miracl_es_len-2000-3000/train-* - config_name: miracl_es_len-3000-4000 data_files: - split: train path: miracl_es_len-3000-4000/train-* - config_name: miracl_es_len-4000-5000 data_files: - split: train path: miracl_es_len-4000-5000/train-* - config_name: miracl_es_len-500-1000 data_files: - split: train path: miracl_es_len-500-1000/train-* - config_name: miracl_fa_len-0-500 data_files: - split: train path: miracl_fa_len-0-500/train-* - config_name: miracl_fa_len-1000-2000 data_files: - split: train path: miracl_fa_len-1000-2000/train-* - config_name: miracl_fa_len-2000-3000 data_files: - split: train path: miracl_fa_len-2000-3000/train-* - config_name: miracl_fa_len-3000-4000 data_files: - split: train path: miracl_fa_len-3000-4000/train-* - config_name: miracl_fa_len-500-1000 data_files: - split: train path: miracl_fa_len-500-1000/train-* - config_name: miracl_fa_len-7000-inf data_files: - split: train path: miracl_fa_len-7000-inf/train-* - config_name: miracl_fi_len-0-500 data_files: - split: train path: miracl_fi_len-0-500/train-* - config_name: miracl_fi_len-1000-2000 data_files: - split: train path: miracl_fi_len-1000-2000/train-* - config_name: miracl_fi_len-2000-3000 data_files: - split: train path: miracl_fi_len-2000-3000/train-* - config_name: miracl_fi_len-500-1000 data_files: - split: train path: 
miracl_fi_len-500-1000/train-* - config_name: miracl_fr_len-0-500 data_files: - split: train path: miracl_fr_len-0-500/train-* - config_name: miracl_fr_len-1000-2000 data_files: - split: train path: miracl_fr_len-1000-2000/train-* - config_name: miracl_fr_len-2000-3000 data_files: - split: train path: miracl_fr_len-2000-3000/train-* - config_name: miracl_fr_len-500-1000 data_files: - split: train path: miracl_fr_len-500-1000/train-* - config_name: miracl_hi_len-0-500 data_files: - split: train path: miracl_hi_len-0-500/train-* - config_name: miracl_hi_len-1000-2000 data_files: - split: train path: miracl_hi_len-1000-2000/train-* - config_name: miracl_hi_len-2000-3000 data_files: - split: train path: miracl_hi_len-2000-3000/train-* - config_name: miracl_hi_len-3000-4000 data_files: - split: train path: miracl_hi_len-3000-4000/train-* - config_name: miracl_hi_len-4000-5000 data_files: - split: train path: miracl_hi_len-4000-5000/train-* - config_name: miracl_hi_len-500-1000 data_files: - split: train path: miracl_hi_len-500-1000/train-* - config_name: miracl_hi_len-5000-6000 data_files: - split: train path: miracl_hi_len-5000-6000/train-* - config_name: miracl_hi_len-7000-inf data_files: - split: train path: miracl_hi_len-7000-inf/train-* - config_name: miracl_id_len-0-500 data_files: - split: train path: miracl_id_len-0-500/train-* - config_name: miracl_id_len-1000-2000 data_files: - split: train path: miracl_id_len-1000-2000/train-* - config_name: miracl_id_len-2000-3000 data_files: - split: train path: miracl_id_len-2000-3000/train-* - config_name: miracl_id_len-3000-4000 data_files: - split: train path: miracl_id_len-3000-4000/train-* - config_name: miracl_id_len-500-1000 data_files: - split: train path: miracl_id_len-500-1000/train-* - config_name: miracl_ja_len-0-500 data_files: - split: train path: miracl_ja_len-0-500/train-* - config_name: miracl_ja_len-1000-2000 data_files: - split: train path: miracl_ja_len-1000-2000/train-* - config_name: 
miracl_ja_len-2000-3000 data_files: - split: train path: miracl_ja_len-2000-3000/train-* - config_name: miracl_ja_len-3000-4000 data_files: - split: train path: miracl_ja_len-3000-4000/train-* - config_name: miracl_ja_len-4000-5000 data_files: - split: train path: miracl_ja_len-4000-5000/train-* - config_name: miracl_ja_len-500-1000 data_files: - split: train path: miracl_ja_len-500-1000/train-* - config_name: miracl_ja_len-6000-7000 data_files: - split: train path: miracl_ja_len-6000-7000/train-* - config_name: miracl_ja_len-7000-inf data_files: - split: train path: miracl_ja_len-7000-inf/train-* - config_name: miracl_ko_len-0-500 data_files: - split: train path: miracl_ko_len-0-500/train-* - config_name: miracl_ko_len-1000-2000 data_files: - split: train path: miracl_ko_len-1000-2000/train-* - config_name: miracl_ko_len-2000-3000 data_files: - split: train path: miracl_ko_len-2000-3000/train-* - config_name: miracl_ko_len-3000-4000 data_files: - split: train path: miracl_ko_len-3000-4000/train-* - config_name: miracl_ko_len-500-1000 data_files: - split: train path: miracl_ko_len-500-1000/train-* - config_name: miracl_ko_len-5000-6000 data_files: - split: train path: miracl_ko_len-5000-6000/train-* - config_name: miracl_ko_len-7000-inf data_files: - split: train path: miracl_ko_len-7000-inf/train-* - config_name: miracl_ru_len-0-500 data_files: - split: train path: miracl_ru_len-0-500/train-* - config_name: miracl_ru_len-1000-2000 data_files: - split: train path: miracl_ru_len-1000-2000/train-* - config_name: miracl_ru_len-2000-3000 data_files: - split: train path: miracl_ru_len-2000-3000/train-* - config_name: miracl_ru_len-3000-4000 data_files: - split: train path: miracl_ru_len-3000-4000/train-* - config_name: miracl_ru_len-4000-5000 data_files: - split: train path: miracl_ru_len-4000-5000/train-* - config_name: miracl_ru_len-500-1000 data_files: - split: train path: miracl_ru_len-500-1000/train-* - config_name: miracl_ru_len-7000-inf data_files: - split: train 
path: miracl_ru_len-7000-inf/train-* - config_name: miracl_sw_len-0-500 data_files: - split: train path: miracl_sw_len-0-500/train-* - config_name: miracl_sw_len-1000-2000 data_files: - split: train path: miracl_sw_len-1000-2000/train-* - config_name: miracl_sw_len-3000-4000 data_files: - split: train path: miracl_sw_len-3000-4000/train-* - config_name: miracl_sw_len-500-1000 data_files: - split: train path: miracl_sw_len-500-1000/train-* - config_name: miracl_te_len-0-500 data_files: - split: train path: miracl_te_len-0-500/train-* - config_name: miracl_te_len-1000-2000 data_files: - split: train path: miracl_te_len-1000-2000/train-* - config_name: miracl_te_len-2000-3000 data_files: - split: train path: miracl_te_len-2000-3000/train-* - config_name: miracl_te_len-3000-4000 data_files: - split: train path: miracl_te_len-3000-4000/train-* - config_name: miracl_te_len-4000-5000 data_files: - split: train path: miracl_te_len-4000-5000/train-* - config_name: miracl_te_len-500-1000 data_files: - split: train path: miracl_te_len-500-1000/train-* - config_name: miracl_te_len-5000-6000 data_files: - split: train path: miracl_te_len-5000-6000/train-* - config_name: miracl_th_len-0-500 data_files: - split: train path: miracl_th_len-0-500/train-* - config_name: miracl_th_len-1000-2000 data_files: - split: train path: miracl_th_len-1000-2000/train-* - config_name: miracl_th_len-2000-3000 data_files: - split: train path: miracl_th_len-2000-3000/train-* - config_name: miracl_th_len-3000-4000 data_files: - split: train path: miracl_th_len-3000-4000/train-* - config_name: miracl_th_len-4000-5000 data_files: - split: train path: miracl_th_len-4000-5000/train-* - config_name: miracl_th_len-500-1000 data_files: - split: train path: miracl_th_len-500-1000/train-* - config_name: miracl_zh_len-0-500 data_files: - split: train path: miracl_zh_len-0-500/train-* - config_name: miracl_zh_len-1000-2000 data_files: - split: train path: miracl_zh_len-1000-2000/train-* - config_name: 
miracl_zh_len-2000-3000 data_files: - split: train path: miracl_zh_len-2000-3000/train-* - config_name: miracl_zh_len-500-1000 data_files: - split: train path: miracl_zh_len-500-1000/train-* - config_name: mldr_ar_len-4000-5000 data_files: - split: train path: mldr_ar_len-4000-5000/train-* - config_name: mldr_ar_len-5000-6000 data_files: - split: train path: mldr_ar_len-5000-6000/train-* - config_name: mldr_ar_len-6000-7000 data_files: - split: train path: mldr_ar_len-6000-7000/train-* - config_name: mldr_ar_len-7000-inf data_files: - split: train path: mldr_ar_len-7000-inf/train-* - config_name: mldr_de_len-5000-6000 data_files: - split: train path: mldr_de_len-5000-6000/train-* - config_name: mldr_de_len-6000-7000 data_files: - split: train path: mldr_de_len-6000-7000/train-* - config_name: mldr_de_len-7000-inf data_files: - split: train path: mldr_de_len-7000-inf/train-* - config_name: mldr_en_len-2000-3000 data_files: - split: train path: mldr_en_len-2000-3000/train-* - config_name: mldr_en_len-3000-4000 data_files: - split: train path: mldr_en_len-3000-4000/train-* - config_name: mldr_en_len-4000-5000 data_files: - split: train path: mldr_en_len-4000-5000/train-* - config_name: mldr_en_len-5000-6000 data_files: - split: train path: mldr_en_len-5000-6000/train-* - config_name: mldr_en_len-6000-7000 data_files: - split: train path: mldr_en_len-6000-7000/train-* - config_name: mldr_en_len-7000-inf data_files: - split: train path: mldr_en_len-7000-inf/train-* - config_name: mldr_es_len-4000-5000 data_files: - split: train path: mldr_es_len-4000-5000/train-* - config_name: mldr_es_len-5000-6000 data_files: - split: train path: mldr_es_len-5000-6000/train-* - config_name: mldr_es_len-6000-7000 data_files: - split: train path: mldr_es_len-6000-7000/train-* - config_name: mldr_es_len-7000-inf data_files: - split: train path: mldr_es_len-7000-inf/train-* - config_name: mldr_fr_len-4000-5000 data_files: - split: train path: mldr_fr_len-4000-5000/train-* - config_name: 
mldr_fr_len-5000-6000 data_files: - split: train path: mldr_fr_len-5000-6000/train-* - config_name: mldr_fr_len-6000-7000 data_files: - split: train path: mldr_fr_len-6000-7000/train-* - config_name: mldr_fr_len-7000-inf data_files: - split: train path: mldr_fr_len-7000-inf/train-* - config_name: mldr_hi_len-5000-6000 data_files: - split: train path: mldr_hi_len-5000-6000/train-* - config_name: mldr_hi_len-6000-7000 data_files: - split: train path: mldr_hi_len-6000-7000/train-* - config_name: mldr_hi_len-7000-inf data_files: - split: train path: mldr_hi_len-7000-inf/train-* - config_name: mldr_it_len-5000-6000 data_files: - split: train path: mldr_it_len-5000-6000/train-* - config_name: mldr_it_len-6000-7000 data_files: - split: train path: mldr_it_len-6000-7000/train-* - config_name: mldr_it_len-7000-inf data_files: - split: train path: mldr_it_len-7000-inf/train-* - config_name: mldr_ja_len-2000-3000 data_files: - split: train path: mldr_ja_len-2000-3000/train-* - config_name: mldr_ja_len-4000-5000 data_files: - split: train path: mldr_ja_len-4000-5000/train-* - config_name: mldr_ja_len-5000-6000 data_files: - split: train path: mldr_ja_len-5000-6000/train-* - config_name: mldr_ja_len-6000-7000 data_files: - split: train path: mldr_ja_len-6000-7000/train-* - config_name: mldr_ja_len-7000-inf data_files: - split: train path: mldr_ja_len-7000-inf/train-* - config_name: mldr_ko_len-3000-4000 data_files: - split: train path: mldr_ko_len-3000-4000/train-* - config_name: mldr_ko_len-4000-5000 data_files: - split: train path: mldr_ko_len-4000-5000/train-* - config_name: mldr_ko_len-5000-6000 data_files: - split: train path: mldr_ko_len-5000-6000/train-* - config_name: mldr_ko_len-6000-7000 data_files: - split: train path: mldr_ko_len-6000-7000/train-* - config_name: mldr_ko_len-7000-inf data_files: - split: train path: mldr_ko_len-7000-inf/train-* - config_name: mldr_pt_len-5000-6000 data_files: - split: train path: mldr_pt_len-5000-6000/train-* - config_name: 
mldr_pt_len-6000-7000 data_files: - split: train path: mldr_pt_len-6000-7000/train-* - config_name: mldr_pt_len-7000-inf data_files: - split: train path: mldr_pt_len-7000-inf/train-* - config_name: mldr_ru_len-3000-4000 data_files: - split: train path: mldr_ru_len-3000-4000/train-* - config_name: mldr_ru_len-5000-6000 data_files: - split: train path: mldr_ru_len-5000-6000/train-* - config_name: mldr_ru_len-6000-7000 data_files: - split: train path: mldr_ru_len-6000-7000/train-* - config_name: mldr_ru_len-7000-inf data_files: - split: train path: mldr_ru_len-7000-inf/train-* - config_name: mldr_th_len-1000-2000 data_files: - split: train path: mldr_th_len-1000-2000/train-* - config_name: mldr_th_len-2000-3000 data_files: - split: train path: mldr_th_len-2000-3000/train-* - config_name: mldr_th_len-3000-4000 data_files: - split: train path: mldr_th_len-3000-4000/train-* - config_name: mldr_th_len-4000-5000 data_files: - split: train path: mldr_th_len-4000-5000/train-* - config_name: mldr_th_len-5000-6000 data_files: - split: train path: mldr_th_len-5000-6000/train-* - config_name: mldr_th_len-6000-7000 data_files: - split: train path: mldr_th_len-6000-7000/train-* - config_name: mldr_th_len-7000-inf data_files: - split: train path: mldr_th_len-7000-inf/train-* - config_name: mldr_zh_len-1000-2000 data_files: - split: train path: mldr_zh_len-1000-2000/train-* - config_name: mldr_zh_len-2000-3000 data_files: - split: train path: mldr_zh_len-2000-3000/train-* - config_name: mldr_zh_len-3000-4000 data_files: - split: train path: mldr_zh_len-3000-4000/train-* - config_name: mldr_zh_len-4000-5000 data_files: - split: train path: mldr_zh_len-4000-5000/train-* - config_name: mldr_zh_len-5000-6000 data_files: - split: train path: mldr_zh_len-5000-6000/train-* - config_name: mldr_zh_len-6000-7000 data_files: - split: train path: mldr_zh_len-6000-7000/train-* - config_name: mldr_zh_len-7000-inf data_files: - split: train path: mldr_zh_len-7000-inf/train-* - config_name: 
mmarco_chinese_len-0-500 data_files: - split: train path: mmarco_chinese_len-0-500/train-* - config_name: mr-tydi_arabic_len-0-500 data_files: - split: train path: mr-tydi_arabic_len-0-500/train-* - config_name: mr-tydi_arabic_len-1000-2000 data_files: - split: train path: mr-tydi_arabic_len-1000-2000/train-* - config_name: mr-tydi_arabic_len-2000-3000 data_files: - split: train path: mr-tydi_arabic_len-2000-3000/train-* - config_name: mr-tydi_arabic_len-3000-4000 data_files: - split: train path: mr-tydi_arabic_len-3000-4000/train-* - config_name: mr-tydi_arabic_len-4000-5000 data_files: - split: train path: mr-tydi_arabic_len-4000-5000/train-* - config_name: mr-tydi_arabic_len-500-1000 data_files: - split: train path: mr-tydi_arabic_len-500-1000/train-* - config_name: mr-tydi_arabic_len-5000-6000 data_files: - split: train path: mr-tydi_arabic_len-5000-6000/train-* - config_name: mr-tydi_arabic_len-6000-7000 data_files: - split: train path: mr-tydi_arabic_len-6000-7000/train-* - config_name: mr-tydi_arabic_len-7000-inf data_files: - split: train path: mr-tydi_arabic_len-7000-inf/train-* - config_name: mr-tydi_bengali_len-0-500 data_files: - split: train path: mr-tydi_bengali_len-0-500/train-* - config_name: mr-tydi_bengali_len-1000-2000 data_files: - split: train path: mr-tydi_bengali_len-1000-2000/train-* - config_name: mr-tydi_bengali_len-2000-3000 data_files: - split: train path: mr-tydi_bengali_len-2000-3000/train-* - config_name: mr-tydi_bengali_len-3000-4000 data_files: - split: train path: mr-tydi_bengali_len-3000-4000/train-* - config_name: mr-tydi_bengali_len-4000-5000 data_files: - split: train path: mr-tydi_bengali_len-4000-5000/train-* - config_name: mr-tydi_bengali_len-500-1000 data_files: - split: train path: mr-tydi_bengali_len-500-1000/train-* - config_name: mr-tydi_bengali_len-5000-6000 data_files: - split: train path: mr-tydi_bengali_len-5000-6000/train-* - config_name: mr-tydi_bengali_len-6000-7000 data_files: - split: train path: 
mr-tydi_bengali_len-6000-7000/train-* - config_name: mr-tydi_english_len-0-500 data_files: - split: train path: mr-tydi_english_len-0-500/train-* - config_name: mr-tydi_english_len-1000-2000 data_files: - split: train path: mr-tydi_english_len-1000-2000/train-* - config_name: mr-tydi_english_len-2000-3000 data_files: - split: train path: mr-tydi_english_len-2000-3000/train-* - config_name: mr-tydi_english_len-3000-4000 data_files: - split: train path: mr-tydi_english_len-3000-4000/train-* - config_name: mr-tydi_english_len-4000-5000 data_files: - split: train path: mr-tydi_english_len-4000-5000/train-* - config_name: mr-tydi_english_len-500-1000 data_files: - split: train path: mr-tydi_english_len-500-1000/train-* - config_name: mr-tydi_english_len-5000-6000 data_files: - split: train path: mr-tydi_english_len-5000-6000/train-* - config_name: mr-tydi_finnish_len-0-500 data_files: - split: train path: mr-tydi_finnish_len-0-500/train-* - config_name: mr-tydi_finnish_len-1000-2000 data_files: - split: train path: mr-tydi_finnish_len-1000-2000/train-* - config_name: mr-tydi_finnish_len-2000-3000 data_files: - split: train path: mr-tydi_finnish_len-2000-3000/train-* - config_name: mr-tydi_finnish_len-3000-4000 data_files: - split: train path: mr-tydi_finnish_len-3000-4000/train-* - config_name: mr-tydi_finnish_len-4000-5000 data_files: - split: train path: mr-tydi_finnish_len-4000-5000/train-* - config_name: mr-tydi_finnish_len-500-1000 data_files: - split: train path: mr-tydi_finnish_len-500-1000/train-* - config_name: mr-tydi_finnish_len-6000-7000 data_files: - split: train path: mr-tydi_finnish_len-6000-7000/train-* - config_name: mr-tydi_finnish_len-7000-inf data_files: - split: train path: mr-tydi_finnish_len-7000-inf/train-* - config_name: mr-tydi_indonesian_len-0-500 data_files: - split: train path: mr-tydi_indonesian_len-0-500/train-* - config_name: mr-tydi_indonesian_len-1000-2000 data_files: - split: train path: mr-tydi_indonesian_len-1000-2000/train-* - 
config_name: mr-tydi_indonesian_len-2000-3000 data_files: - split: train path: mr-tydi_indonesian_len-2000-3000/train-* - config_name: mr-tydi_indonesian_len-3000-4000 data_files: - split: train path: mr-tydi_indonesian_len-3000-4000/train-* - config_name: mr-tydi_indonesian_len-4000-5000 data_files: - split: train path: mr-tydi_indonesian_len-4000-5000/train-* - config_name: mr-tydi_indonesian_len-500-1000 data_files: - split: train path: mr-tydi_indonesian_len-500-1000/train-* - config_name: mr-tydi_indonesian_len-5000-6000 data_files: - split: train path: mr-tydi_indonesian_len-5000-6000/train-* - config_name: mr-tydi_japanese_len-0-500 data_files: - split: train path: mr-tydi_japanese_len-0-500/train-* - config_name: mr-tydi_japanese_len-1000-2000 data_files: - split: train path: mr-tydi_japanese_len-1000-2000/train-* - config_name: mr-tydi_japanese_len-2000-3000 data_files: - split: train path: mr-tydi_japanese_len-2000-3000/train-* - config_name: mr-tydi_japanese_len-3000-4000 data_files: - split: train path: mr-tydi_japanese_len-3000-4000/train-* - config_name: mr-tydi_japanese_len-4000-5000 data_files: - split: train path: mr-tydi_japanese_len-4000-5000/train-* - config_name: mr-tydi_japanese_len-500-1000 data_files: - split: train path: mr-tydi_japanese_len-500-1000/train-* - config_name: mr-tydi_japanese_len-5000-6000 data_files: - split: train path: mr-tydi_japanese_len-5000-6000/train-* - config_name: mr-tydi_japanese_len-6000-7000 data_files: - split: train path: mr-tydi_japanese_len-6000-7000/train-* - config_name: mr-tydi_korean_len-0-500 data_files: - split: train path: mr-tydi_korean_len-0-500/train-* - config_name: mr-tydi_korean_len-1000-2000 data_files: - split: train path: mr-tydi_korean_len-1000-2000/train-* - config_name: mr-tydi_korean_len-2000-3000 data_files: - split: train path: mr-tydi_korean_len-2000-3000/train-* - config_name: mr-tydi_korean_len-3000-4000 data_files: - split: train path: mr-tydi_korean_len-3000-4000/train-* - 
config_name: mr-tydi_korean_len-500-1000 data_files: - split: train path: mr-tydi_korean_len-500-1000/train-* - config_name: mr-tydi_korean_len-7000-inf data_files: - split: train path: mr-tydi_korean_len-7000-inf/train-* - config_name: mr-tydi_russian_len-0-500 data_files: - split: train path: mr-tydi_russian_len-0-500/train-* - config_name: mr-tydi_russian_len-1000-2000 data_files: - split: train path: mr-tydi_russian_len-1000-2000/train-* - config_name: mr-tydi_russian_len-2000-3000 data_files: - split: train path: mr-tydi_russian_len-2000-3000/train-* - config_name: mr-tydi_russian_len-3000-4000 data_files: - split: train path: mr-tydi_russian_len-3000-4000/train-* - config_name: mr-tydi_russian_len-4000-5000 data_files: - split: train path: mr-tydi_russian_len-4000-5000/train-* - config_name: mr-tydi_russian_len-500-1000 data_files: - split: train path: mr-tydi_russian_len-500-1000/train-* - config_name: mr-tydi_russian_len-5000-6000 data_files: - split: train path: mr-tydi_russian_len-5000-6000/train-* - config_name: mr-tydi_russian_len-6000-7000 data_files: - split: train path: mr-tydi_russian_len-6000-7000/train-* - config_name: mr-tydi_russian_len-7000-inf data_files: - split: train path: mr-tydi_russian_len-7000-inf/train-* - config_name: mr-tydi_swahili_len-0-500 data_files: - split: train path: mr-tydi_swahili_len-0-500/train-* - config_name: mr-tydi_swahili_len-1000-2000 data_files: - split: train path: mr-tydi_swahili_len-1000-2000/train-* - config_name: mr-tydi_swahili_len-2000-3000 data_files: - split: train path: mr-tydi_swahili_len-2000-3000/train-* - config_name: mr-tydi_swahili_len-3000-4000 data_files: - split: train path: mr-tydi_swahili_len-3000-4000/train-* - config_name: mr-tydi_swahili_len-500-1000 data_files: - split: train path: mr-tydi_swahili_len-500-1000/train-* - config_name: mr-tydi_telugu_len-0-500 data_files: - split: train path: mr-tydi_telugu_len-0-500/train-* - config_name: mr-tydi_telugu_len-1000-2000 data_files: - split: 
train path: mr-tydi_telugu_len-1000-2000/train-* - config_name: mr-tydi_telugu_len-2000-3000 data_files: - split: train path: mr-tydi_telugu_len-2000-3000/train-* - config_name: mr-tydi_telugu_len-3000-4000 data_files: - split: train path: mr-tydi_telugu_len-3000-4000/train-* - config_name: mr-tydi_telugu_len-4000-5000 data_files: - split: train path: mr-tydi_telugu_len-4000-5000/train-* - config_name: mr-tydi_telugu_len-500-1000 data_files: - split: train path: mr-tydi_telugu_len-500-1000/train-* - config_name: mr-tydi_telugu_len-5000-6000 data_files: - split: train path: mr-tydi_telugu_len-5000-6000/train-* - config_name: mr-tydi_thai_len-0-500 data_files: - split: train path: mr-tydi_thai_len-0-500/train-* - config_name: mr-tydi_thai_len-1000-2000 data_files: - split: train path: mr-tydi_thai_len-1000-2000/train-* - config_name: mr-tydi_thai_len-2000-3000 data_files: - split: train path: mr-tydi_thai_len-2000-3000/train-* - config_name: mr-tydi_thai_len-3000-4000 data_files: - split: train path: mr-tydi_thai_len-3000-4000/train-* - config_name: mr-tydi_thai_len-4000-5000 data_files: - split: train path: mr-tydi_thai_len-4000-5000/train-* - config_name: mr-tydi_thai_len-500-1000 data_files: - split: train path: mr-tydi_thai_len-500-1000/train-* - config_name: mr-tydi_thai_len-5000-6000 data_files: - split: train path: mr-tydi_thai_len-5000-6000/train-* - config_name: msmarco_len-0-500 data_files: - split: train path: msmarco_len-0-500/train-* - config_name: msmarco_len-1000-2000 data_files: - split: train path: msmarco_len-1000-2000/train-* - config_name: msmarco_len-2000-3000 data_files: - split: train path: msmarco_len-2000-3000/train-* - config_name: msmarco_len-3000-4000 data_files: - split: train path: msmarco_len-3000-4000/train-* - config_name: msmarco_len-4000-5000 data_files: - split: train path: msmarco_len-4000-5000/train-* - config_name: msmarco_len-500-1000 data_files: - split: train path: msmarco_len-500-1000/train-* - config_name: 
msmarco_len-5000-6000 data_files: - split: train path: msmarco_len-5000-6000/train-* - config_name: msmarco_len-6000-7000 data_files: - split: train path: msmarco_len-6000-7000/train-* - config_name: msmarco_len-7000-inf data_files: - split: train path: msmarco_len-7000-inf/train-* - config_name: nli_for_simcse_len-0-500 data_files: - split: train path: nli_for_simcse_len-0-500/train-* - config_name: nq_len-0-500 data_files: - split: train path: nq_len-0-500/train-* - config_name: nq_len-500-1000 data_files: - split: train path: nq_len-500-1000/train-* - config_name: pubmed_qa_labeled_len-0-500 data_files: - split: train path: pubmed_qa_labeled_len-0-500/train-* - config_name: squad_len-0-500 data_files: - split: train path: squad_len-0-500/train-* - config_name: squad_len-500-1000 data_files: - split: train path: squad_len-500-1000/train-* - config_name: t2ranking_len-0-500 data_files: - split: train path: t2ranking_len-0-500/train-* - config_name: t2ranking_len-500-1000 data_files: - split: train path: t2ranking_len-500-1000/train-* - config_name: trivia_len-0-500 data_files: - split: train path: trivia_len-0-500/train-* - config_name: trivia_len-1000-2000 data_files: - split: train path: trivia_len-1000-2000/train-* - config_name: trivia_len-500-1000 data_files: - split: train path: trivia_len-500-1000/train-*
---

# hotchpotch/bge-m3-data-finetune-unified

Mirror of the original `Shitao/bge-m3-data` corpus, repackaged from gzip-compressed JSONL shards into Hugging Face Dataset configs stored as Parquet. Content is unchanged; this repo only standardizes the storage format and centralizes all length buckets under one dataset namespace.

By using this dataset, which includes up to seven hard negative texts per query (as in the original bge-m3 training data), you can easily fine-tune retrieval models or other search systems.
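As a rough illustration of how these rows might feed a fine-tuning loop, the sketch below expands one row into (query, positive, negative) triplets. It assumes only the `query`, `pos`, and `neg` fields of the dataset; the helper name `iter_triplets`, the `max_negatives` parameter, and the sample row are hypothetical and not part of this repo or of any training library.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple, Union

# A row as stored in this dataset: one query plus lists of passages.
Row = Dict[str, Union[str, List[str]]]


def iter_triplets(row: Row, max_negatives: int = 7) -> Iterator[Tuple[str, str, str]]:
    """Yield every (query, positive, negative) combination for one row.

    `max_negatives` is a hypothetical knob; 7 mirrors the "up to seven
    hard negatives per query" stated in the README.
    """
    query = row["query"]
    negatives = row["neg"][:max_negatives]
    for pos, neg in product(row["pos"], negatives):
        yield (query, pos, neg)


# Made-up row for illustration only; real rows come from load_dataset(...).
example_row: Row = {
    "query": "example query",
    "pos": ["a relevant passage"],
    "neg": ["an irrelevant passage", "another irrelevant passage"],
}

triplets = list(iter_triplets(example_row))
for t in triplets:
    print(t)
```

Whether to enumerate all positive/negative pairs or to sample one negative per step depends on the loss and trainer in use; this sketch simply enumerates them.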
## Provenance and conversion

- Source: [Shitao/bge-m3-data](https://huggingface.co/datasets/Shitao/bge-m3-data), which hosts the official fine-tuning data for the BGE-M3 model.
- Formatting changes: Only the container format changed (JSONL → HF Dataset → Parquet on the Hub). Field names and values are identical to the source.

## Schema

Each config keeps the original triplet fields:

- `query`: string
- `pos`: list of positive passages (strings)
- `neg`: list of negative passages (strings)

## Using the dataset

```python
from datasets import load_dataset

ds = load_dataset("hotchpotch/bge-m3-data-finetune-unified", "miracl_en_len-0-500")
row = ds["train"][0]

# Collapse newlines and truncate long passages to 50 characters for display.
preview = lambda t: t.replace("\n", " ")[:50] + ("..." if len(t) > 50 else "")

print("query:", row["query"])
print("pos[0]:", preview(row["pos"][0]))
for i, n in enumerate(row["neg"][:3]):  # show first 3 negatives
    print(f"neg[{i}]:", preview(n))
```

Example output:

```
query: When was quantum field theory developed?
pos[0]: History of quantum field theory The third thread i...
neg[0]: AdS/CFT correspondence In quantum field theory, on...
neg[1]: Condensed matter physics The Sommerfeld model and ...
neg[2]: Quantum configuration space In quantum field theor...
```

# Original Dataset Summary

This repository contains all the fine-tuning data for the [bge-m3](https://huggingface.co/BAAI/bge-m3) model, including:

| Dataset         |   Language   |
| --------------- | :----------: |
| MS MARCO        |   English    |
| NQ              |   English    |
| HotpotQA        |   English    |
| TriviaQA        |   English    |
| SQuAD           |   English    |
| COLIEE          |   English    |
| PubMedQA        |   English    |
| NLI from SimCSE |   English    |
| DuReader        |   Chinese    |
| mMARCO-zh       |   Chinese    |
| T2Ranking       |   Chinese    |
| Law-GPT         |   Chinese    |
| cMedQAv2        |   Chinese    |
| NLI-zh          |   Chinese    |
| LeCaRDv2        |   Chinese    |
| Mr.TyDi         | 11 languages |
| MIRACL          | 16 languages |
| MLDR            | 13 languages |

Note: The MLDR dataset here is the processed `train` set of the [MLDR dataset](https://huggingface.co/datasets/Shitao/MLDR).

For more details, please refer to our [paper](https://arxiv.org/pdf/2402.03216.pdf).

# Dataset Structure

Each dataset has been split into multiple files according to the tokenized length of the text (tokenizer of bge-m3, i.e. tokenizer of [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)). For example, the MS MARCO dataset has been split into 9 files: `msmarco_len-0-500.jsonl`, `msmarco_len-500-1000.jsonl`, ..., `msmarco_len-6000-7000.jsonl`, `msmarco_len-7000-inf.jsonl`.

All the files are in the `jsonl` format. Each line of a file is a JSON object of the following form:

```python
{"query": str, "pos": List[str], "neg": List[str]}
```

# Citation Information

```
@misc{bge-m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
  year={2024},
  eprint={2402.03216},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
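
The length-bucket naming described under Dataset Structure (`<source>_len-<low>-<high>`, with `inf` as an open upper bound) can be parsed mechanically, which is handy because lexicographic order puts `len-500-1000` after `len-4000-5000`. The following is a minimal sketch; the function name `parse_config_name` and the regular expression are assumptions inferred from the config names listed in this card, not an API provided by the dataset.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from names such as "msmarco_len-0-500" or
# "mr-tydi_arabic_len-7000-inf"; it is an assumption, not an official spec.
_BUCKET_RE = re.compile(r"^(?P<source>.+)_len-(?P<low>\d+)-(?P<high>\d+|inf)$")


def parse_config_name(name: str) -> Tuple[str, int, Optional[int]]:
    """Split a config name into (source, low, high); high is None for 'inf'."""
    match = _BUCKET_RE.match(name)
    if match is None:
        raise ValueError(f"not a length-bucketed config name: {name!r}")
    high = None if match["high"] == "inf" else int(match["high"])
    return match["source"], int(match["low"]), high


# Sort a few MS MARCO configs by their numeric lower bound.
names = ["msmarco_len-500-1000", "msmarco_len-7000-inf", "msmarco_len-0-500"]
ordered = sorted(names, key=lambda n: parse_config_name(n)[1])
print(ordered)
```

Sorting on the parsed lower bound gives the buckets in increasing length order, which a plain string sort of the config names does not.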

Provider: hotchpotch