hotchpotch/mmarco-hard-negatives-reranker-filtered

Hugging Face · Updated 2026-01-12 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/hotchpotch/mmarco-hard-negatives-reranker-filtered
Description:
Dataset metadata, from the card's `dataset_info` and `configs` front matter. Every config has a single `train` split stored under `{config_name}/train-*`, and `dataset_size` equals `num_bytes` for every config. In the front matter, all text columns are `string`, `negs_text` is a list of `string`, `negs_count` is `int32`, `pos_score` is `float32`, and `negs_score` is a list of `float32`.

| Config | `num_examples` | `num_bytes` | `download_size` |
|---|---:|---:|---:|
| `arabic-hard-negatives` | 349518 | 2113494813 | 989078789 |
| `arabic-hard-negatives-7` | 299044 | 1292089603 | 638550242 |
| `arabic-triplet` | 349518 | 400217378 | 200344021 |
| `arabic-triplet-10` | 3031778 | 3464493625 | 943959375 |
| `arabic-triplet-all` | 3546380 | 4047699539 | 1073051129 |
| `chinese-hard-negatives` | 383313 | 2216454702 | 1359075674 |
| `chinese-hard-negatives-7` | 370984 | 927271103 | 618463240 |
| `chinese-triplet` | 383313 | 252510559 | 171058848 |
| `chinese-triplet-10` | 3729432 | 2455032272 | 863389567 |
| `chinese-triplet-all` | 6683870 | 4395417304 | 1422492995 |
| `dutch-hard-negatives` | 371879 | 2174002796 | 1212093655 |
| `dutch-hard-negatives-7` | 354231 | 1091922772 | 652686790 |
| `dutch-triplet` | 371879 | 303033800 | 183795674 |
| `dutch-triplet-10` | 3551107 | 2891961008 | 923122947 |
| `dutch-triplet-all` | 5258282 | 4274542464 | 1287016546 |
| `english-hard-negatives` | 399075 | 2324505943 | 1306880603 |
| `english-hard-negatives-7` | 383872 | 1081053381 | 655650453 |
| `english-triplet` | 399075 | 296175314 | 182842216 |
| `english-triplet-10` | 3852858 | 2857984730 | 923091822 |
| `english-triplet-all` | 6185133 | 4580123031 | 1378658598 |
| `french-hard-negatives` | 375562 | 2190245299 | 1184166311 |
| `french-hard-negatives-7` | 351278 | 1180087427 | 683217441 |
| `french-triplet` | 375562 | 333976677 | 196193611 |
| `french-triplet-10` | 3521407 | 3127811745 | 968642087 |
| `french-triplet-all` | 4827864 | 4280184877 | 1264534665 |
| `german-hard-negatives` | 362195 | 2130821555 | 1191049599 |
| `german-hard-negatives-7` | 343891 | 1098128571 | 658325690 |
| `german-triplet` | 362195 | 305643548 | 186057867 |
| `german-triplet-10` | 3443904 | 2904631870 | 929962814 |
| `german-triplet-all` | 4964819 | 4180552773 | 1265443454 |
| `indonesian-hard-negatives` | 373869 | 2167660896 | 1143622995 |
| `indonesian-hard-negatives-7` | 356143 | 1070929143 | 608417256 |
| `indonesian-triplet` | 373869 | 297191098 | 171494207 |
| `indonesian-triplet-10` | 3573699 | 2839015603 | 861180271 |
| `indonesian-triplet-all` | 5380840 | 4267409210 | 1217482223 |
| `italian-hard-negatives` | 373979 | 2167846787 | 1204883037 |
| `italian-hard-negatives-7` | 353540 | 1117766793 | 666621653 |
| `italian-triplet` | 373979 | 312908464 | 189495198 |
| `italian-triplet-10` | 3545498 | 2964847787 | 943341257 |
| `italian-triplet-all` | 5099619 | 4256926173 | 1280973836 |
| `japanese-hard-negatives` | 357351 | 2144415650 | 1080761827 |
| `japanese-hard-negatives-7` | 331773 | 1203518881 | 648812107 |
| `japanese-triplet` | 357351 | 341285758 | 187201236 |
| `japanese-triplet-10` | 3317556 | 3168329044 | 922432699 |
| `japanese-triplet-all` | 4354539 | 4156658204 | 1159057487 |
| `spanish-hard-negatives` | 381323 | 2200508708 | 1188936798 |
| `spanish-hard-negatives-7` | 356969 | 1167774418 | 676069637 |
| `spanish-triplet` | 381323 | 330337739 | 194281213 |
| `spanish-triplet-10` | 3581475 | 3099212505 | 958736780 |
| `spanish-triplet-all` | 4987393 | 4307501828 | 1268895341 |

# mMARCO Reranker-Filtered Hard Negatives (Multilingual)

## Overview

This dataset is built from [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) (multilingual MS MARCO) triplets for each language subset. For each (query, positive), hard negatives are bundled and then filtered using cross-encoder re-scoring. The goal is to remove negatives that are too strong or incorrect for training. The same procedure is applied to all language subsets.

The dataset is published as `mmarco-hard-negatives-reranker-filtered` with config names `{lang}-{variant}`. `{lang}` is the language subset name (e.g., `japanese`), and `{variant}` is one of the following. The pair format is not included in the public release.

### 1) `{lang}-hard-negatives`

The filtered hard negatives as-is.

Columns: `query: str`, `pos_text: str`, `negs_text: list[str]`, `negs_count: int`, `pos_score: float`, `negs_score: list[float]`

### 2) `{lang}-triplet`

For each `(query, pos_text)`, one negative is randomly selected and converted into a `(query, positive, negative)` triplet.

Columns: `query: str`, `positive: str`, `negative: str`

### 3) `{lang}-triplet-10`

For each `(query, pos_text)`, up to 10 negatives are randomly sampled, and each is expanded into a `(query, positive, negative)` triplet.

Columns: `query: str`, `positive: str`, `negative: str`

### 4) `{lang}-triplet-all`

All negatives in `negs_text` are expanded into `(query, positive, negative)` triplets.

Columns: `query: str`, `positive: str`, `negative: str`

### 5) `{lang}-hard-negatives-7`

Only records with at least 7 negatives are kept. Then 7 negatives are randomly selected and stored as `negative_1..negative_7`.

Columns: `query: str`, `positive: str`, `negative_1: str`, `negative_2: str`, `negative_3: str`, `negative_4: str`, `negative_5: str`, `negative_6: str`, `negative_7: str`

## Source data

- Dataset: `unicamp-dl/mmarco`
- Revision: `refs/convert/parquet` (parquet-converted version)
- Target subsets: all language subsets available under `refs/convert/parquet`
- Split: partial train Parquet for each language (`{lang}/partial/train/*.parquet` or `{lang}/partial-train/*.parquet`)
- Main columns in source: `query`, `positive`, `negative`

## Construction procedure (reproducible processing)

The following steps reproduce the dataset. We describe the processing itself rather than local scripts or environments.

### 1. Aggregate triplets into hard-negative bundles

1. Load all partial train Parquet files for each language subset.
2. Keep only rows where `query`, `positive`, and `negative` are all present.
3. Group by `(query, positive)` and deduplicate negatives with a set.
4. For each `(query, positive)`, create a record:
   - `query`: string
   - `pos_text`: `positive`
   - `negs_text`: unique list of negatives for that `(query, positive)` (sorted for determinism)

### 2. Cross-encoder re-scoring

Score `(query, text)` pairs using:

- Model: `BAAI/bge-reranker-v2-m3` (Cross-Encoder)
- Max length: 512 tokens
- No quantization or distillation; standard inference in bf16

For each record:

1. Score `(query, pos_text)` → `pos_score`
2. Score `(query, neg)` for each `negs_text` → `negs_score` (same order as `negs_text`)

### 3. Filtering conditions

The reranker-score filtering here is implemented with reference to the approach in [ruri-v3-dataset-reranker](https://huggingface.co/datasets/cl-nagoya/ruri-v3-dataset-reranker).

Keep a record only if all conditions hold:

- `pos_score > 0.3`
- keep only negatives with `neg_score < 0.7`
- at least 1 negative remains after filtering

Save the remaining negative count as `negs_count`.

## Output columns

- `query` (string)
- `pos_text` (string)
- `negs_text` (list[string])
- `negs_count` (int)
- `pos_score` (float)
- `negs_score` (list[float])

`negs_score` follows the same order as `negs_text`.

## License

Follows the original mMARCO license.
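
## Illustrative sketch of the filtering step

The filtering conditions and the `{lang}-triplet-all` expansion described earlier can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the author's actual script: the function names `filter_record` and `expand_all_triplets` are hypothetical, and records are assumed to be plain dictionaries that already carry `pos_score` and `negs_score` from the re-scoring step. Only the thresholds (0.3 and 0.7) and the column names come from the card.

```python
from typing import Optional

POS_MIN = 0.3  # from the card: keep a record only if pos_score > 0.3
NEG_MAX = 0.7  # from the card: keep only negatives with neg_score < 0.7


def filter_record(record: dict) -> Optional[dict]:
    """Hypothetical helper: apply the three filtering conditions to one record.

    Returns the filtered record, or None if the record should be dropped.
    """
    # Condition 1: the positive must score strictly above the threshold.
    if not record["pos_score"] > POS_MIN:
        return None

    # Condition 2: drop negatives scoring at or above the upper threshold,
    # keeping texts and scores aligned in their original order.
    kept = [
        (text, score)
        for text, score in zip(record["negs_text"], record["negs_score"])
        if score < NEG_MAX
    ]

    # Condition 3: at least one negative must remain.
    if not kept:
        return None

    return {
        "query": record["query"],
        "pos_text": record["pos_text"],
        "negs_text": [text for text, _ in kept],
        "negs_count": len(kept),
        "pos_score": record["pos_score"],
        "negs_score": [score for _, score in kept],
    }


def expand_all_triplets(record: dict) -> list:
    """Hypothetical helper: expand every negative into one triplet row."""
    return [
        {"query": record["query"], "positive": record["pos_text"], "negative": neg}
        for neg in record["negs_text"]
    ]


# Made-up example record, for illustration only.
example = {
    "query": "q",
    "pos_text": "relevant passage",
    "negs_text": ["neg a", "neg b", "neg c"],
    "pos_score": 0.9,
    "negs_score": [0.1, 0.8, 0.69],
}
filtered = filter_record(example)
triplets = expand_all_triplets(filtered)
```

In this made-up example, the negative scored 0.8 is discarded as too close to a positive, so two negatives remain and two triplets are produced. The random sampling used for the `{lang}-triplet`, `{lang}-triplet-10`, and `{lang}-hard-negatives-7` variants is not shown, because the card does not specify the sampling seed or method.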
Provider:
hotchpotch