msmarco

ModelScope Community · Updated 2026-01-06 · Indexed 2025-02-22
Download link:
https://modelscope.cn/datasets/sentence-transformers/msmarco
Description:
# MS MARCO Training Dataset

This dataset consists of 6 separate datasets, each using the MS MARCO queries and passages:

* `triplets`: This subset contains triplets of query-id, positive-id, negative-id as provided in `qidpidtriples.train.full.2.tsv.gz` from the MS MARCO Website. The only change is that this dataset has been reshuffled. This dataset can easily be used with a `MultipleNegativesRankingLoss`, a.k.a. InfoNCE loss.
* `labeled-list`: This subset contains triplets of query-id, doc-ids, labels, i.e. every query is matched with every document it is paired with in the `triplets` subset, with the labels column containing a list denoting which doc_ids represent positives and which ones represent negatives.
* `bert-ensemble-mse`: This subset contains (query-id, passage-id) tuples with a score. This score is from the BERT_CAT Ensemble from [Hofstätter et al. 2020](https://zenodo.org/records/4068216), and can easily be used with an `MSELoss` to train an embedding or reranker model via distillation.
* `bert-ensemble-margin-mse`: This subset contains triplets with a score, such that the score is `ensemble_score(query, positive) - ensemble_score(query, negative)`, also from the BERT_CAT Ensemble from [Hofstätter et al. 2020](https://zenodo.org/records/4068216). It can easily be used with a `MarginMSELoss` to train an embedding or reranker model via distillation.
* `rankgpt4-colbert`: This subset contains a RankGPT4 reranking of the top 100 MS MARCO passages retrieved by ColBERTv2. This ranking was compiled by [Schlatt et al. 2024](https://zenodo.org/records/11147862).
* `rankzephyr-colbert`: This subset contains a RankZephyr reranking of the top 100 MS MARCO passages retrieved by ColBERTv2. This ranking was compiled by [Schlatt et al. 2024](https://zenodo.org/records/11147862).

For all datasets, the IDs can be converted to real texts using the `queries` and `corpus` subsets.

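As a minimal sketch of that ID-to-text conversion, the snippet below uses plain dictionaries in place of the `queries` and `corpus` subsets. The placeholder rows and the helper names `build_lookup` and `resolve_triplet` are illustrative assumptions; they are not part of the dataset or of any library.

```python
from typing import Dict, List


def build_lookup(rows: List[dict], id_key: str, text_key: str) -> Dict[str, str]:
    """Map each ID to its text, e.g. passage_id -> passage."""
    return {row[id_key]: row[text_key] for row in rows}


def resolve_triplet(triplet: dict, queries: Dict[str, str], corpus: Dict[str, str]) -> dict:
    """Replace the three IDs of a `triplets`-style row with their texts."""
    return {
        "query": queries[triplet["query_id"]],
        "positive": corpus[triplet["positive_id"]],
        "negative": corpus[triplet["negative_id"]],
    }


# Placeholder rows shaped like the `queries` and `corpus` subsets.
# The IDs and texts are made up for illustration.
query_rows = [{"query_id": "q1", "query": "placeholder query text"}]
corpus_rows = [
    {"passage_id": "p1", "passage": "placeholder relevant passage"},
    {"passage_id": "p2", "passage": "placeholder non-relevant passage"},
]
triplet = {"query_id": "q1", "positive_id": "p1", "negative_id": "p2"}

queries = build_lookup(query_rows, "query_id", "query")
corpus = build_lookup(corpus_rows, "passage_id", "passage")
resolved = resolve_triplet(triplet, queries, corpus)
print(resolved)
```

Building the two lookups once and reusing them keeps each conversion a constant-time dictionary access, which matters when a single `labeled-list` row can reference around 1000 passages.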
## Dataset Subsets

### `corpus` subset

* Columns: "passage_id", "passage"
* Column types: `str`, `str`
* Examples:
    ```python
    {
        "passage_id": "0",
        "passage": "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.",
    }
    ```
* Collection strategy: Reading `collection.tar.gz` from MS MARCO.

### `queries` subset

* Columns: "query_id", "query"
* Column types: `str`, `str`
* Examples:
    ```python
    {
        "query_id": "121352",
        "query": "define extreme",
    }
    ```
* Collection strategy: Reading `queries.tar.gz` from MS MARCO.

### `triplets` subset

* Columns: "query_id", "positive_id", "negative_id"
* Column types: `str`, `str`, `str`
* Examples:
    ```python
    {
        "query_id": "395861",
        "positive_id": "1185464",
        "negative_id": "6162229",
    }
    ```
* Collection strategy: Reading `qidpidtriples.train.full.2.tsv.gz` from MS MARCO and shuffling the dataset rows.

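The `labeled-list` subset is described as the `triplets` rows grouped by query. A rough sketch of such a grouping is shown here with made-up IDs. The function name `group_triplets`, the de-duplication, and the choice to list positives before negatives are assumptions for illustration, not the authors' actual procedure.

```python
from typing import Dict, List


def group_triplets(triplets: List[dict]) -> List[dict]:
    """Collapse (query, positive, negative) rows into one row per query."""
    # Dicts are used as insertion-ordered sets of document IDs.
    positives: Dict[str, Dict[str, None]] = {}
    negatives: Dict[str, Dict[str, None]] = {}
    for row in triplets:
        qid = row["query_id"]
        positives.setdefault(qid, {})[row["positive_id"]] = None
        negatives.setdefault(qid, {})[row["negative_id"]] = None

    grouped = []
    for qid, pos in positives.items():
        # A document seen as a positive is never also emitted as a negative.
        neg = [doc for doc in negatives[qid] if doc not in pos]
        grouped.append(
            {
                "query_id": qid,
                "doc_ids": list(pos) + neg,
                "labels": [1] * len(pos) + [0] * len(neg),
            }
        )
    return grouped


# Made-up triplets: query "q1" appears twice with the same positive.
toy_triplets = [
    {"query_id": "q1", "positive_id": "p1", "negative_id": "n1"},
    {"query_id": "q1", "positive_id": "p1", "negative_id": "n2"},
    {"query_id": "q2", "positive_id": "p2", "negative_id": "n1"},
]
rows = group_triplets(toy_triplets)
print(rows)
```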
### `labeled-list` subset

* Columns: "query_id", "doc_ids", "labels"
* Column types: `str`, `List[str]`, `List[int]`
* Examples: ```python { "query_id": "100", "doc_ids": ["3837260", "7854412", "4778006", "7929416", "5833477", "2715823", "903728", "1418399", "2544108", "4592808", "3565885", "260356", "5885724", "2976754", "3530456", "903722", "5136237", "6166367", "5372728", "6166373", "1615726", "5909725", "3278290", "570067", "2628703", "3619930", "3282101", "570061", "1442855", "5293099", "3976606", "3542912", "4358422", "4729309", "3542156", "102825", "2141701", "5885727", "1007725", "5137341", "180070", "2107140", "4942724", "3915139", "7417638", "7426645", "393085", "3129231", "4905980", "3181468", "7218730", "7159323", "5071423", "1609775", "3476284", "2876976", "6064616", "2752167", "5833480", "5451115", "6052155", "6551293", "2710795", "3231730", "1111340", "7885924", "2822828", "3034062", "3515232", "987726", "3129232", "4066994", "3680517", "6560480", "4584385", "5786855", "6117953", "8788361", "1960434", "212333", "7596616", "8433601", "3070543", "3282099", "5559299", "4070401", "5728025", "4584386", "8614523", "7452451", "3059713", "6401629", "6226845", "2710798", "458688", "207737", "5947749", "1615249", "5054795", "6646948", "4222935", "570068", "5860279", "8411096", "2882722", "3660786", "4711085", "4895219", "4929884", "5615159", "6845998", "1460725", "4433443", "5833479", "3542152", "2565332", "6311315", "4021935", "2616000", "7274494", "5241413", "6259470", "1488609", "187116", "8269268", "2399643", "2711733", "987725", "8788355", "7162594", "1164463", "5546714", "180065", "8467768", "4732466", "63493", "2595189", "3314126", "7426649", "856238", "7266280", "7745447", "7900007", "5035510", "373356", "2615996", "987731", "2904166", "4021933", "8149937", "5786854", "3542915", "1922026", "2743264", "4021930", "2821183", "3359767", "2686007", "6241706", "2544107", "2565334", "3070862", "5673424", "1868516", "879518", "2710801", "2878133", "5506342", 
"7279044", "260357", "1418394", "4198047", "8811927", "6447579", "3187998", "8489919", "2876695", "4641223", "5095750", "1366894", "5343128", "4167730", "4041435", "5676056", "6979590", "8763883", "5915554", "5060317", "8214795", "4932622", "4147294", "6546696", "3909088", "3397816", "4592804", "2268176", "328471", "6695311", "4090950", "2605356", "442753", "2978405", "890707", "3712000", "7227702", "1753582", "3582358", "8091295", "2601271", "3417484", "3450889", "3381536", "8788358", "4869670", "2969334", "8584693", "3026231", "4616200", "4967138", "1668186", "4346365", "4040376", "4655172", "6659144", "3241644", "4337017", "6733817", "8488585", "2701398", "987728", "7021467", "4879063", "5449524", "4043058", "7876390", "3708326", "3202726", "6267835", "7452454", "4111901", "4584380", "2898746", "1770226", "5786858", "2904167", "3767056", "3837262", "8696128", "8714806", "5974586", "4770734", "8614528", "6715004", "5559298", "5522820", "4494346", "4802607", "3505959", "4943876", "5762512", "7900010", "7614375", "641869", "611056", "1620088", "7044504", "5903693", "6470341", "5885731", "2411293", "1729708", "2723955", "4684463", "7632692", "7300912", "570062", "5786857", "1729712", "2859721", "224598", "8049838", "8757368", "2553525", "4276769", "3476280", "5673427", "6196257", "3529315", "1042349", "1008571", "604128", "1274276", "6976077", "558781", "1417835", "8746383", "6534823", "2544102", "4892920", "5326560", "3529311", "8288714", "8410908", "7541381", "8276461", "443963", "1418786", "393082", "2876973", "2041868", "4684460", "2553455", "4294336", "1770227", "1396675", "4821482", "341684", "3317707", "7758155", "1680750", "978378", "4641573", "6447578", "3572351", "7074859", "6560473", "2059066", "7681590", "6241703", "1425182", "941495", "4898655", "2710799", "3694312", "2565339", "5886217", "6997080", "570064", "5697987", "4058317", "3059711", "5540787", "4914280", "413609", "8149940", "2828604", "903721", "4130056", "7126261", "4294342", "4357509", 
"2041870", "3537437", "1274279", "442759", "3574934", "7007240", "2828792", "4040360", "5504280", "7803953", "6668972", "1698637", "4639591", "782719", "2144188", "3562506", "5287734", "6183651", "7048806", "2628701", "3282102", "7428497", "8503034", "7173876", "3910109", "7900005", "2929050", "2422821", "1753368", "4639589", "7098652", "7969224", "1640132", "6182438", "4981517", "478505", "3404202", "2469894", "5422545", "1164461", "563620", "8602235", "6905110", "260355", "1928946", "2970078", "903729", "2943399", "6990940", "4378415", "3488844", "4748532", "2660195", "357356", "429500", "1729469", "6936575", "3837268", "7133186", "4214920", "5372162", "3428653", "5209141", "6117958", "8165720", "6715084", "6220994", "6801444", "4791658", "4778011", "2553534", "3905953", "6102139", "2370329", "6668971", "2828788", "2844459", "2041872", "4270591", "926981", "8165084", "4381190", "7740134", "8592605", "5156554", "1993432", "2904162", "3837261", "4641221", "4609663", "1925807", "2059063", "5168436", "401623", "1854833", "4655167", "4127921", "4584387", "1425176", "7212830", "4045409", "2863533", "6718879", "3278292", "2244466", "5161597", "1164462", "5870980", "7883558", "3129235", "3837265", "7476186", "3161580", "5449523", "519516", "685140", "5343127", "3304414", "7758154", "235363", "5095754", "4112274", "7300909", "8592603", "8035150", "6052157", "6307419", "207739", "6220993", "386576", "1425177", "1709605", "7562137", "3417479", "987733", "1113623", "5885728", "7816675", "5559303", "424024", "7452453", "7300910", "7072423", "3359765", "6990938", "872482", "4892919", "4942267", "987179", "1396676", "4647425", "6026592", "8430156", "8415731", "2059058", "7949973", "8714805", "7160656", "3282107", "6430813", "4624121", "8614526", "6560476", "5904531", "1736796", "2943403", "8614524", "3856628", "5425825", "4301955", "1960428", "4198046", "1319052", "1236547", "6064613", "2544106", "6226846", "3407251", "7101275", "3928646", "4932629", "4641222", "1770224", 
"2864823", "5559302", "4791657", "1086512", "6385449", "5927021", "7553032", "260359", "3059706", "4592809", "2504367", "5572084", "3231724", "3542151", "3419457", "7460828", "4778008", "6695308", "6285584", "2562236", "5449527", "3083530", "7264931", "5934860", "2615997", "1425180", "6447581", "3474330", "6063973", "903730", "6395414", "8763555", "3841369", "6733815", "2735696", "811438", "3409169", "1735575", "5148194", "4502897", "926980", "3717515", "572995", "903726", "63492", "3059708", "2951653", "6751200", "3951499", "7402067", "6692933", "5559296", "1636630", "6408893", "5483639", "2876704", "4734029", "8091999", "393083", "3529312", "6953126", "2411292", "6904243", "5577720", "21827", "4188071", "3070541", "655643", "4294334", "1922023", "184834", "1909625", "7403766", "6171714", "6615681", "3282108", "3059178", "3220639", "2710796", "8049834", "1534446", "146937", "2904170", "4781986", "1247371", "6042881", "3538804", "1210555", "2020084", "3129233", "5007960", "1922027", "317861", "7226653", "207734", "6733814", "6214585", "5373863", "2272409", "1753576", "5286536", "903723", "2616001", "1620083", "3861147", "442756", "2475844", "3145217", "599694", "1799545", "2553531", "1573123", "4276764", "4674924", "2904164", "6052158", "1735568", "6560481", "808526", "2544103", "2615640", "4940921", "8714801", "3070861", "2598167", "1358786", "2635123", "2977588", "4648965", "1425178", "7888864", "8744226", "4630697", "5372731", "4632610", "6990933", "6510260", "1733784", "519518", "1922019", "1425175", "925506", "5286618", "3523560", "1960052", "6226842", "2735698", "6938823", "3708010", "4774546", "7831509", "2616002", "8138906", "8048048", "4375739", "1113853", "7831797", "8430159", "2617359", "7197980", "890700", "4198045", "1922022", "6480031", "4778005", "5434672", "5106779", "1113854", "8782468", "4778014", "4584382", "2904163", "1319048", "3746541", "7197976", "1729706", "207736", "7700674", "442751", "2876702", "8366443", "5148192", "872477", "2601268", 
"8426337", "7274672", "6724762", "3282100", "6844846", "2411294", "2437024", "413607", "6214583", "7241863", "5060319", "1733789", "5477181", "3841373", "599686", "1822109", "6016976", "7710638", "2969339", "317867", "3889980", "6202558", "1922028", "3068679", "6267107", "4778010", "1164456", "2949959", "3770664", "8830401", "5447370", "2654076", "3837264", "514105", "1623873", "2667055", "1706867", "7394057", "4914279", "6179258", "7241866", "7606221", "863939", "572998", "6307420", "3278288", "7636863", "1833550", "7300908", "6659143", "903724", "7274675", "2041874", "7274491", "6052159", "6064611", "5315118", "7452457", "1733790", "7085674", "5607469", "6715080", "7528693", "6151450", "6936572", "8291592", "6052154", "2628704", "5094428", "2544101", "5209142", "2448798", "4898658", "447958", "3907349", "3837266", "1729705", "2615998" "5777752", "4641228", "4767985", "1964628", "5559295", "8344779", "1481317", "570063", "5422544", "4793008", "2544105", "7288672", "4767982", "8342090", "7160653", "7300914", "1206739", "5425329", "8325658", "6052156", "6668963", "442755", "2240500", "4647422", "6878782", "4040434", "4358312", "8204234", "8512015", "63496", "7229500", "7182554", "6151448", "5974582", "7173877", "8092000", "7212835", "3199491", "393998", "8035146", "1396680", "2735697", "925501", "1922025", "5885730", "1729713", "2755336", "7138459", "570070", "4021934", "903727", "6268913", "4739388", "7241868", "4030464", "2951657", "5974587", "1332617", "3869948", "8705052", "7266276", "7900253", "1614143", "5422548", "1272014", "7861061", "1998585", "3359677", "3985679", "3359768", "519514", "3841374", "3120523", "4021936", "1729707", "1440281", "558784", "5432868", "3205742", "4632606", "5950757", "8419081", "1733791", "4684461", "8049837", "1586152", "4045410", "6447577", "5521525", "2335443", "3311387", "4045403", "1922020", "7488188", "3713391", "1425179", "2859933", "1941737", "3557852", "6361227", "7226652", "5596636", "6165753", "4021931", "8614521", 
"3542911", "3010907", "3059705", "545587", "5969529", "6560475", "4378414", "570065", "8426345", "7892254", "3059707", "1922024", "1440190", "8149931", "7197977", "2283513", "2978402", "1137959", "3542917", "1615721", "4294340", "3282103", "7976948", "5227466", "2411296", "3492171", "6560474", "977993", "1846450", "8101033", "4356464", "7887336", "36611", "7203435", "5564079", "224596", "7300911", "5886218", "6165751", "4462443", "4684459", "3417478", "2951654", "3186729", "3239802", "3708321", "7664369", "2107143", "6300526", "3289374", "3059710", "4147291", "4641226", "2274293", "5559300", "3770671", "8398154", "488886", "6241704", "2555898", "3488845", "3630018", "1677477", "4639587", "2601269", "1693298", "5424209", "7816672", "8788359", "3428649", "3231726", "2710794", "4329499", "8115452", "6259463", "6285580", "8605584", "3059709", "4270589", "2555856", "2424358", "4211101", "1922021", "4432101"], "labels": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], } ``` * Collection strategy: Reading the `triplets` subset and 
grouping all triplets by query_id. The large majority of queries have exactly 1000 doc_ids, of which often only 1 is labeled positive. Up to 7 documents are labeled positive per query in the entire subset.

### `bert-ensemble-mse` subset

* Columns: "query_id", "passage_id", "score"
* Column types: `str`, `str`, `float64`
* Examples:
    ```python
    {
        "query_id": "400296",
        "passage_id": "1540783",
        "score": 6.624662,
    }
    ```
* Collection strategy: Reading the BERT_CAT Ensemble scores from [Hofstätter et al. 2020](https://zenodo.org/records/4068216).

### `bert-ensemble-margin-mse` subset

* Columns: "query_id", "positive_id", "negative_id", "score"
* Column types: `str`, `str`, `str`, `float64`
* Examples:
    ```python
    {
        "query_id": "400296",
        "positive_id": "1540783",
        "negative_id": "3518497",
        "score": 4.028059,
    }
    ```
* Collection strategy: Reading the BERT_CAT Ensemble scores from [Hofstätter et al. 2020](https://zenodo.org/records/4068216) and computing `score = pos_score - neg_score` for each triplet.

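To make the margin computation concrete, the sketch below works through the two example rows for query `400296`. It assumes that the positive passage `1540783` carries the same ensemble score (6.624662) in both subsets, so the negative's score is only implied, not stated in the card. The `margin_mse` function is a plain-Python illustration of a margin-based squared error with hypothetical student scores; it is not the `sentence-transformers` implementation.

```python
from typing import Sequence


def teacher_margin(pos_score: float, neg_score: float) -> float:
    """score = pos_score - neg_score, as in the collection strategy."""
    return pos_score - neg_score


def margin_mse(
    student_pos: Sequence[float],
    student_neg: Sequence[float],
    teacher_margins: Sequence[float],
) -> float:
    """Mean squared error between student margins and teacher margins."""
    errors = [
        ((sp - sn) - tm) ** 2
        for sp, sn, tm in zip(student_pos, student_neg, teacher_margins)
    ]
    return sum(errors) / len(errors)


pos_score = 6.624662  # `bert-ensemble-mse` example row
margin = 4.028059     # `bert-ensemble-margin-mse` example row

# Implied ensemble score of negative "3518497" (an inference, see lead-in).
implied_neg_score = pos_score - margin
print(round(implied_neg_score, 6))

# Hypothetical student scores for the same (query, positive, negative).
loss = margin_mse([5.0], [2.0], [margin])
print(loss)
```

Distilling the margin rather than the raw scores means the student only has to reproduce how much better the positive is than the negative, not the teacher's absolute score scale.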
### `rankgpt4-colbert` subset

* Columns: "query_id", "doc_ids"
* Column types: `str`, `list[str]`
* Examples:
    ```python
    {
        "query_id": "1002990",
        "doc_ids": ["3227617", "3227618", "2425847", "3290896", "6964111", "6136903", "6136902", "6136909", "2242080", "2425843", "3227616", "3227622", "4433358", "2625224", "1292817", "3151910", "3151908", "1292819", "2597066", "1292822", "2597061", "1292823", "1292821", "2242077", "7869866", "2242076", "6964112", "3227613", "3227614", "3227620", "8466240", "4503976", "2022084", "4503979", "5220703", "4274806", "4274800", "4274805", "4274799", "4274801", "3227621", "4433357", "4760228", "8801589", "4433356", "4274797", "5334021", "5019160", "4784355", "2625226", "4820159", "6136907", "6136908", "8743919", "2625222", "4261266", "2242079", "2242075", "2242078", "4760231", "4760233", "3305593", "6078688", "6136910", "8185538", "4357995", "2276483", "7752", "2104661", "7135886", "3151912", "3526055", "4252749", "4252745", "2731898", "2425844", "4433361", "531164", "3627638", "3627630", "2589697", "4252748", "3208439", "4760234", "2069200", "5024557", "2512795", "2845254", "7051021", "8516705", "3627631", "1629565", "4303606", "8679732", "4228604", "1006454", "4303602", "6136906", "6136905", "4433362"],
    }
    ```
* Collection strategy: Reading the `__rankgpt-colbert-2000-sampled-100__msmarco-passage-train-judged.run` file from https://zenodo.org/records/11147862, which was compiled by [Schlatt et al.](https://arxiv.org/abs/2405.07920).

### `rankzephyr-colbert` subset

* Columns: "query_id", "doc_ids"
* Column types: `str`, `list[str]`
* Examples:
    ```python
    {
        "query_id": "1002990",
        "doc_ids": ["3227618", "3227616", "3227617", "3227622", "2625224", "4433358", "3227621", "4433357", "7869866", "2242079", "2242075", "2242078", "6136907", "2425847", "4433356", "6136905", "6136906", "3227614", "3227613", "3227620", "2242076", "4760228", "2625226", "5334021", "1292823", "4760231", "1292821", "2242077", "2597061", "4433362", "4274805", "1292817", "3151908", "3151910", "2597066", "6136908", "6136902", "3290896", "4820159", "8801589", "4784355", "5019160", "4274800", "4274801", "4274806", "4274797", "6136903", "4760233", "5024557", "2512795", "6964112", "6964111", "2625222", "6078688", "4303606", "7051021", "4261266", "6136909", "8466240", "4503976", "7752", "2104661", "7135886", "3208439", "4228604", "8679732", "2022084", "4433361", "430360", "4760234", "4252745", "4252748"],
    }
    ```
* Collection strategy: Reading the `__rankzephyr-colbert-10000-sampled-100__msmarco-passage-train-judged.run` file from https://zenodo.org/records/11147862, which was compiled by [Schlatt et al.](https://arxiv.org/abs/2405.07920).
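Because both reranking subsets share the same row shape, two rankings of the same query can be compared directly. The sketch below takes the first five `doc_ids` from each example row for query `1002990` and measures how much the two rerankers agree at the top. The helper names `top_k_overlap` and `rank_of` are illustrative assumptions, not part of the dataset tooling.

```python
from typing import List


def top_k_overlap(ranking_a: List[str], ranking_b: List[str], k: int) -> int:
    """Number of documents that appear in the top k of both rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k]))


def rank_of(ranking: List[str], doc_id: str) -> int:
    """1-based position of doc_id in the ranking."""
    return ranking.index(doc_id) + 1


# First five entries of the two example rows for query "1002990".
rankgpt4_top5 = ["3227617", "3227618", "2425847", "3290896", "6964111"]
rankzephyr_top5 = ["3227618", "3227616", "3227617", "3227622", "2625224"]

overlap_at_3 = top_k_overlap(rankgpt4_top5, rankzephyr_top5, 3)
print(overlap_at_3)

# Where each reranker places passage "3227617".
print(rank_of(rankgpt4_top5, "3227617"), rank_of(rankzephyr_top5, "3227617"))
```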

Provider:
maas
Created:
2025-02-15