KaLM-embedding-finetuning-data

Name: KaLM-embedding-finetuning-data
Creator: maas
Published: 2026-01-08 10:08:38
License: No description available

ModelScope community. Updated 2026-01-08. Indexed 2025-11-15.

Download link:

https://modelscope.cn/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data


Resource description:

*The pretraining dataset is available at this link: [HIT-TMG/KaLM-embedding-pretrain-data](https://huggingface.co/datasets/HIT-TMG/KaLM-embedding-pretrain-data).*

## Languages

English, Chinese, Multilingual

## Dataset Structure

Each sample in the datasets has the following format:

- query, `string`, one query per sample
- pos, `list[string]`, usually containing one positive example
- neg, `list[string]`, usually containing seven negative examples

## Dataset Summary

All these datasets have been preprocessed and can be used for finetuning your embedding models.

| Source | Type | Category | Language | Pairs | Pairs (filtered) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| [CodeFeedback](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) | Retrieval | s2p | en | 50000 | 49090 |
| [ELI5](https://huggingface.co/datasets/rusano/ELI5_custom) | Retrieval | s2p | en | 100000 | 76408 |
| [ExpertQA](https://github.com/chaitanyamalaviya/ExpertQA) | Retrieval | s2p | en | 1261 | 1252 |
| [GooAQ](https://github.com/allenai/gooaq) | Retrieval | s2p | en | 50000 | 49833 |
| [MEDI2BGE](https://hf.co/datasets/GritLM/MEDI2BGE) | Retrieval | s2p | en | 100000 | 71790 |
| [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) | Retrieval | s2p | en | 40000 | 38623 |
| [PAQ](https://huggingface.co/datasets/sentence-transformers/paq) | Retrieval | s2p | en | 50000 | 49849 |
| [PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA) | Retrieval | s2p | en | 80000 | 79954 |
| [SearchQA](https://huggingface.co/datasets/kyunghyuncho/search_qa) | Retrieval | s2p | en | 10000 | 9988 |
| [arxiv_qa](https://huggingface.co/datasets/TitanMLData/arxiv_qa) | Retrieval | s2p | en | 23397 | 17927 |
| [CC-News](https://huggingface.co/datasets/intfloat/multilingual_cc_news) | Retrieval | s2p | en | 30000 | 28246 |
| [TREC-COVID](https://huggingface.co/datasets/irds/cord19_trec-covid) | Retrieval | s2p | en | 50000 | 48517 |
| [DBpedia-Entity](https://huggingface.co/datasets/BeIR/dbpedia-entity-generated-queries) | Retrieval | s2p | en | 100000 | 96792 |
| [ESCI](https://huggingface.co/datasets/tasksource/esci) | Retrieval | s2p | en | 30000 | 26043 |
| [FEVER](https://huggingface.co/datasets/maxzoech/fever) | Retrieval | s2p | en | 87855 | 87216 |
| [FiQA](https://huggingface.co/datasets/irds/beir_fiqa_train) | Retrieval | s2p | en | 5490 | 4689 |
| [HotpotQA](https://huggingface.co/datasets/hotpotqa/hotpot_qa) | Retrieval | s2p | en | 184057 | 150153 |
| [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Retrieval | s2p | en | 41434 | 31097 |
| [MSMARCO](https://huggingface.co/datasets/Tevatron/msmarco-passage) | Retrieval | s2p | en | 175133 | 174190 |
| [MSMARCO-v2](https://huggingface.co/datasets/mteb/msmarco-v2) | Retrieval | s2p | en | 277144 | 258617 |
| [NFCorpus](https://huggingface.co/datasets/BeIR/nfcorpus-generated-queries) | Retrieval | s2p | en | 10824 | 10471 |
| [rag-dataset-12000](https://huggingface.co/datasets/neural-bridge/rag-dataset-12000) | Retrieval | s2p | en | 9590 | 9272 |
| [SciFact](https://huggingface.co/datasets/Tevatron/scifact) | Retrieval | s2p | en | 809 | 794 |
| [SQuAD 2.0](https://huggingface.co/datasets/rajpurkar/squad_v2) | Retrieval | s2p | en | 130217 | 125816 |
| [TriviaQA](https://huggingface.co/datasets/multi-train/emb-triviaqa-train) | Retrieval | s2p | en | 52886 | 44442 |
| [WebGPT Comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons) | Retrieval | s2p | en | 19242 | 18924 |
| [Natural Questions](https://huggingface.co/datasets/Tevatron/wikipedia-nq) | Retrieval | s2p | en | 58622 | 56377 |
| [Yahoo Answers](https://huggingface.co/datasets/sentence-transformers/yahoo-answers) | Retrieval | s2p | en | 30000 | 21724 |
| [CQADupStack](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | Retrieval | s2p | en | 24045 | 7356 |
| [ContractNLI](https://huggingface.co/datasets/kiddothe2b/contract-nli) | STS | s2s | en | 3195 | 628 |
| [MultiNLI](https://huggingface.co/datasets/SetFit/mnli) | STS | s2s | en | 64674 | 63701 |
| [NLLB](https://huggingface.co/datasets/breakend/nllb-multi-domain) | STS | s2s | en | 36000 | 26504 |
| [Quora](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) | STS | s2s | en | 92674 | 89558 |
| [WikiAnswers](https://huggingface.co/datasets/multi-train/WikiAnswers_1107) | STS | s2s | en | 50000 | 47686 |
| [SimCSE NLI](https://huggingface.co/datasets/JeremiahZ/simcse_sup_nli) | STS | s2s | en | 252397 | 217099 |
| [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) | STS | s2s | en | 24686 | 16480 |
| [arXiv](https://huggingface.co/datasets/mteb/raw_arxiv) | Classification | s2s, p2s | en | 15000 | 14529 |
| [Biorxiv](https://huggingface.co/datasets/mteb/raw_biorxiv) | Classification | s2s, p2s | en | 6862 | 6787 |
| [Medrxiv](https://huggingface.co/datasets/mteb/raw_medrxiv) | Classification | s2s, p2s | en | 2012 | 1999 |
| [Reddit-Clustering](https://github.com/UKPLab/TWEAC-qa-agent-selection/tree/master/data/reddit/train) | Classification | s2s | en | 128000 | 25600 |
| [Reddit-Clustering-P2P](https://huggingface.co/datasets/sentence-transformers/reddit-title-body) | Classification | p2s | en | 12704958 | 42480 |
| [Stackexchange-Clustering](https://github.com/UKPLab/TWEAC-qa-agent-selection/tree/master/data/stackexchange/train) | Classification | s2s | en | 1014826 | 50530 |
| [Stackexchange-Clustering-P2P](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_title_body_jsonl) | Classification | p2s | en | 25333327 | 48800 |
| [TwentyNewsgroups-Clustering](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) | Classification | s2s | en | 11314 | 6233 |
| [AmazonPolarity](https://huggingface.co/datasets/mteb/amazon_polarity) | Classification | s2s | en | 10000 | 9007 |
| [IMDB](https://huggingface.co/datasets/mteb/imdb) | Classification | s2s | en | 10000 | 8575 |
| [banking77](https://huggingface.co/datasets/mteb/banking77) | Classification | s2s | en | 10000 | 9937 |
| [EmotionClassification](https://huggingface.co/datasets/mteb/emotion) | Classification | s2s | en | 10000 | 10000 |
| [TweetSentimentExtraction](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction) | Classification | s2s | en | 10000 | 10000 |
| [ToxicConversations](https://huggingface.co/datasets/mteb/toxic_conversations_50k) | Classification | s2s | en | 7916 | 7800 |
| [AdvertiseGen](https://huggingface.co/datasets/shibing624/AdvertiseGen) | Retrieval | s2p | zh | 20000 | 17526 |
| [CHEF](https://www.luge.ai/#/luge/dataDetail?id=44) | Retrieval | s2p | zh | 4952 | 4824 |
| [ChatMed-Dataset](https://huggingface.co/datasets/michaelwzhu/ChatMed_Consult_Dataset) | Retrieval | s2p | zh | 20000 | 18608 |
| [CMRC 2018](https://huggingface.co/datasets/erhwenkuo/squad-cmrc2018-zhtw) | Retrieval | s2p | zh | 10000 | 9753 |
| [DRCD](https://huggingface.co/datasets/voidful/DRCD) | Retrieval | s2p | zh | 5000 | 4714 |
| [LCSTS](https://huggingface.co/datasets/hugcyp/LCSTS) | Retrieval | s2p | zh | 20000 | 19535 |
| [LIMA](https://huggingface.co/datasets/paralym/lima-chinese) | Retrieval | s2p | zh | 2058 | 1991 |
| [Multi-CPR](https://github.com/Alibaba-NLP/Multi-CPR) | Retrieval | s2p | zh | 287881 | 234587 |
| [PAWS-X (zh)](https://huggingface.co/datasets/C-MTEB/PAWSX) | Retrieval | s2p | zh | 49401 | 19289 |
| [RefGPT](https://github.com/sufengniu/RefGPT/blob/main/README_EN.md) | Retrieval | s2p | zh | 50000 | 49896 |
| [T2Ranking](https://huggingface.co/datasets/THUIR/T2Ranking) | Retrieval | s2p | zh | 199412 | 188606 |
| [THUCNews](https://huggingface.co/datasets/SirlyDreamer/THUCNews) | Retrieval | s2p | zh | 20000 | 19288 |
| [UMETRIP-QA](https://www.luge.ai/#/luge/dataDetail?id=62) | Retrieval | s2p | zh | 2647 | 2537 |
| [WebCPM](https://github.com/thunlp/WebCPM) | Retrieval | s2p | zh | 1605 | 1602 |
| [cCOVID-News](https://www.datafountain.cn/competitions/424/datasets) | Retrieval | s2p | zh | 5000 | 4727 |
| [cMedQA-V2.0](https://huggingface.co/datasets/wangrongsheng/cMedQA-V2.0) | Retrieval | s2p | zh | 223851 | 88109 |
| [CSL](https://huggingface.co/datasets/neuclir/csl) | Retrieval | s2p | zh | 20000 | 19945 |
| [DuReader](https://huggingface.co/datasets/sentence-transformers/dureader) | Retrieval | s2p | zh | 80416 | 79229 |
| [DuReader_checklist](https://huggingface.co/datasets/luozhouyang/dureader) | Retrieval | s2p | zh | 99992 | 97764 |
| [law-gpt](https://huggingface.co/datasets/sentence-transformers/law-gpt) | Retrieval | s2p | zh | 500 | 500 |
| [lawzhidao](https://www.heywhale.com/mw/dataset/5e953ca8e7ec38002d02fca7/content) | Retrieval | s2p | zh | 8000 | 6784 |
| [mMARCO (zh)](https://huggingface.co/datasets/unicamp-dl/mmarco) | Retrieval | s2p | zh | 400000 | 379870 |
| [retrieval_data_llm](https://huggingface.co/datasets/infgrad/retrieval_data_llm) | Retrieval | s2p | zh | 32768 | 32551 |
| [webqa](https://huggingface.co/datasets/suolyer/webqa) | Retrieval | s2p | zh | 5000 | 4988 |
| [AFQMC](https://huggingface.co/datasets/C-MTEB/AFQMC) | STS | s2s | zh | 4041 | 3876 |
| [ATEC](https://huggingface.co/datasets/C-MTEB/ATEC) | STS | s2s | zh | 62477 | 11387 |
| [BQ](https://huggingface.co/datasets/C-MTEB/BQ) | STS | s2s | zh | 100000 | 10000 |
| [CAIL2019-SCM](https://github.com/china-ai-law-challenge/CAIL2019/tree/master/scm) | STS | s2s | zh | 5102 | 648 |
| [CINLID](https://www.luge.ai/#/luge/dataDetail?id=39) | STS | s2s | zh | 5000 | 2883 |
| [ChineseSTS](https://github.com/IAdmireu/ChineseSTS) | STS | s2s | zh | 2500 | 2497 |
| [CMNLI](https://huggingface.co/datasets/fenffef/cmnli) | STS | s2s | zh | 125356 | 119029 |
| [nli_zh](https://huggingface.co/datasets/shibing624/nli_zh) | STS | s2s | zh | 218887 | 185787 |
| [OCNLI](https://huggingface.co/datasets/Fred666/ocnli) | STS | s2s | zh | 13464 | 11937 |
| [QBQTC](https://github.com/CLUEbenchmark/QBQTC/tree/main) | STS | s2s | zh | 51620 | 47223 |
| [SimCLUE](https://github.com/CLUEbenchmark/SimCLUE) | STS | s2s | zh | 344038 | 290699 |
| [XNLI (zh)](https://huggingface.co/datasets/xnli) | STS | s2s | zh | 80000 | 74252 |
| [CSL](https://huggingface.co/datasets/neuclir/csl) | Classification | s2s, p2s | zh | 15000 | 12249 |
| [THUCNews](https://huggingface.co/datasets/SirlyDreamer/THUCNews) | Classification | s2s | zh | 10000 | 9690 |
| [TNews](https://huggingface.co/datasets/fenffef/tnews) | Classification | s2s | zh | 10000 | 6762 |
| [JDReview](https://huggingface.co/datasets/C-MTEB/JDReview-classification) | Classification | s2s | zh | 1232 | 1232 |
| [IFlyTek](https://huggingface.co/datasets/fenffef/iflytek) | Classification | s2s | zh | 10000 | 8221 |
| [OnlineShopping](https://huggingface.co/datasets/C-MTEB/OnlineShopping-classification) | Classification | s2s | zh | 7852 | 7600 |
| [Waimai](https://huggingface.co/datasets/C-MTEB/waimai-classification) | Classification | s2s | zh | 7384 | 7376 |
| [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) | Retrieval | s2p | multilingual | 30000 | 26292 |
| [MIRACL](https://huggingface.co/datasets/sentence-transformers/miracl) | Retrieval | s2p | multilingual | 40151 | 39946 |
| [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) | Retrieval | s2p | multilingual | 48729 | 46997 |
| [PAWS-X](https://huggingface.co/datasets/maximedb/paws-x-all) | STS | s2s | multilingual | 128435 | 128398 |
| [AmazonReviews](https://huggingface.co/datasets/mteb/amazon_reviews_multi) | Classification | s2s | multilingual | 10000 | 7721 |
| [AmazonCounterfactual](https://huggingface.co/datasets/mteb/amazon_counterfactual) | Classification | s2s | multilingual | 10000 | 8323 |
| [MultilingualSentiment](https://huggingface.co/datasets/mteb/multilingual-sentiment-classification) | Classification | s2s | multilingual | 10000 | 9804 |
| [Amazon Massive Intent](https://huggingface.co/datasets/mteb/amazon_massive_intent) | Classification | s2s | multilingual | 10000 | 7832 |
| [AmazonMassiveScenario](https://huggingface.co/datasets/mteb/amazon_massive_scenario) | Classification | s2s | multilingual | 10000 | 7078 |
| [MTOPDomain](https://huggingface.co/datasets/mteb/mtop_domain) | Classification | s2s | multilingual | 10000 | 9610 |
| [MTOPIntent](https://huggingface.co/datasets/mteb/mtop_intent) | Classification | s2s | multilingual | 10000 | 7952 |

## Citation

If you find these datasets useful, please consider giving a star and citing the papers listed here.
```
@misc{zhao2025kalmembeddingv2,
      title={KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model},
      author={Xinping Zhao and Xinshuo Hu and Zifei Shan and Shouzheng Huang and Yao Zhou and Xin Zhang and Zetian Sun and Zhenyu Liu and Dongfang Li and Xinyuan Wei and Youcheng Pan and Yang Xiang and Meishan Zhang and Haofen Wang and Jun Yu and Baotian Hu and Min Zhang},
      year={2025},
      eprint={2506.20923},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.20923},
}

@misc{hu2025kalmembedding,
      title={KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model},
      author={Xinshuo Hu and Zifei Shan and Xinping Zhao and Zetian Sun and Zhenyu Liu and Dongfang Li and Shaolin Ye and Xinyuan Wei and Qian Chen and Baotian Hu and Haofen Wang and Jun Yu and Min Zhang},
      year={2025},
      eprint={2501.01028},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.01028},
}
```
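
## Example: working with the sample format

The sample format described under Dataset Structure (`query`, `pos`, `neg`) can be illustrated with a minimal Python sketch. It is not code from the dataset authors. The helper names `validate_sample` and `to_triplets` and the example record are assumptions invented for illustration, and the on-disk file format is not stated in this description, so loading is left out.

```python
from typing import TypedDict


class Sample(TypedDict):
    """One record, following the documented fields."""
    query: str
    pos: list[str]
    neg: list[str]


def validate_sample(sample: dict) -> Sample:
    """Check that a record has query/pos/neg with the documented types."""
    if not isinstance(sample.get("query"), str):
        raise ValueError("'query' must be a string")
    for key in ("pos", "neg"):
        value = sample.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"'{key}' must be a list of strings")
    return Sample(query=sample["query"], pos=sample["pos"], neg=sample["neg"])


def to_triplets(sample: Sample) -> list[tuple[str, str, str]]:
    """Expand one sample into (query, positive, negative) triplets."""
    return [(sample["query"], p, n) for p in sample["pos"] for n in sample["neg"]]


# Hypothetical record, invented for illustration; not taken from the dataset.
example = {
    "query": "what is an embedding model?",
    "pos": ["An embedding model maps text to dense vectors."],
    "neg": [f"unrelated passage {i}" for i in range(7)],
}

triplets = to_triplets(validate_sample(example))
print(len(triplets))  # 1 positive x 7 negatives -> 7
```

Expanding into triplets is only one possible way to consume these records. A contrastive training loop could equally keep each sample whole and score the query against all of its positives and negatives at once.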

Provider:

maas

Created:

2025-11-01
