KaLM-embedding-finetuning-data
Source: ModelScope community. Updated 2026-01-08; indexed 2025-11-15.
Download link:
https://modelscope.cn/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data
Resource description:
*The pretraining dataset is available at this link: [HIT-TMG/KaLM-embedding-pretrain-data](https://huggingface.co/datasets/HIT-TMG/KaLM-embedding-pretrain-data).*
## Languages
English, Chinese, Multilingual
## Dataset Structure
Each sample in the datasets has the following format:
- query, `string`, one query per sample
- pos, `list[string]`, usually containing one positive example
- neg, `list[string]`, usually containing seven negative examples
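The field layout above can be checked with a few lines of Python. This is a minimal sketch, assuming each sample is stored as one JSON object per line (the card does not state the on-disk format); the helper names `parse_sample` and `to_triplets` and the example record are hypothetical and are not part of the dataset or any library.

```python
import json


def parse_sample(line: str) -> dict:
    """Parse one JSON line and check it against the documented fields."""
    sample = json.loads(line)
    if not isinstance(sample.get("query"), str):
        raise TypeError("query must be a string")
    for key in ("pos", "neg"):
        values = sample.get(key)
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise TypeError(f"{key} must be a list of strings")
    return sample


def to_triplets(sample: dict) -> list[tuple[str, str, str]]:
    """Expand one sample into (query, positive, negative) triplets."""
    return [(sample["query"], p, n) for p in sample["pos"] for n in sample["neg"]]


# Hypothetical record, invented for illustration (not taken from the data).
example_line = json.dumps({
    "query": "what is an embedding model?",
    "pos": ["An embedding model maps text to dense vectors."],
    "neg": [f"unrelated passage {i}" for i in range(7)],
})

sample = parse_sample(example_line)
triplets = to_triplets(sample)
print(len(triplets))  # one positive and seven negatives give 7 triplets
```

Expanding into triplets is only one way to consume a sample; a trainer that accepts a query with a list of negatives can use the parsed dictionary directly.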
## Dataset Summary
All of these datasets have been preprocessed and can be used to fine-tune your embedding models.
| Source | Type | Category | Language | Pairs | Pairs (filtered) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| [CodeFeedback](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) | Retrieval | s2p | en | 50000 | 49090 |
| [ELI5](https://huggingface.co/datasets/rusano/ELI5_custom) | Retrieval | s2p | en | 100000 | 76408 |
| [ExpertQA](https://github.com/chaitanyamalaviya/ExpertQA) | Retrieval | s2p | en | 1261 | 1252 |
| [GooAQ](https://github.com/allenai/gooaq) | Retrieval | s2p | en | 50000 | 49833 |
| [MEDI2BGE](https://hf.co/datasets/GritLM/MEDI2BGE) | Retrieval | s2p | en | 100000 | 71790 |
| [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) | Retrieval | s2p | en | 40000 | 38623 |
| [PAQ](https://huggingface.co/datasets/sentence-transformers/paq) | Retrieval | s2p | en | 50000 | 49849 |
| [PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA) | Retrieval | s2p | en | 80000 | 79954 |
| [SearchQA](https://huggingface.co/datasets/kyunghyuncho/search_qa) | Retrieval | s2p | en | 10000 | 9988 |
| [arxiv_qa](https://huggingface.co/datasets/TitanMLData/arxiv_qa) | Retrieval | s2p | en | 23397 | 17927 |
| [CC-News](https://huggingface.co/datasets/intfloat/multilingual_cc_news) | Retrieval | s2p | en | 30000 | 28246 |
| [TREC-COVID](https://huggingface.co/datasets/irds/cord19_trec-covid) | Retrieval | s2p | en | 50000 | 48517 |
| [DBpedia-Entity](https://huggingface.co/datasets/BeIR/dbpedia-entity-generated-queries) | Retrieval | s2p | en | 100000 | 96792 |
| [ESCI](https://huggingface.co/datasets/tasksource/esci) | Retrieval | s2p | en | 30000 | 26043 |
| [FEVER](https://huggingface.co/datasets/maxzoech/fever) | Retrieval | s2p | en | 87855 | 87216 |
| [FiQA](https://huggingface.co/datasets/irds/beir_fiqa_train) | Retrieval | s2p | en | 5490 | 4689 |
| [HotpotQA](https://huggingface.co/datasets/hotpotqa/hotpot_qa) | Retrieval | s2p | en | 184057 | 150153 |
| [MLDR](https://huggingface.co/datasets/Shitao/MLDR) | Retrieval | s2p | en | 41434 | 31097 |
| [MSMARCO](https://huggingface.co/datasets/Tevatron/msmarco-passage) | Retrieval | s2p | en | 175133 | 174190 |
| [MSMARCO-v2](https://huggingface.co/datasets/mteb/msmarco-v2) | Retrieval | s2p | en | 277144 | 258617 |
| [NFCorpus](https://huggingface.co/datasets/BeIR/nfcorpus-generated-queries) | Retrieval | s2p | en | 10824 | 10471 |
| [rag-dataset-12000](https://huggingface.co/datasets/neural-bridge/rag-dataset-12000) | Retrieval | s2p | en | 9590 | 9272 |
| [SciFact](https://huggingface.co/datasets/Tevatron/scifact) | Retrieval | s2p | en | 809 | 794 |
| [SQuAD 2.0](https://huggingface.co/datasets/rajpurkar/squad_v2) | Retrieval | s2p | en | 130217 | 125816 |
| [TriviaQA](https://huggingface.co/datasets/multi-train/emb-triviaqa-train) | Retrieval | s2p | en | 52886 | 44442 |
| [WebGPT Comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons) | Retrieval | s2p | en | 19242 | 18924 |
| [Natural Questions](https://huggingface.co/datasets/Tevatron/wikipedia-nq) | Retrieval | s2p | en | 58622 | 56377 |
| [Yahoo Answers](https://huggingface.co/datasets/sentence-transformers/yahoo-answers) | Retrieval | s2p | en | 30000 | 21724 |
| [CQADupStack](http://nlp.cis.unimelb.edu.au/resources/cqadupstack/) | Retrieval | s2p | en | 24045 | 7356 |
| [ContractNLI](https://huggingface.co/datasets/kiddothe2b/contract-nli) | STS | s2s | en | 3195 | 628 |
| [MultiNLI](https://huggingface.co/datasets/SetFit/mnli) | STS | s2s | en | 64674 | 63701 |
| [NLLB](https://huggingface.co/datasets/breakend/nllb-multi-domain) | STS | s2s | en | 36000 | 26504 |
| [Quora](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) | STS | s2s | en | 92674 | 89558 |
| [WikiAnswers](https://huggingface.co/datasets/multi-train/WikiAnswers_1107) | STS | s2s | en | 50000 | 47686 |
| [SimCSE NLI](https://huggingface.co/datasets/JeremiahZ/simcse_sup_nli) | STS | s2s | en | 252397 | 217099 |
| [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) | STS | s2s | en | 24686 | 16480 |
| [arXiv](https://huggingface.co/datasets/mteb/raw_arxiv) | Classification | s2s, p2s | en | 15000 | 14529 |
| [Biorxiv](https://huggingface.co/datasets/mteb/raw_biorxiv) | Classification | s2s, p2s | en | 6862 | 6787 |
| [Medrxiv](https://huggingface.co/datasets/mteb/raw_medrxiv) | Classification | s2s, p2s | en | 2012 | 1999 |
| [Reddit-Clustering](https://github.com/UKPLab/TWEAC-qa-agent-selection/tree/master/data/reddit/train) | Classification | s2s | en | 128000 | 25600 |
| [Reddit-Clustering-P2P](https://huggingface.co/datasets/sentence-transformers/reddit-title-body) | Classification | p2s | en | 12704958 | 42480 |
| [Stackexchange-Clustering](https://github.com/UKPLab/TWEAC-qa-agent-selection/tree/master/data/stackexchange/train) | Classification | s2s | en | 1014826 | 50530 |
| [Stackexchange-Clustering-P2P](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_title_body_jsonl) | Classification | p2s | en | 25333327 | 48800 |
| [TwentyNewsgroups-Clustering](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) | Classification | s2s | en | 11314 | 6233 |
| [AmazonPolarity](https://huggingface.co/datasets/mteb/amazon_polarity) | Classification | s2s | en | 10000 | 9007 |
| [IMDB](https://huggingface.co/datasets/mteb/imdb) | Classification | s2s | en | 10000 | 8575 |
| [banking77](https://huggingface.co/datasets/mteb/banking77) | Classification | s2s | en | 10000 | 9937 |
| [EmotionClassification](https://huggingface.co/datasets/mteb/emotion) | Classification | s2s | en | 10000 | 10000 |
| [TweetSentimentExtraction](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction) | Classification | s2s | en | 10000 | 10000 |
| [ToxicConversations](https://huggingface.co/datasets/mteb/toxic_conversations_50k) | Classification | s2s | en | 7916 | 7800 |
| [AdvertiseGen](https://huggingface.co/datasets/shibing624/AdvertiseGen) | Retrieval | s2p | zh | 20000 | 17526 |
| [CHEF](https://www.luge.ai/#/luge/dataDetail?id=44) | Retrieval | s2p | zh | 4952 | 4824 |
| [ChatMed-Dataset](https://huggingface.co/datasets/michaelwzhu/ChatMed_Consult_Dataset) | Retrieval | s2p | zh | 20000 | 18608 |
| [CMRC 2018](https://huggingface.co/datasets/erhwenkuo/squad-cmrc2018-zhtw) | Retrieval | s2p | zh | 10000 | 9753 |
| [DRCD](https://huggingface.co/datasets/voidful/DRCD) | Retrieval | s2p | zh | 5000 | 4714 |
| [LCSTS](https://huggingface.co/datasets/hugcyp/LCSTS) | Retrieval | s2p | zh | 20000 | 19535 |
| [LIMA](https://huggingface.co/datasets/paralym/lima-chinese) | Retrieval | s2p | zh | 2058 | 1991 |
| [Multi-CPR](https://github.com/Alibaba-NLP/Multi-CPR) | Retrieval | s2p | zh | 287881 | 234587 |
| [PAWS-X (zh)](https://huggingface.co/datasets/C-MTEB/PAWSX) | Retrieval | s2p | zh | 49401 | 19289 |
| [RefGPT](https://github.com/sufengniu/RefGPT/blob/main/README_EN.md) | Retrieval | s2p | zh | 50000 | 49896 |
| [T2Ranking](https://huggingface.co/datasets/THUIR/T2Ranking) | Retrieval | s2p | zh | 199412 | 188606 |
| [THUCNews](https://huggingface.co/datasets/SirlyDreamer/THUCNews) | Retrieval | s2p | zh | 20000 | 19288 |
| [UMETRIP-QA](https://www.luge.ai/#/luge/dataDetail?id=62) | Retrieval | s2p | zh | 2647 | 2537 |
| [WebCPM](https://github.com/thunlp/WebCPM) | Retrieval | s2p | zh | 1605 | 1602 |
| [cCOVID-News](https://www.datafountain.cn/competitions/424/datasets) | Retrieval | s2p | zh | 5000 | 4727 |
| [cMedQA-V2.0](https://huggingface.co/datasets/wangrongsheng/cMedQA-V2.0) | Retrieval | s2p | zh | 223851 | 88109 |
| [CSL](https://huggingface.co/datasets/neuclir/csl) | Retrieval | s2p | zh | 20000 | 19945 |
| [DuReader](https://huggingface.co/datasets/sentence-transformers/dureader) | Retrieval | s2p | zh | 80416 | 79229 |
| [DuReader_checklist](https://huggingface.co/datasets/luozhouyang/dureader) | Retrieval | s2p | zh | 99992 | 97764 |
| [law-gpt](https://huggingface.co/datasets/sentence-transformers/law-gpt) | Retrieval | s2p | zh | 500 | 500 |
| [lawzhidao](https://www.heywhale.com/mw/dataset/5e953ca8e7ec38002d02fca7/content) | Retrieval | s2p | zh | 8000 | 6784 |
| [mMARCO (zh)](https://huggingface.co/datasets/unicamp-dl/mmarco) | Retrieval | s2p | zh | 400000 | 379870 |
| [retrieval_data_llm](https://huggingface.co/datasets/infgrad/retrieval_data_llm) | Retrieval | s2p | zh | 32768 | 32551 |
| [webqa](https://huggingface.co/datasets/suolyer/webqa) | Retrieval | s2p | zh | 5000 | 4988 |
| [AFQMC](https://huggingface.co/datasets/C-MTEB/AFQMC) | STS | s2s | zh | 4041 | 3876 |
| [ATEC](https://huggingface.co/datasets/C-MTEB/ATEC) | STS | s2s | zh | 62477 | 11387 |
| [BQ](https://huggingface.co/datasets/C-MTEB/BQ) | STS | s2s | zh | 100000 | 10000 |
| [CAIL2019-SCM](https://github.com/china-ai-law-challenge/CAIL2019/tree/master/scm) | STS | s2s | zh | 5102 | 648 |
| [CINLID](https://www.luge.ai/#/luge/dataDetail?id=39) | STS | s2s | zh | 5000 | 2883 |
| [ChineseSTS](https://github.com/IAdmireu/ChineseSTS) | STS | s2s | zh | 2500 | 2497 |
| [CMNLI](https://huggingface.co/datasets/fenffef/cmnli) | STS | s2s | zh | 125356 | 119029 |
| [nli_zh](https://huggingface.co/datasets/shibing624/nli_zh) | STS | s2s | zh | 218887 | 185787 |
| [OCNLI](https://huggingface.co/datasets/Fred666/ocnli) | STS | s2s | zh | 13464 | 11937 |
| [QBQTC](https://github.com/CLUEbenchmark/QBQTC/tree/main) | STS | s2s | zh | 51620 | 47223 |
| [SimCLUE](https://github.com/CLUEbenchmark/SimCLUE) | STS | s2s | zh | 344038 | 290699 |
| [XNLI (zh)](https://huggingface.co/datasets/xnli) | STS | s2s | zh | 80000 | 74252 |
| [CSL](https://huggingface.co/datasets/neuclir/csl) | Classification | s2s, p2s | zh | 15000 | 12249 |
| [THUCNews](https://huggingface.co/datasets/SirlyDreamer/THUCNews) | Classification | s2s | zh | 10000 | 9690 |
| [TNews](https://huggingface.co/datasets/fenffef/tnews) | Classification | s2s | zh | 10000 | 6762 |
| [JDReview](https://huggingface.co/datasets/C-MTEB/JDReview-classification) | Classification | s2s | zh | 1232 | 1232 |
| [IFlyTek](https://huggingface.co/datasets/fenffef/iflytek) | Classification | s2s | zh | 10000 | 8221 |
| [OnlineShopping](https://huggingface.co/datasets/C-MTEB/OnlineShopping-classification) | Classification | s2s | zh | 7852 | 7600 |
| [Waimai](https://huggingface.co/datasets/C-MTEB/waimai-classification) | Classification | s2s | zh | 7384 | 7376 |
| [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) | Retrieval | s2p | multilingual | 30000 | 26292 |
| [MIRACL](https://huggingface.co/datasets/sentence-transformers/miracl) | Retrieval | s2p | multilingual | 40151 | 39946 |
| [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) | Retrieval | s2p | multilingual | 48729 | 46997 |
| [PAWS-X](https://huggingface.co/datasets/maximedb/paws-x-all) | STS | s2s | multilingual | 128435 | 128398 |
| [AmazonReviews](https://huggingface.co/datasets/mteb/amazon_reviews_multi) | Classification | s2s | multilingual | 10000 | 7721 |
| [AmazonCounterfactual](https://huggingface.co/datasets/mteb/amazon_counterfactual) | Classification | s2s | multilingual | 10000 | 8323 |
| [MultilingualSentiment](https://huggingface.co/datasets/mteb/multilingual-sentiment-classification) | Classification | s2s | multilingual | 10000 | 9804 |
| [Amazon Massive Intent](https://huggingface.co/datasets/mteb/amazon_massive_intent) | Classification | s2s | multilingual | 10000 | 7832 |
| [AmazonMassiveScenario](https://huggingface.co/datasets/mteb/amazon_massive_scenario) | Classification | s2s | multilingual | 10000 | 7078 |
| [MTOPDomain](https://huggingface.co/datasets/mteb/mtop_domain) | Classification | s2s | multilingual | 10000 | 9610 |
| [MTOPIntent](https://huggingface.co/datasets/mteb/mtop_intent) | Classification | s2s | multilingual | 10000 | 7952 |
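The two pair-count columns can be compared to see how much of each source remains after filtering. The sketch below copies three rows from the table and computes the retained share; the `retention` helper is a hypothetical name, and the card does not describe the filtering criteria, so the code only reports the ratio.

```python
def retention(pairs: int, filtered: int) -> float:
    """Return the fraction of pairs that remain after filtering."""
    if pairs <= 0:
        raise ValueError("pairs must be positive")
    return filtered / pairs


# (pairs, filtered) counts copied from three rows of the table above.
rows = {
    "CQADupStack": (24045, 7356),
    "ContractNLI": (3195, 628),
    "EmotionClassification": (10000, 10000),
}

for name, (pairs, filtered) in rows.items():
    print(f"{name}: {retention(pairs, filtered):.1%} retained")
```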
## Citation
If you find these datasets useful, please consider giving the project a star and citing our work.
```
@misc{zhao2025kalmembeddingv2,
title={KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model},
author={Xinping Zhao and Xinshuo Hu and Zifei Shan and Shouzheng Huang and Yao Zhou and Xin Zhang and Zetian Sun and Zhenyu Liu and Dongfang Li and Xinyuan Wei and Youcheng Pan and Yang Xiang and Meishan Zhang and Haofen Wang and Jun Yu and Baotian Hu and Min Zhang},
year={2025},
eprint={2506.20923},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.20923},
}
@misc{hu2025kalmembedding,
title={KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model},
author={Xinshuo Hu and Zifei Shan and Xinping Zhao and Zetian Sun and Zhenyu Liu and Dongfang Li and Shaolin Ye and Xinyuan Wei and Qian Chen and Baotian Hu and Haofen Wang and Jun Yu and Min Zhang},
year={2025},
eprint={2501.01028},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.01028},
}
```
Provider: maas
Created: 2025-11-01



