1M Wiki Corpus
Source: arXiv | Indexed: 2025-09-30
Download link:
https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse
Description:
This dataset contains 1 million sentences sourced from Wikipedia and is used primarily as training data. It is also used together with the unsupervised SICKR dataset to train sentence embeddings in unsupervised contrastive learning tasks.
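As a rough illustration of how a sentence corpus like this might be fed to a training loop, the sketch below reads a local text file and yields shuffled mini-batches of sentences. It assumes the corpus has already been downloaded and stores one sentence per line; that layout, the function names `read_sentences` and `iter_batches`, and the batch size are illustrative assumptions, not details given in the listing.

```python
import random
from typing import Iterator, List, Optional


def read_sentences(path: str, limit: Optional[int] = None) -> List[str]:
    """Read non-empty lines from a text file.

    Assumption: the file holds one sentence per line (not confirmed by the listing).
    """
    sentences: List[str] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            text = line.strip()
            if not text:
                continue  # skip blank lines
            sentences.append(text)
            if limit is not None and len(sentences) >= limit:
                break
    return sentences


def iter_batches(sentences: List[str], batch_size: int, seed: int = 0) -> Iterator[List[str]]:
    """Yield shuffled mini-batches; the last batch may be smaller than batch_size."""
    order = list(range(len(sentences)))
    random.Random(seed).shuffle(order)  # seeded so runs are repeatable
    for start in range(0, len(order), batch_size):
        yield [sentences[i] for i in order[start:start + batch_size]]
```

With a local copy of the corpus, `read_sentences(path, limit=1000)` would load the first 1,000 sentences for a quick experiment. In an unsupervised contrastive setup, the other sentences in each batch typically serve as negatives.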
Provider:
Hugging Face



