solyanka
ModelScope community · updated 2025-12-05 · indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/ai-forever/solyanka
# Dataset card for Solyanka
This is a collection of datasets totalling ~10 million weakly supervised text pairs for training text embedding models. Any dataset in the collection can be used in SentenceTransformers with an InfoNCE loss.
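To make the training objective concrete, here is a minimal pure-Python sketch of InfoNCE with in-batch negatives: each query is scored against every document in the batch, and the loss is the cross-entropy of picking its own paired document. The function names, the toy 2-d vectors, and the temperature of 0.05 are illustrative assumptions, not values taken from this card. In SentenceTransformers, an objective of this kind is commonly used through `losses.MultipleNegativesRankingLoss`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    query_vecs[i] is paired with doc_vecs[i]; every other document
    in the batch acts as a negative for query i.
    The temperature value is an assumption chosen for illustration.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [cosine(q, d) / temperature for d in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the paired (positive) document.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): each query points towards its own document.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]  # deliberately mismatched pairs

loss_aligned = info_nce(queries, aligned)
loss_swapped = info_nce(queries, swapped)
print(loss_aligned < loss_swapped)
```

Correctly paired embeddings give a loss near zero, while mismatched pairs are penalised heavily, which is the signal that pulls paired texts together during training.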
## Data processing
The initial pool of pairs was deduplicated and filtered by length and quality. Most documents are shorter than 512 tokens (measured with the [FRIDA](https://huggingface.co/ai-forever/FRIDA) tokenizer). Some pairs were filtered by manual rules (e.g. by post votes, rating, views). To discard low-quality pairs, we applied consistency filtering with a dataset-specific N (refer to the E5 [paper](https://arxiv.org/abs/2212.03533)).
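The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the whitespace tokenizer stands in for the FRIDA tokenizer, the hard 512-token cut-off is assumed, and the word-overlap scorer is a toy placeholder for the embedding model that consistency filtering normally relies on. A pair survives consistency filtering here only if its own document ranks within the top N candidates for its query.

```python
def dedup_pairs(pairs):
    """Drop exact duplicates after trimming whitespace and lowercasing."""
    seen, unique = set(), []
    for query, doc in pairs:
        key = (query.strip().lower(), doc.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append((query, doc))
    return unique


def filter_by_length(pairs, max_tokens=512, tokenize=str.split):
    """Keep pairs whose texts fit the token budget (assumed threshold)."""
    return [
        (q, d) for q, d in pairs
        if len(tokenize(q)) <= max_tokens and len(tokenize(d)) <= max_tokens
    ]


def overlap_score(query, doc):
    """Toy relevance score: Jaccard overlap of lowercase word sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if (q | d) else 0.0


def consistency_filter(pairs, top_n, score=overlap_score):
    """Keep a pair only if its document ranks in the top N for its query."""
    docs = [d for _, d in pairs]
    kept = []
    for i, (query, doc) in enumerate(pairs):
        own = score(query, doc)
        # Count candidate documents that score strictly higher than the paired one.
        better = sum(
            1 for j, other in enumerate(docs)
            if j != i and score(query, other) > own
        )
        if better < top_n:
            kept.append((query, doc))
    return kept


# Hypothetical toy data: one duplicate and one noisy pair.
raw = [
    ("how to install python", "install python with the official installer"),
    ("how to install python", "install python with the official installer"),
    ("best borscht recipe", "borscht recipe with beets and cabbage"),
    ("install python on windows", "weather forecast for tomorrow"),
]

deduped = dedup_pairs(raw)
clean = consistency_filter(filter_by_length(deduped), top_n=1)
print(len(raw), len(deduped), len(clean))
```

In practice the scorer would be a trained embedding model and the candidate pool would be much larger than the batch of pairs itself; N controls how strict the filter is.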
## Datasets
- 9111_questions_qa ([9111-questions](https://huggingface.co/datasets/nyuuzyou/9111-questions))
- fishkinet_posts ([fishkinet-posts](https://huggingface.co/datasets/nyuuzyou/fishkinet-posts))
- habr_qna_qa ([habr_qna](https://huggingface.co/datasets/its5Q/habr_qna))
- habr_qna_title_body ([habr_qna](https://huggingface.co/datasets/its5Q/habr_qna))
- habr_title_text ([habr](https://huggingface.co/datasets/IlyaGusev/habr))
- mail_ru_qa ([otvetmailru-full](https://www.kaggle.com/datasets/atleast6characterss/otvetmailru-full))
- msmarco_en_ru ([mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco))
- msmarco_ru_en ([mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco))
- msmarco_ru_ru ([mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco))
- pikabu_title_text ([pikabu](https://huggingface.co/datasets/IlyaGusev/pikabu))
- ru_sci_bench ([ru_sci_bench](https://huggingface.co/datasets/mlsa-iai-msu-lab/ru_sci_bench))
- stackoverflow_qa ([ru_stackoverflow](https://huggingface.co/datasets/IlyaGusev/ru_stackoverflow))
- stackoverflow_title_body ([ru_stackoverflow](https://huggingface.co/datasets/IlyaGusev/ru_stackoverflow))
- swim_ir_ru_en ([swim-ir-cross-lingual](https://huggingface.co/datasets/nthakur/swim-ir-cross-lingual))
- buriy ([ru_news](https://huggingface.co/datasets/IlyaGusev/ru_news))
- lenta ([ru_news](https://huggingface.co/datasets/IlyaGusev/ru_news))
- ods_tass ([ru_news](https://huggingface.co/datasets/IlyaGusev/ru_news))
- taiga_fontanka ([ru_news](https://huggingface.co/datasets/IlyaGusev/ru_news))
- telegram_contest ([ru_news](https://huggingface.co/datasets/IlyaGusev/ru_news))
- wikiomnia ([wikiomnia](https://huggingface.co/datasets/RussianNLP/wikiomnia))
- xlsum_summary_text ([xlsum](https://huggingface.co/datasets/csebuetnlp/xlsum))
- xlsum_title_text ([xlsum](https://huggingface.co/datasets/csebuetnlp/xlsum))
- yandex_q_qa ([yandex_q_full](https://huggingface.co/datasets/IlyaGusev/yandex_q_full))
- yandex_q_title_body ([yandex_q_full](https://huggingface.co/datasets/IlyaGusev/yandex_q_full))
## License for the Dataset Collection
This dataset collection is provided under the MIT license, except in cases where a specific dataset has a more restrictive license that may limit the use of the data (e.g., licenses that prohibit commercial use or have other restrictions).
## Terms of Use
1. The user assumes responsibility for checking and complying with the terms of the licenses for each of the datasets, links to which are provided above.
2. Use of this collection is permitted only if the source licenses for the datasets allow such use.
3. In cases where a specific dataset has more restrictive terms, those terms take precedence over the MIT license for this collection.
## Language
Russian is the primary language, but some datasets contain English text for cross-lingual retrieval experiments.
## Authors
- [SaluteDevices](https://sberdevices.ru/) AI for B2C RnD Team
- Artem Snegirev: [HF profile](https://huggingface.co/artemsnegirev), [GitHub](https://github.com/artemsnegirev)
- Anna Maksimova: [HF profile](https://huggingface.co/anpalmak)
- Aleksandr Abramov: [HF profile](https://huggingface.co/Andrilko), [GitHub](https://github.com/Ab1992ao), [Kaggle Competitions Master](https://www.kaggle.com/andrilko)
## Citation
...
Provider: maas
Created: 2025-05-26



