embeddings-pre-training
ModelScope community. Updated 2026-01-09, indexed 2025-08-23.
Download link:
https://modelscope.cn/datasets/lightonai/embeddings-pre-training
Description:
This large-scale dataset is designed for pre-training state-of-the-art text embedding models. It primarily contains diverse, contrastive data in English.
***
## Dataset Structure
The dataset includes the following columns:
* `query`: The input text.
* `document`: The corresponding document text.
* `index`: A unique identifier for each row.
* `drop`: A boolean value indicating whether a row should be excluded during pre-training.
* `duplicate`: If not `null`, this contains the `index` of a row with a duplicate query and document. If a row has multiple duplicates, the smallest of their `index` values should be used.
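
The `duplicate` rule above can be illustrated on a few toy rows. This is only a sketch of the described behaviour, not the pipeline used to build the dataset: `mark_duplicates` is a hypothetical helper, and it assumes that the first occurrence of each (query, document) pair keeps `duplicate` as `null` while later copies point at the smallest `index`.

```python
def mark_duplicates(rows):
    """Return copies of `rows` with a `duplicate` field filled in.

    Assumption (not confirmed by the dataset card): the row with the
    smallest index for a given (query, document) pair keeps
    `duplicate = None`; every other copy points at that smallest index.
    """
    smallest_index = {}  # (query, document) -> smallest index seen
    for row in rows:
        key = (row["query"], row["document"])
        if key not in smallest_index or row["index"] < smallest_index[key]:
            smallest_index[key] = row["index"]

    marked = []
    for row in rows:
        canonical = smallest_index[(row["query"], row["document"])]
        duplicate = None if row["index"] == canonical else canonical
        marked.append({**row, "duplicate": duplicate})
    return marked


# Toy rows: indexes 2 and 3 repeat the pair stored at index 0.
rows = [
    {"index": 0, "query": "q1", "document": "d1"},
    {"index": 1, "query": "q2", "document": "d2"},
    {"index": 2, "query": "q1", "document": "d1"},
    {"index": 3, "query": "q1", "document": "d1"},
]
marked = mark_duplicates(rows)
print([row["duplicate"] for row in marked])  # [None, None, 0, 0]
```

Under this assumption, filtering on a `null` `duplicate` keeps exactly one copy of each pair.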
***
## Recommended Usage
For optimal model pre-training, use the subset of rows where the `drop` column is `False` and the `duplicate` column is `null`. The complete dataset, including rows marked as dropped or duplicated, is provided so that the data cleaning process can be analyzed and incrementally improved; that process is still a work in progress. The usable rows can be selected with a query such as:
```sql
-- "index" and "drop" are reserved words in most SQL dialects, so they are quoted
SELECT "index", query, document
FROM "lightonai/embeddings-pre-training"
WHERE NOT "drop" AND duplicate IS NULL
```
Each dataset is a distinct configuration within `lightonai/embeddings-pre-training`. To load a specific dataset you will need to specify the configuration and the split:
```python
from datasets import load_dataset

# Load the "wikihow" configuration; each dataset in the table below is its own configuration.
dataset = load_dataset(
    "lightonai/embeddings-pre-training",
    "wikihow",
    split="train",
)
```
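
The recommended filtering can also be applied in Python after loading. The following is a minimal sketch, assuming that `null` values in `duplicate` surface as `None` and that `drop` is a plain boolean when rows are read through the `datasets` library; `keep_row` and `load_clean_config` are illustrative names, not part of the dataset or the library.

```python
def keep_row(row):
    """True when a row is neither flagged for dropping nor marked as a duplicate."""
    return (not row["drop"]) and row["duplicate"] is None


def load_clean_config(config_name, split="train"):
    """Load one configuration and keep only the recommended rows."""
    # Imported lazily so that keep_row can be used without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(
        "lightonai/embeddings-pre-training",
        config_name,
        split=split,
    )
    # Dataset.filter keeps the rows for which the function returns True.
    return dataset.filter(keep_row)
```

Calling `load_clean_config("wikihow")` downloads that configuration and returns only the rows that pass `keep_row`.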
| Dataset | MGTE Training | Language | Source |
| :---------------------------------- | :-----------: | :------------ | :----- |
| agnews | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/agnews) |
| altlex | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/altlex) |
| amazon_qa | ✅ | English | [nomic](https://huggingface.co/datasets/nomic-ai/nomic-embed-unsupervised-data) |
| amazon_reviews | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/amazon-reviews) |
| arxiv_title_abstract | ✅ | English | [universetbd](https://huggingface.co/datasets/UniverseTBD/arxiv-abstracts-large) |
| beir_dbpedia | ✅ | English | [beir](https://huggingface.co/datasets/BeIR/dbpedia-entity) |
| biorxiv_title_abstract | ✅ | English | [laion](https://huggingface.co/datasets/laion/biorXiv_metadata) |
| cnn_dailymail | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| codesearchnet | | English | [st](https://huggingface.co/datasets/sentence-transformers/codesearchnet) |
| msmarco | ✅ | English | [microsoft](https://huggingface.co/datasets/microsoft/ms_marco) |
| cc_news_fr | ✅ | French | [intfloat](https://huggingface.co/datasets/intfloat/multilingual_cc_news) |
| cc_news_en | ✅ | English | [nomic](https://huggingface.co/datasets/nomic-ai/nomic-embed-unsupervised-data) |
| eli5 | | English | [st](https://huggingface.co/datasets/sentence-transformers/eli5) |
| gooaq_qa | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| hermes | | English | [teknium](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |
| medrxiv_title_abstract | ✅ | English | [mteb](https://huggingface.co/datasets/mteb/raw_medrxiv) |
| nllb_eng_fra | | Cross lingual | [allenai](https://huggingface.co/datasets/allenai/nllb) |
| npr | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/npr) |
| paq | | English | [st](https://huggingface.co/datasets/sentence-transformers/paq) |
| reddit | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/reddit) |
| reddit_body_comment | ✅ | English | [hf](https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_submissions), [pushshift](https://huggingface.co/datasets/fddemarco/pushshift-reddit-comments) |
| s2orc_abstract_citation | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/s2orc) |
| s2orc_citation_titles | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/s2orc) |
| s2orc_title_abstract | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/s2orc) |
| sentence_compression | | English | [st](https://huggingface.co/datasets/sentence-transformers/sentence-compression) |
| simplewiki | | English | [st](https://huggingface.co/datasets/sentence-transformers/simple-wiki) |
| stackexchange_body_body | | English | [st](https://huggingface.co/datasets/sentence-transformers/stackexchange-duplicates) |
| stackexchange_duplicate_questions | | English | [st](https://huggingface.co/datasets/sentence-transformers/stackexchange-duplicates) |
| stackexchange_qa | ✅ | English | [flax](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_title_best_voted_answer_jsonl) |
| stackexchange_title_body | ✅ | English | [flax](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_title_best_voted_answer_jsonl) |
| stackoverflow_title_body | ✅ | English | [flax](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_title_best_voted_answer_jsonl) |
| webfaq_eng | | English | [padas-lab](https://huggingface.co/datasets/PaDaS-Lab/webfaq) |
| webfaq_fra | | French | [padas-lab](https://huggingface.co/datasets/PaDaS-Lab/webfaq) |
| wikihow | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| wikipedia | ✅ | English | [wikimedia](https://huggingface.co/datasets/wikimedia/wikipedia) |
| yahoo_answer | | English | [st](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| yahoo_qa | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/yahoo-answers/viewer/title-answer-pair) |
| yahoo_question_body | ✅ | English | [st](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
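
Since every row of the table is a separate configuration, several of them can be loaded in a loop. The sketch below assumes each configuration exposes a `train` split, as in the `wikihow` example; `SELECTED_CONFIGS`, `load_configs`, and its `loader` parameter are hypothetical names introduced only for illustration.

```python
REPO_ID = "lightonai/embeddings-pre-training"

# A small illustrative selection of configuration names taken from the table.
SELECTED_CONFIGS = ["agnews", "msmarco", "wikihow"]


def load_configs(names, split="train", loader=None):
    """Load each named configuration and return a {name: dataset} mapping.

    `loader` defaults to `datasets.load_dataset`; passing another callable
    with the same (repo_id, config_name, split=...) shape makes the helper
    usable without network access, for example in tests.
    """
    if loader is None:
        from datasets import load_dataset  # lazy import

        loader = load_dataset
    return {name: loader(REPO_ID, name, split=split) for name in names}
```

`load_configs(SELECTED_CONFIGS)` returns a dictionary keyed by configuration name, so each subset can be filtered or inspected on its own before any mixing.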
If you would like to contribute to this dataset, message me at raphael.sourty@lighton.ai
Provider:
maas
Created:
2025-08-19



