lightonai/embeddings-pre-training-curated
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/lightonai/embeddings-pre-training-curated
Dataset description:
---
dataset_info:
  features:
  - name: index
    dtype: int64
  - name: query
    dtype: string
  - name: document
    dtype: string
  - name: similarity
    dtype: float32
configs:
- config_name: agnews
  data_files: agnews-*
- config_name: altlex
  data_files: altlex-*
- config_name: amazon_qa
  data_files: amazon_qa-*
- config_name: amazon_reviews
  data_files: amazon_reviews-*
- config_name: arxiv_title_abstract
  data_files: arxiv_title_abstract-*
- config_name: beir_dbpedia
  data_files: beir_dbpedia-*
- config_name: biorxiv_title_abstract
  data_files: biorxiv_title_abstract-*
- config_name: cc_news_en
  data_files: ccnews_en-*
- config_name: cnn_dailymail
  data_files: cnn_dailymail-*
- config_name: fw_edu
  data_files: fw-edu-*
- config_name: gooaq_qa
  data_files: gooaq_qa-*
- config_name: wikipedia_hlp_cm
  data_files: hlp_wikipedia_cm*
- config_name: wikipedia_hlp_dl
  data_files: hlp_wikipedia_dl*
- config_name: medrxiv_title_abstract
  data_files: medrxiv_title_abstract-*
- config_name: msmarco
  data_files: msmarco-*
- config_name: mtp
  data_files: mtp-*
- config_name: npr
  data_files: npr-*
- config_name: paq
  data_files: paq-*
- config_name: quora
  data_files: quora-*
- config_name: reddit
  data_files: reddit-*
- config_name: reddit_body_comment
  data_files: reddit_body_comment-*
- config_name: s2orc_abstract_citation
  data_files: s2orc_abstract_citation-*
- config_name: s2orc_citation_titles
  data_files: s2orc_citation_titles-*
- config_name: s2orc_title_abstract
  data_files: s2orc_title_abstract-*
- config_name: stackexchange_body_body
  data_files: stackexchange_body_body-*
- config_name: stackexchange_duplicate_questions
  data_files: stackexchange_duplicate_questions-*
- config_name: stackexchange_qa
  data_files: stackexchange_qa-*
- config_name: stackexchange_title_body
  data_files: stackexchange_title_body-*
- config_name: stackoverflow_title_body
  data_files: stackoverflow_title_body-*
- config_name: wikianswers
  data_files: wikianswers-*
- config_name: wikihow
  data_files: wikihow-*
- config_name: yahoo_answer
  data_files: yahoo_answer-*
- config_name: yahoo_qa
  data_files: yahoo_qa-*
- config_name: yahoo_question_body
  data_files: yahoo_question_body-*
size_categories:
- 100M<n<1B
language:
- en
tags:
- text-embeddings
- text-retrieval
- pre-training
- mgte
---
# Embeddings pre-training curated data
This dataset is the **English subset** of [`lightonai/embeddings-pre-training`](https://huggingface.co/datasets/lightonai/embeddings-pre-training), assembled to reproduce the English data recipe described in the [mGTE technical report](https://arxiv.org/abs/2407.19669) (Zhang et al., 2024).
The [mGTE paper](https://arxiv.org/abs/2407.19669) describes the data sources used to train the GTE family of multilingual text embedding and reranking models, but does not release the data itself. This dataset is our reconstruction of the English portion of that recipe, curated as part of a research effort to understand how data composition affects retrieval model quality.
For more information, see our [blog post](https://huggingface.co/blog/lightonai/denseon-lateon).
> For the full multilingual collection (50+ subsets across multiple languages), see the parent dataset: [`lightonai/embeddings-pre-training`](https://huggingface.co/datasets/lightonai/embeddings-pre-training)
---
## Licensing
**This dataset is not openly licensed.** Each source retains its original license. We do not relicense any data. Users are responsible for verifying that their intended use complies with the license terms of each individual source before downloading or using this data. The "Original Source" column in the tables below links to where license information can be found.
---
## Dataset Structure
Each row is a text pair with the following columns:
| Column | Type | Description |
|:-------|:-----|:------------|
| `index` | `int64` | Row identifier, inherited from the parent `embeddings-pre-training` dataset |
| `query` | `string` | The input text |
| `document` | `string` | The corresponding document text |
| `similarity` | `float32` | Query–document relevance score from a cross-encoder reranker (`mxbai-rerank-large-v2`) |
> **Note on schema vs parent dataset.** The parent `lightonai/embeddings-pre-training` additionally carries `drop` (bool) and `duplicate` (int64) columns produced by the per-source filter pipeline and MD5 deduplication. Those annotations have already been **applied** to produce `embeddings-pre-training-curated` (see "Curation & Filtering" below), so they are no longer present in the shipped parquet files. If you need access to the unfiltered pool with the raw annotations, go back to the parent dataset.
---
## Quick Start
```python
from datasets import load_dataset
# Load a specific subset
dataset = load_dataset(
"lightonai/embeddings-pre-training-curated",
"msmarco",
split="train",
)
```
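Every row in `embeddings-pre-training-curated` has already passed the recommended curation
pipeline, so no post-filter is required. To be stricter on
semantic relevance, raise the `similarity` floor. Do not apply such a floor to
`wikipedia_hlp_cm` or `wikipedia_hlp_dl`: their `similarity` is a placeholder
zero, so every row would be dropped. For example: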
```python
dataset = dataset.filter(lambda x: x["similarity"] >= 5.0)
```
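Some subsets are large (`fw_edu` is listed at 257.04 GB in the tables below), so downloading a full split just to inspect a few rows can be wasteful. The following is a minimal sketch, assuming the standard `streaming=True` option of `datasets.load_dataset`. The helper names `take` and `peek_subset` are hypothetical and are not part of this dataset's tooling.

```python
from itertools import islice


def take(rows, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(rows, n))


def peek_subset(config_name, n=3):
    """Stream the first n rows of one subset without downloading it fully.

    Hypothetical helper: requires the `datasets` package and network access.
    """
    # Imported lazily so that `take` stays usable without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(
        "lightonai/embeddings-pre-training-curated",
        config_name,
        split="train",
        streaming=True,  # rows are yielded lazily instead of materialized
    )
    return take(stream, n)


# Example (needs network access):
#     for row in peek_subset("msmarco"):
#         print(row["similarity"], row["query"][:80])
```

In streaming mode each row arrives as a plain dict with the four columns described above, so the same `similarity` comparison can be applied row by row.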
---
## Curation & Filtering
`embeddings-pre-training-curated` is derived from `lightonai/embeddings-pre-training` by
applying, in this order:
1. **Source-aware rule-based filters** — a per-source pipeline of up
to 18 filters (policy boilerplate, HTML artifacts, bad control
chars, non-target scripts, language identification via FastText,
uppercase / numeric ratios, Google 1T unigram log-probability,
repeated-uncommon-word, minimum token count, strict
allow-listed character set). Rows are annotated with a boolean
`drop` flag; all rows with `drop = True` are removed here.
2. **MD5 deduplication** — each row is hashed with MD5 over
`query + " " + document`. Every row after the first occurrence of a
hash gets a `duplicate` value holding the index of the canonical
(first) row. All rows with `duplicate IS NOT NULL` are removed here.
3. **Cross-encoder relevance scoring** — every remaining
query–document pair is scored by
[`mxbai-rerank-large-v2`](https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v2)
into the `similarity` column.
4. **Similarity threshold** — for every subset **except `fw_edu`** and
the two Atlas HLP Wikipedia splits (see the special cases below),
we keep pairs with `similarity >= 3.0`.
5. **Self-pair removal** — the residual case
`query == document` (identical strings) is removed. This catches
self-pair rows that MD5 dedup does not flag (dedup only finds
cross-row collisions, not rows whose query equals their own
document).
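The MD5 deduplication step in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's actual code: strings are assumed to be UTF-8 encoded before hashing, rows are assumed to be processed in order, and the function names `pair_hash` and `annotate_duplicates` are hypothetical.

```python
import hashlib


def pair_hash(query, document):
    """MD5 hex digest of `query + " " + document` (UTF-8 is an assumption)."""
    return hashlib.md5((query + " " + document).encode("utf-8")).hexdigest()


def annotate_duplicates(rows):
    """Return copies of the rows with an added `duplicate` field.

    The first row seen for a hash keeps `duplicate = None`. Every later row
    with the same hash gets the `index` of that first (canonical) row.
    """
    first_seen = {}  # hash -> index of the canonical row
    annotated = []
    for row in rows:
        key = pair_hash(row["query"], row["document"])
        if key in first_seen:
            duplicate = first_seen[key]
        else:
            first_seen[key] = row["index"]
            duplicate = None
        annotated.append({**row, "duplicate": duplicate})
    return annotated
```

Because the hash is taken over a plain concatenation, the pairs `("a", "b c")` and `("a b", "c")` produce the same digest in this sketch. A row whose query equals its own document is not flagged by this step unless another row repeats it, which is why the separate self-pair removal exists.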
The SQL-equivalent filter applied to every standard subset is:
```sql
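-- "index" and "drop" are quoted because they are reserved words in many SQL dialects.
-- The parent dataset name stands in for a table name.
SELECT "index", query, document, similarity
FROM "lightonai/embeddings-pre-training"
WHERE NOT "drop"
  AND duplicate IS NULL
  AND similarity >= 3.0
  AND query <> document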
```
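For readers who work with the parent dataset in Python rather than SQL, the same condition can be written as a row predicate. This is a sketch that assumes each parent row is a dict with the `drop` and `duplicate` annotations described earlier; `keep_row` and `curate` are hypothetical names.

```python
def keep_row(row, min_similarity=3.0):
    """Python equivalent of the WHERE clause above for one annotated row."""
    return (
        not row["drop"]
        and row["duplicate"] is None
        and row["similarity"] >= min_similarity
        and row["query"] != row["document"]
    )


def curate(rows, min_similarity=3.0):
    """Keep the four curated columns of every row that passes `keep_row`."""
    columns = ("index", "query", "document", "similarity")
    return [
        {name: row[name] for name in columns}
        for row in rows
        if keep_row(row, min_similarity)
    ]
```

The default `min_similarity=3.0` matches the standard subsets only. It does not reproduce the `fw_edu` or Atlas HLP Wikipedia handling described in the special cases.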
### Per-source retention (sampled)
Retention ratios measured on one parquet shard per subset:
| Subset | Raw rows | Kept | Retention |
|---|---:|---:|---:|
| `agnews` | 1 157 745 | 564 258 | 48.7 % |
| `altlex` | 110 708 | 83 053 | 75.0 % |
| `amazon_qa` | 1 095 290 | 761 984 | 69.6 % |
| `biorxiv_title_abstract` | 283 550 | 275 247 | 97.1 % |
| `arxiv_title_abstract` (shard 0/5) | 399 898 | 372 315 | 93.1 % |
The wide retention spread reflects intrinsic source quality rather
than filter aggressiveness: curated scientific abstracts lose only
about 3-7 % of their rows, while noisier web-crawled news (`agnews`) loses about half.
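The retention column is simply kept rows divided by raw rows. As a quick check, the sketch below recomputes it from the sampled counts in the table; the names `SAMPLED` and `retention_percent` are illustrative only.

```python
# (raw rows, kept rows) copied from the sampled retention table above.
SAMPLED = {
    "agnews": (1_157_745, 564_258),
    "altlex": (110_708, 83_053),
    "amazon_qa": (1_095_290, 761_984),
    "biorxiv_title_abstract": (283_550, 275_247),
    "arxiv_title_abstract": (399_898, 372_315),
}


def retention_percent(raw_rows, kept_rows):
    """Share of rows kept, as a percentage rounded to one decimal place."""
    return round(100.0 * kept_rows / raw_rows, 1)


RETENTION = {
    name: retention_percent(raw, kept) for name, (raw, kept) in SAMPLED.items()
}

for name, percent in RETENTION.items():
    print(f"{name}: {percent} %")
```

These figures describe one shard per subset, so they are estimates rather than exact whole-subset retention.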
### Special case: `fw_edu` (FineWeb-Edu)
Because `fw_edu` is produced by an upstream
retrieval-common-crawl pipeline (see
[`orionweller/contrastive-pretraining`](https://huggingface.co/datasets/orionweller/contrastive-pretraining))
that already cleans and deduplicates at the page level, applying our
surface-rule filter or a second MD5 dedup would be both redundant and
prohibitively expensive at ~400 M rows. For `fw_edu` only:
- **No rule-based filter applied** (`drop = False` on every row).
- **No MD5 dedup applied** (`duplicate = NULL` on every row).
- **Cross-encoder-only curation.** Instead of the `similarity >= 3.0`
floor used on every other subset, we keep the **top ~34 % of pairs
per shard** by cross-encoder similarity. The effective absolute
similarity floor varies between ~10.6 and ~11.1 across shards
(versus `3.0` for every other subset).
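Keeping a top fraction per shard means the absolute cutoff is derived from each shard's own score distribution, which is why the effective floor differs between shards. The sketch below shows one way to compute such a cutoff. The exact selection rule used for `fw_edu` is not documented here, so the rounding (floor of `fraction * n`, at least one row) and the tie handling are assumptions, and the function names are hypothetical.

```python
import math


def top_fraction_floor(similarities, fraction=0.34):
    """Lowest score that still falls inside the top `fraction` of a shard."""
    if not similarities:
        raise ValueError("shard has no rows")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(similarities, reverse=True)
    keep = max(1, math.floor(fraction * len(ranked)))  # assumed rounding rule
    return ranked[keep - 1]


def keep_top_fraction(rows, fraction=0.34):
    """Keep rows whose similarity reaches the shard-specific floor.

    Rows tied with the floor are all kept, so slightly more than
    `fraction` of the shard can survive.
    """
    floor = top_fraction_floor([row["similarity"] for row in rows], fraction)
    return [row for row in rows if row["similarity"] >= floor]
```

Applied shard by shard, a rule like this yields a different `similarity` cutoff for each shard, consistent with the ~10.6 to ~11.1 range reported above.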
### Special case: the Atlas HLP Wikipedia splits
The two subsets `wikipedia_hlp_cm` and `wikipedia_hlp_dl` (10 M rows
each, from `facebookresearch/atlas`) are **passed through
untouched**: no rule-based filter, no MD5 dedup, and no cross-encoder
scoring (their `similarity` column is a placeholder zero, preserved
only for schema consistency). These are the Atlas paragraph-linking
pairs as published.
---
## Subsets
**34 subsets** | **1 235 files** | **~517 GB** total
### News & Media
| Subset | Files | Size | Original Source |
|:-------|------:|-----:|:---------------|
| `agnews` | 1 | 0.10 GB | [sentence-transformers/agnews](https://huggingface.co/datasets/sentence-transformers/agnews) |
| `cc_news_en` | 2 | 0.41 GB | [nomic-ai/nomic-embed-unsupervised-data](https://huggingface.co/datasets/nomic-ai/nomic-embed-unsupervised-data) |
| `cnn_dailymail` | 3 | 0.68 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| `npr` | 3 | 0.53 GB | [sentence-transformers/npr](https://huggingface.co/datasets/sentence-transformers/npr) |
### Scientific & Academic
| Subset | Files | Size | Original Source |
|:-------|------:|-----:|:---------------|
| `arxiv_title_abstract` | 5 | 1.11 GB | [UniverseTBD/arxiv-abstracts-large](https://huggingface.co/datasets/UniverseTBD/arxiv-abstracts-large) |
| `biorxiv_title_abstract` | 1 | 0.26 GB | [laion/biorXiv_metadata](https://huggingface.co/datasets/laion/biorXiv_metadata) |
| `medrxiv_title_abstract` | 1 | 0.18 GB | [mteb/raw_medrxiv](https://huggingface.co/datasets/mteb/raw_medrxiv) |
| `s2orc_abstract_citation` | 185 | 34.36 GB | [sentence-transformers/s2orc](https://huggingface.co/datasets/sentence-transformers/s2orc) |
| `s2orc_citation_titles` | 20 | 3.47 GB | [sentence-transformers/s2orc](https://huggingface.co/datasets/sentence-transformers/s2orc) |
| `s2orc_title_abstract` | 63 | 15.94 GB | [sentence-transformers/s2orc](https://huggingface.co/datasets/sentence-transformers/s2orc) |
### QA & Information Retrieval
| Subset | Files | Size | Original Source |
|:-------|------:|-----:|:---------------|
| `amazon_qa` | 1 | 0.15 GB | [nomic-ai/nomic-embed-unsupervised-data](https://huggingface.co/datasets/nomic-ai/nomic-embed-unsupervised-data) |
| `gooaq_qa` | 2 | 0.50 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| `msmarco` | 3 | 0.91 GB | [microsoft/ms_marco](https://huggingface.co/datasets/microsoft/ms_marco) |
| `paq` | 75 | 21.95 GB | [sentence-transformers/paq](https://huggingface.co/datasets/sentence-transformers/paq) |
| `quora` | 1 | < 0.01 GB | [nomic-ai/nomic-embed-unsupervised-data](https://huggingface.co/datasets/nomic-ai/nomic-embed-unsupervised-data) |
| `yahoo_answer` | 1 | 0.27 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| `yahoo_qa` | 2 | 0.28 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| `yahoo_question_body` | 1 | 0.10 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
### Reviews & Commerce
| Subset | Files | Size | Original Source |
|:-------|------:|-----:|:---------------|
| `amazon_reviews` | 33 | 8.59 GB | [sentence-transformers/amazon-reviews](https://huggingface.co/datasets/sentence-transformers/amazon-reviews) |
### Social & Forum
| Subset | Files | Size | Original Source |
|:-------|------:|-----:|:---------------|
| `reddit` | 188 | 36.86 GB | [sentence-transformers/reddit](https://huggingface.co/datasets/sentence-transformers/reddit) |
| `reddit_body_comment` | 45 | 11.84 GB | [HuggingFaceGECLM/REDDIT_submissions](https://huggingface.co/datasets/HuggingFaceGECLM/REDDIT_submissions) |
| `stackexchange_body_body` | 1 | 0.04 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| `stackexchange_duplicate_questions` | 1 | 0.01 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| `stackexchange_qa` | 10 | 2.18 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| `stackexchange_title_body` | 10 | 2.24 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| `stackoverflow_title_body` | 42 | 7.54 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
### Encyclopedia & Reference
| Subset | Files | Size | Original Source |
|:-------|------:|-----:|:---------------|
| `beir_dbpedia` | 4 | 0.49 GB | [BeIR/dbpedia-entity](https://huggingface.co/datasets/BeIR/dbpedia-entity) |
| `wikianswers` | 41 | 0.75 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| `wikihow` | 1 | 0.02 GB | [sentence-transformers/embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data) |
| `wikipedia_hlp_cm` | 1 | 4.73 GB | [facebookresearch/atlas](https://github.com/facebookresearch/atlas) |
| `wikipedia_hlp_dl` | 1 | 4.81 GB | [facebookresearch/atlas](https://github.com/facebookresearch/atlas) |
### NLP & Paraphrase
| Subset | Files | Size | Original Source |
|:-------|------:|-----:|:---------------|
| `altlex` | 1 | 0.02 GB | [sentence-transformers/altlex](https://huggingface.co/datasets/sentence-transformers/altlex) |
| `mtp` | 367 | 98.26 GB | [mGTE paper](https://arxiv.org/abs/2407.19669) (Massive Text Pairs) |
### Web & Education
| Subset | Files | Size | Original Source |
|:-------|------:|-----:|:---------------|
| `fw_edu` | 119 | 257.04 GB | [orionweller/contrastive-pretraining](https://huggingface.co/datasets/orionweller/contrastive-pretraining) (derived from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)) |
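When planning storage or sampling weights, it can help to see how the total splits across subsets. The sketch below copies the per-subset sizes from the tables above and sums them. `quora` is listed as "< 0.01 GB" and is counted at its upper bound of 0.01 GB here, which is an assumption; the variable names are illustrative.

```python
# Sizes in GB, copied from the subset tables above.
SIZES_GB = {
    "agnews": 0.10, "cc_news_en": 0.41, "cnn_dailymail": 0.68, "npr": 0.53,
    "arxiv_title_abstract": 1.11, "biorxiv_title_abstract": 0.26,
    "medrxiv_title_abstract": 0.18, "s2orc_abstract_citation": 34.36,
    "s2orc_citation_titles": 3.47, "s2orc_title_abstract": 15.94,
    "amazon_qa": 0.15, "gooaq_qa": 0.50, "msmarco": 0.91, "paq": 21.95,
    "quora": 0.01,  # listed as "< 0.01 GB"; upper bound used (assumption)
    "yahoo_answer": 0.27, "yahoo_qa": 0.28, "yahoo_question_body": 0.10,
    "amazon_reviews": 8.59,
    "reddit": 36.86, "reddit_body_comment": 11.84,
    "stackexchange_body_body": 0.04, "stackexchange_duplicate_questions": 0.01,
    "stackexchange_qa": 2.18, "stackexchange_title_body": 2.24,
    "stackoverflow_title_body": 7.54,
    "beir_dbpedia": 0.49, "wikianswers": 0.75, "wikihow": 0.02,
    "wikipedia_hlp_cm": 4.73, "wikipedia_hlp_dl": 4.81,
    "altlex": 0.02, "mtp": 98.26,
    "fw_edu": 257.04,
}

TOTAL_GB = sum(SIZES_GB.values())
SHARES = {name: size / TOTAL_GB for name, size in SIZES_GB.items()}

# Print the five largest subsets with their share of the total size.
for name in sorted(SHARES, key=SHARES.get, reverse=True)[:5]:
    print(f"{name}: {SIZES_GB[name]:.2f} GB ({100 * SHARES[name]:.1f} %)")
print(f"total: {TOTAL_GB:.2f} GB")
```

The sum agrees with the ~517 GB headline figure after rounding, and `fw_edu` alone accounts for roughly half of it.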
---
## Citation
If you use this dataset, please cite both works:
```bibtex
@misc{sourty2025denseonlateon,
title={DenseOn with LateOn: Open State-of-the-Art Single and Multi-Vector Models},
author={Sourty, Raphael and Chaffin, Antoine and Weller, Orion and Demoura, Paulo and Chatelain, Amélie},
year={2026},
howpublished={\url{https://huggingface.co/blog/lightonai/denseon-lateon}},
}
@article{zhang2024mgte,
title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and Zhang, Meishan and Li, Wenjie and Zhang, Min},
journal={arXiv preprint arXiv:2407.19669},
year={2024}
}
```
Provider:
lightonai



