KvaytG/en-ru-parallel-10m
Source: Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/KvaytG/en-ru-parallel-10m
Resource description:
---
license: apache-2.0
language:
- en
- ru
tags:
- translation
- machine-translation
- parallel-corpus
- nlp
- en-ru
size_categories:
- 10M<n<100M
dataset_info:
  features:
  - name: text_en
    dtype: string
  - name: text_ru
    dtype: string
  - name: score
    dtype: float32
  splits:
  - name: train
    num_examples: 10000000
---
# en-ru-parallel-10m
**10 million** English-Russian parallel sentence pairs, filtered and ranked using LaBSE (Model2Vec).
## Dataset Description
This dataset contains **10,000,000** English-Russian parallel sentence pairs. Unlike datasets that only focus on the absolute top-scoring pairs (which often contain very short or repetitive strings), this corpus captures the **central distribution** of a massive crawled collection, ensuring a balance between high semantic alignment and linguistic diversity.
## Dataset Sources
| Source | Link | License |
|:-----------------|:-----------------------------------------------------------------------------|:------------------------------|
| **UNPC** | [OPUS - UNPC](https://opus.nlpl.eu/datasets/UNPC?pair=en&ru) | UN (Public Domain/Permission) |
| **MultiUN** | [OPUS - MultiUN](https://opus.nlpl.eu/datasets/MultiUN?pair=en&ru) | UN (Public Domain/Permission) |
| **GlobalVoices** | [OPUS - GlobalVoices](https://opus.nlpl.eu/datasets/GlobalVoices?pair=en&ru) | CC-BY |
| **Tatoeba** | [Tatoeba Downloads](https://tatoeba.org/en/downloads) | CC-BY 2.0 FR |
| **bible-uedin** | [OPUS - bible-uedin](https://opus.nlpl.eu/datasets/bible-uedin?pair=en&ru) | CC0 |
## Dataset Summary
The dataset was derived from a large-scale raw parallel corpus through the following pipeline:
1. **Heuristic Cleaning**: Initial filtering of noisy data, fixing encoding issues, and removing clearly mismatched pairs.
2. **Deduplication**: Removing identical pairs to ensure training diversity.
3. **Semantic Scoring**:
   * Utilized **LaBSE** (Language-Agnostic BERT Sentence Embedding).
   * To achieve high performance, we used the **Model2Vec** version of LaBSE (`labse_m2v_300`), which utilizes static embeddings and PCA (300 dims) for extremely fast inference without significant loss in alignment quality.
   * Cosine similarity scores were calculated for every pair.
4. **Distribution-based Filtering**:
   * The entire corpus was sorted by the LaBSE score.
   * Instead of taking the top 10M (which often over-represents identical "perfect" matches or very short sentences), we **trimmed the top and bottom tails** of the distribution.
   * The final 10,000,000 pairs represent the **middle slice**, providing stable, high-quality translations ranging from approximately **0.8648** to **0.5197** similarity scores.
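The deduplication, scoring, and middle-slice steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the author's actual pipeline: the helper names (`deduplicate`, `cosine_similarity`, `middle_slice`) and the `drop_top` parameter are hypothetical, the embeddings are assumed to be precomputed by the sentence-embedding model, and the card does not state how many rows were trimmed from each tail.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Pair = Tuple[str, str]                 # (text_en, text_ru)
ScoredPair = Tuple[str, str, float]    # (text_en, text_ru, score)


def deduplicate(pairs: Iterable[Pair]) -> List[Pair]:
    """Drop exact duplicate pairs, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def middle_slice(scored: Iterable[ScoredPair], keep: int, drop_top: int) -> List[ScoredPair]:
    """Sort by score (descending), skip the `drop_top` best rows, keep the next `keep`.

    Everything ranked below the kept window is the trimmed bottom tail.
    `drop_top` is an assumed knob; the real cut-offs are not given in the card.
    """
    ranked = sorted(scored, key=lambda row: row[2], reverse=True)
    return ranked[drop_top:drop_top + keep]


# Toy data: five scored pairs, keep the middle three.
rows = [
    ("a", "а", 0.99),
    ("b", "б", 0.80),
    ("c", "в", 0.70),
    ("d", "г", 0.60),
    ("e", "д", 0.20),
]
selected = middle_slice(rows, keep=3, drop_top=1)
```

On the toy data, the highest-scoring and lowest-scoring rows are discarded and the three rows in between are kept, which mirrors the "trim both tails" idea at a much smaller scale.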
## Languages
- **English** (`en`)
- **Russian** (`ru`)
## Data Fields
| Column | Type | Description |
|-----------|---------|---------------------------------------------------------------|
| `text_en` | string | English sentence. |
| `text_ru` | string | Russian sentence. |
| `score` | float32 | LaBSE cosine similarity score. Truncated to 4 decimal places. |
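To make "truncated" concrete: truncation drops the digits past the fourth decimal place instead of rounding them. The sketch below is only an illustration of that distinction, assuming truncation toward zero; `truncate_score` is a hypothetical helper, and the card does not say how the truncation was actually implemented.

```python
from decimal import Decimal, ROUND_DOWN


def truncate_score(score: float, places: int = 4) -> Decimal:
    """Cut a score off after `places` decimal digits without rounding up."""
    quantum = Decimal(10) ** -places  # Decimal("0.0001") when places=4
    # Going through str() avoids carrying binary floating-point noise into Decimal.
    return Decimal(str(score)).quantize(quantum, rounding=ROUND_DOWN)


truncated = truncate_score(0.86489)
```

Under this assumption, a raw similarity of 0.86489 is stored as 0.8648, whereas rounding would have produced 0.8649.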
## Data Splits
| Split | Number of examples |
|---------|--------------------|
| `train` | 10,000,000 |
## Usage
```python
from datasets import load_dataset

# Load the single "train" split (downloaded on first use, then cached locally).
dataset = load_dataset("KvaytG/en-ru-parallel-10m", split="train")
```
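Since every row carries a `score`, one possible way to work with a stricter subset without downloading everything first is to stream the split and filter on that column. The sketch below is an assumption-laden example: the `0.7` threshold is arbitrary, and `is_high_confidence` and `stream_filtered` are hypothetical helper names, not part of the dataset or the `datasets` library.

```python
from typing import Any, Dict


def is_high_confidence(row: Dict[str, Any], min_score: float = 0.7) -> bool:
    """Return True if the row's similarity score meets the threshold."""
    return row["score"] >= min_score


def stream_filtered(min_score: float = 0.7):
    """Stream the train split lazily and keep only rows at or above `min_score`."""
    # Imported here so the predicate above can be used without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(
        "KvaytG/en-ru-parallel-10m", split="train", streaming=True
    )
    return stream.filter(lambda row: is_high_confidence(row, min_score))


# Example (requires network access):
#     for row in stream_filtered(0.75).take(3):
#         print(row["score"], row["text_en"], "|", row["text_ru"])
```

Streaming returns an iterable dataset, so rows are fetched and filtered on the fly; a suitable threshold depends on the downstream task and should be chosen empirically.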
## License
This dataset is released under the **Apache License 2.0**.
## Citation
```bibtex
@misc{kvaytg_en_ru_parallel_10m,
author = {KvaytG},
title = {10M Balanced English-Russian Parallel Corpus (Middle Slice)},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Datasets},
url = {https://huggingface.co/datasets/KvaytG/en-ru-parallel-10m},
note = {10M parallel pairs representing the middle distribution of OPUS en-ru data. Filtered via model2vec LaBSE (300 dims) and ranked by cosine similarity.}
}
```
Provided by:
KvaytG



