KvaytG/en-ru-parallel-20m
Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/KvaytG/en-ru-parallel-20m
Resource description:
---
license: other
language:
- en
- ru
tags:
- translation
- machine-translation
- parallel-corpus
- nlp
- en-ru
size_categories:
- 10M<n<100M
dataset_info:
  features:
  - name: english
    dtype: string
  - name: russian
    dtype: string
  - name: score
    dtype: float32
  splits:
  - name: train
    num_examples: 20000000
---
# en-ru-parallel-20m
The **20 million** highest-scoring English-Russian parallel sentence pairs drawn from OPUS, ranked by LaBSE similarity.
## Dataset Description
This dataset contains **20,000,000** carefully filtered English-Russian parallel sentence pairs. It was created specifically for machine translation, multilingual embedding training, model fine-tuning, and any other NLP tasks that require a large high-quality en-ru parallel corpus.
## Dataset Summary
The corpus was built from **ALL** English-Russian datasets available on [OPUS](https://opus.nlpl.eu/corpora-search/en&ru) as of **March 28, 2026**.
A multi-stage cleaning and ranking pipeline was applied:
1. **Heuristic filtering** using the utilities from [en-ru-corpus-utils](https://github.com/KvaytG/en-ru-corpus-utils).
2. **Deduplication** with `removedup`.
3. **Quality ranking** using LaBSE cosine similarity.

To process the massive volume efficiently, LaBSE embeddings were computed via **model2vec + PCA (pca_dims=300)**.
Only the **top 20 million** pairs by similarity score were retained.
The dataset is **sorted in descending order by LaBSE score** (highest quality first).
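
The ranking step above can be illustrated with a minimal sketch. It is not the author's pipeline: the vectors are hypothetical toy values standing in for sentence embeddings, and the helper names `cosine_similarity` and `rank_and_keep` are invented for this example. It only shows how pairs are scored by cosine similarity, sorted in descending order, and cut to the top `k`.

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def rank_and_keep(
    pairs: List[Tuple[str, str]],
    src_vecs: List[Sequence[float]],
    tgt_vecs: List[Sequence[float]],
    k: int,
) -> List[Tuple[str, str, float]]:
    """Score each (english, russian) pair, sort by score descending, keep the top k."""
    scored = [
        (en, ru, cosine_similarity(u, v))
        for (en, ru), u, v in zip(pairs, src_vecs, tgt_vecs)
    ]
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:k]

# Toy data: the 2-d vectors are made up, not real LaBSE or model2vec output.
pairs = [("Hello", "Привет"), ("Good morning", "Спасибо"), ("Cat", "Кошка")]
english_vecs = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
russian_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

kept = rank_and_keep(pairs, english_vecs, russian_vecs, k=2)
for en, ru, score in kept:
    print(f"{score:.4f}\t{en}\t{ru}")
```

The misaligned pair ("Good morning" / "Спасибо") gets a score of 0 here and falls outside the top 2, which is the effect the similarity cutoff is meant to have at scale.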
## Languages
- **English** (`en`)
- **Russian** (`ru`)
## Data Fields
| Column | Type | Description |
|-----------|---------|----------------------------------------------------------------------------------------------------------------------|
| `english` | string | English sentence |
| `russian` | string | Russian sentence |
| `score` | float32 | LaBSE cosine similarity score (higher = better alignment). The dataset is sorted by this column in descending order. |
## Data Splits
| Split | Number of examples |
|---------|--------------------|
| `train` | 20,000,000 |

No predefined validation or test splits are provided; you can create them yourself.
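
Because the rows are sorted by score, slicing off a contiguous block for validation would give a split with a different quality profile from the rest. One possible approach, sketched below, assigns each pair to a split from a hash of its text, so the assignment is deterministic and independent of row order. The function name `assign_split` and the 1% default sizes are assumptions made for this example, not part of the dataset.

```python
import hashlib

def assign_split(english: str, russian: str,
                 valid_pct: int = 1, test_pct: int = 1) -> str:
    """Deterministically map a sentence pair to 'train', 'validation' or 'test'.

    The pair text is hashed into one of 100 buckets; the lowest buckets go to
    test, the next ones to validation, and everything else to train.
    """
    key = f"{english}\t{russian}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "validation"
    return "train"

# Possible use with a loaded `dataset` (not executed here):
# validation = dataset.filter(
#     lambda row: assign_split(row["english"], row["russian"]) == "validation"
# )
print(assign_split("Hello", "Привет"))
```

The `datasets` library also offers `Dataset.train_test_split`, which shuffles by default and may be simpler when a single held-out set is enough.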
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("KvaytG/en-ru-parallel-20m", split="train")
```
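
Since the rows are sorted by `score` in descending order, a higher-scoring subset is simply a prefix of the dataset. The sketch below reads rows until the score drops below a threshold and then stops. The helper `rows_above` and the threshold `0.85` are hypothetical choices for illustration, and the toy rows stand in for real data.

```python
from itertools import takewhile
from typing import Dict, Iterable, Iterator

def rows_above(rows: Iterable[Dict], min_score: float) -> Iterator[Dict]:
    """Yield rows until the first one with score < min_score.

    Correct only if `rows` is sorted by score in descending order.
    """
    return takewhile(lambda row: row["score"] >= min_score, rows)

# Toy rows with made-up scores, in the same column layout as the dataset.
toy_rows = [
    {"english": "Hello", "russian": "Привет", "score": 0.95},
    {"english": "Thank you", "russian": "Спасибо", "score": 0.90},
    {"english": "Cat", "russian": "Кошка", "score": 0.80},
]
selected = list(rows_above(toy_rows, min_score=0.85))
print(len(selected))

# With the real dataset, streaming avoids downloading everything first
# (not executed here):
# stream = load_dataset("KvaytG/en-ru-parallel-20m", split="train", streaming=True)
# for row in rows_above(stream, min_score=0.85):
#     ...
```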
## License & Legal Disclaimer
This dataset is an aggregation of multiple corpora sourced from the **OPUS project**.
Because it contains data from **all** available en-ru OPUS sources (as of March 28, 2026), it is a mixed-license collection. The underlying texts retain their original licenses, which vary significantly:
* Some data is Public Domain or permissive (e.g., Europarl, UNPC).
* Some data uses Copyleft licenses (e.g., CC-BY-SA for Wikipedia).
* Some data strictly prohibits commercial use (e.g., CC-BY-NC for TED/QED).
* Some data may be subject to copyright (e.g., OpenSubtitles).

**Therefore, this aggregated dataset is not released under a single permissive license like MIT.** By downloading and using this dataset, you acknowledge that:
1. The author of this dataset does not own the copyright to the underlying texts.
2. The dataset is provided primarily for **research and educational purposes**.
3. You are solely responsible for ensuring that your use of this data (especially in commercial applications) complies with the original licenses of the respective OPUS sub-corpora.
## Citation
```bibtex
@misc{kvaytg_en_ru_parallel_20m,
author = {KvaytG},
title = {20M high-quality English-Russian parallel corpus},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Datasets},
url = {https://huggingface.co/datasets/KvaytG/en-ru-parallel-20m},
note = {Built from all OPUS en-ru corpora (28 Mar 2026) with heuristic cleaning, deduplication and LaBSE ranking via model2vec+PCA}
}
```
Provider:
KvaytG



