KvaytG/en-ru-parallel-20m

Name: KvaytG/en-ru-parallel-20m
Creator: KvaytG
Published: 2026-04-19 15:23:02
License: No description available

Hugging Face, updated 2026-04-19, indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/KvaytG/en-ru-parallel-20m

Resource description:

---
license: other
language:
- en
- ru
tags:
- translation
- machine-translation
- parallel-corpus
- nlp
- en-ru
size_categories:
- 10M<n<100M
dataset_info:
  features:
  - name: english
    dtype: string
  - name: russian
    dtype: string
  - name: score
    dtype: float32
  splits:
  - name: train
    num_examples: 20000000
---

# en-ru-parallel-20m

**20 million** highest-quality English-Russian parallel sentence pairs.

## Dataset Description

This dataset contains **20,000,000** carefully filtered English-Russian parallel sentence pairs. It was created specifically for machine translation, multilingual embedding training, model fine-tuning, and any other NLP tasks that require a large high-quality en-ru parallel corpus.

## Dataset Summary

The corpus was built from **ALL** English-Russian datasets available on [OPUS](https://opus.nlpl.eu/corpora-search/en&ru) as of **March 28, 2026**. A multi-stage cleaning and ranking pipeline was applied:

1. **Heuristic filtering** using the utilities from [en-ru-corpus-utils](https://github.com/KvaytG/en-ru-corpus-utils).
2. **Deduplication** with `removedup`.
3. **Quality ranking** using LaBSE cosine similarity. To process the massive volume efficiently, LaBSE embeddings were computed via **model2vec + PCA (pca_dims=300)**.

Only the **top 20 million** pairs by similarity score were retained. The dataset is **sorted in descending order by LaBSE score** (highest quality first).

## Languages

- **English** (`en`)
- **Russian** (`ru`)

## Data Fields

| Column    | Type    | Description |
|-----------|---------|-------------|
| `english` | string  | English sentence |
| `russian` | string  | Russian sentence |
| `score`   | float32 | LaBSE cosine similarity score (higher = better alignment). The dataset is sorted by this column in descending order. |

## Data Splits

| Split   | Number of examples |
|---------|--------------------|
| `train` | 20,000,000         |

(No predefined validation or test splits; you can easily create them yourself.)

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("KvaytG/en-ru-parallel-20m", split="train")
```

## License & Legal Disclaimer

This dataset is an aggregation of multiple corpora sourced from the **OPUS project**. Because it contains data from **all** available en-ru OPUS sources (as of March 28, 2026), it is a mixed-license collection.

The underlying texts retain their original licenses, which vary significantly:

* Some data is Public Domain or permissive (e.g., Europarl, UNPC).
* Some data uses Copyleft licenses (e.g., CC-BY-SA for Wikipedia).
* Some data strictly prohibits commercial use (e.g., CC-BY-NC for TED/QED).
* Some data may be subject to copyright (e.g., OpenSubtitles).

**Therefore, this aggregated dataset is not released under a single permissive license like MIT.**

By downloading and using this dataset, you acknowledge that:

1. The author of this dataset does not own the copyright to the underlying texts.
2. The dataset is provided primarily for **research and educational purposes**.
3. You are solely responsible for ensuring that your use of this data (especially in commercial applications) complies with the original licenses of the respective OPUS sub-corpora.

## Citation

```bibtex
@misc{kvaytg_en_ru_parallel_20m,
  author = {KvaytG},
  title = {20M high-quality English-Russian parallel corpus},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Datasets},
  url = {https://huggingface.co/datasets/KvaytG/en-ru-parallel-20m},
  note = {Built from all OPUS en-ru corpora (28 Mar 2026) with heuristic cleaning, deduplication and LaBSE ranking via model2vec+PCA}
}
```
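
## Illustrative sketch: deduplicate, score, keep top-k

The Dataset Summary describes deduplication followed by ranking on cosine similarity and keeping only the highest-scoring pairs. The snippet below is a minimal pure-Python sketch of that idea, not the author's actual pipeline: the real corpus was scored with LaBSE embeddings via model2vec + PCA, whereas here the function names (`cosine`, `rank_pairs`) and the 2-dimensional toy vectors are hypothetical stand-ins chosen so the example runs without any model download.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_pairs(
    pairs: List[Tuple[str, str]],
    en_vecs: List[Sequence[float]],
    ru_vecs: List[Sequence[float]],
    top_k: int,
) -> List[Tuple[str, str, float]]:
    """Drop exact duplicate pairs, score the rest, and keep the top_k by score."""
    seen = set()
    scored = []
    for pair, en_vec, ru_vec in zip(pairs, en_vecs, ru_vecs):
        if pair in seen:  # exact-match deduplication only (an assumption)
            continue
        seen.add(pair)
        scored.append((pair[0], pair[1], cosine(en_vec, ru_vec)))
    # Sort in descending order of score, mirroring the dataset's ordering.
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:top_k]


# Toy data: hypothetical vectors standing in for real sentence embeddings.
pairs = [
    ("Hello", "Привет"),
    ("Cat", "Собака"),
    ("Hello", "Привет"),  # duplicate, removed before scoring
    ("Thanks", "Спасибо"),
]
en_vecs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
ru_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

top = rank_pairs(pairs, en_vecs, ru_vecs, top_k=2)
for english, russian, score in top:
    print(f"{score:.3f}\t{english}\t{russian}")
```

Because the published rows are already sorted by `score`, the same descending order means that taking the first rows of the `train` split yields the best-aligned pairs, while a held-out split taken from the head or tail alone would not be representative of the whole corpus.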

Provider:

KvaytG