
KvaytG/en-ru-parallel-20m

Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/KvaytG/en-ru-parallel-20m
Resource description:
---
license: other
language:
- en
- ru
tags:
- translation
- machine-translation
- parallel-corpus
- nlp
- en-ru
size_categories:
- 10M<n<100M
dataset_info:
  features:
  - name: english
    dtype: string
  - name: russian
    dtype: string
  - name: score
    dtype: float32
  splits:
  - name: train
    num_examples: 20000000
---

# en-ru-parallel-20m

**20 million** highest-quality English-Russian parallel sentence pairs.

## Dataset Description

This dataset contains **20,000,000** carefully filtered English-Russian parallel sentence pairs. It was created specifically for machine translation, multilingual embedding training, model fine-tuning, and any other NLP tasks that require a large high-quality en-ru parallel corpus.

## Dataset Summary

The corpus was built from **ALL** English-Russian datasets available on [OPUS](https://opus.nlpl.eu/corpora-search/en&ru) as of **March 28, 2026**. A multi-stage cleaning and ranking pipeline was applied:

1. **Heuristic filtering** using the utilities from [en-ru-corpus-utils](https://github.com/KvaytG/en-ru-corpus-utils).
2. **Deduplication** with `removedup`.
3. **Quality ranking** using LaBSE cosine similarity. To process the massive volume efficiently, LaBSE embeddings were computed via **model2vec + PCA (pca_dims=300)**.

Only the **top 20 million** pairs by similarity score were retained. The dataset is **sorted in descending order by LaBSE score** (highest quality first).

## Languages

- **English** (`en`)
- **Russian** (`ru`)

## Data Fields

| Column    | Type    | Description |
|-----------|---------|-------------|
| `english` | string  | English sentence |
| `russian` | string  | Russian sentence |
| `score`   | float32 | LaBSE cosine similarity score (higher = better alignment). The dataset is sorted by this column in descending order. |

## Data Splits

| Split   | Number of examples |
|---------|--------------------|
| `train` | 20,000,000         |

(No predefined validation or test splits; you can easily create them yourself.)

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("KvaytG/en-ru-parallel-20m", split="train")
```

## License & Legal Disclaimer

This dataset is an aggregation of multiple corpora sourced from the **OPUS project**. Because it contains data from **all** available en-ru OPUS sources (as of March 28, 2026), it is a mixed-license collection.

The underlying texts retain their original licenses, which vary significantly:

* Some data is Public Domain or permissive (e.g., Europarl, UNPC).
* Some data uses Copyleft licenses (e.g., CC-BY-SA for Wikipedia).
* Some data strictly prohibits commercial use (e.g., CC-BY-NC for TED/QED).
* Some data may be subject to copyright (e.g., OpenSubtitles).

**Therefore, this aggregated dataset is not released under a single permissive license like MIT.**

By downloading and using this dataset, you acknowledge that:

1. The author of this dataset does not own the copyright to the underlying texts.
2. The dataset is provided primarily for **research and educational purposes**.
3. You are solely responsible for ensuring that your use of this data (especially in commercial applications) complies with the original licenses of the respective OPUS sub-corpora.

## Citation

```bibtex
@misc{kvaytg_en_ru_parallel_20m,
  author    = {KvaytG},
  title     = {20M high-quality English-Russian parallel corpus},
  year      = {2026},
  publisher = {Hugging Face},
  journal   = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/KvaytG/en-ru-parallel-20m},
  note      = {Built from all OPUS en-ru corpora (28 Mar 2026) with heuristic cleaning, deduplication and LaBSE ranking via model2vec+PCA}
}
```
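
The quality-ranking step described in the Dataset Summary (score each pair by cosine similarity, sort in descending order, keep the top-scoring pairs) can be illustrated with a minimal sketch. This is not the author's pipeline: the function names `cosine_similarity` and `rank_pairs` are hypothetical, and the tiny 2-dimensional vectors stand in for the LaBSE-derived embeddings, which the original pipeline computed via model2vec + PCA.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_pairs(pairs, en_vectors, ru_vectors, top_k):
    """Score each (english, russian) pair, sort by score descending, keep the top_k rows."""
    scored = [
        {"english": en, "russian": ru, "score": cosine_similarity(u, v)}
        for (en, ru), u, v in zip(pairs, en_vectors, ru_vectors)
    ]
    scored.sort(key=lambda row: row["score"], reverse=True)
    return scored[:top_k]


# Toy data: the vectors are made up for illustration, not real embeddings.
pairs = [
    ("Hello", "Привет"),
    ("Good morning", "Спасибо"),  # deliberately misaligned pair
    ("Thank you", "Спасибо"),
]
en_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ru_vectors = [[1.0, 0.1], [1.0, 0.0], [1.0, 1.0]]

top = rank_pairs(pairs, en_vectors, ru_vectors, top_k=2)
for row in top:
    print(row["english"], "|", row["russian"], "|", round(row["score"], 3))
```

The rows come back with the same three fields as the dataset (`english`, `russian`, `score`), and the misaligned pair falls outside the top 2. Because the published dataset is already sorted this way, a contiguous slice from its start or end covers only a narrow score range, so shuffling before carving out validation or test splits is usually the safer choice.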
Provided by:
KvaytG