
KvaytG/en-ru-parallel-10m

Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/KvaytG/en-ru-parallel-10m
Resource description:
---
license: apache-2.0
language:
- en
- ru
tags:
- translation
- machine-translation
- parallel-corpus
- nlp
- en-ru
size_categories:
- 10M<n<100M
dataset_info:
  features:
  - name: text_en
    dtype: string
  - name: text_ru
    dtype: string
  - name: score
    dtype: float32
  splits:
  - name: train
    num_examples: 10000000
---

# en-ru-parallel-10m

**10 million** English-Russian parallel sentence pairs, filtered and ranked using LaBSE (Model2Vec).

## Dataset Description

This dataset contains **10,000,000** English-Russian parallel sentence pairs. Unlike datasets that only focus on the absolute top-scoring pairs (which often contain very short or repetitive strings), this corpus captures the **central distribution** of a massive crawled collection, ensuring a balance between high semantic alignment and linguistic diversity.

## Dataset Sources

| Source           | Link                                                                         | License                       |
|:-----------------|:-----------------------------------------------------------------------------|:------------------------------|
| **UNPC**         | [OPUS - UNPC](https://opus.nlpl.eu/datasets/UNPC?pair=en&ru)                 | UN (Public Domain/Permission) |
| **MultiUN**      | [OPUS - MultiUN](https://opus.nlpl.eu/datasets/MultiUN?pair=en&ru)           | UN (Public Domain/Permission) |
| **GlobalVoices** | [OPUS - GlobalVoices](https://opus.nlpl.eu/datasets/GlobalVoices?pair=en&ru) | CC-BY                         |
| **Tatoeba**      | [Tatoeba Downloads](https://tatoeba.org/en/downloads)                        | CC-BY 2.0 FR                  |
| **bible-uedin**  | [OPUS - bible-uedin](https://opus.nlpl.eu/datasets/bible-uedin?pair=en&ru)   | CC0                           |

## Dataset Summary

The dataset was derived from a large-scale raw parallel corpus through the following pipeline:

1. **Heuristic Cleaning**: Initial filtering of noisy data, fixing encoding issues, and removing clearly mismatched pairs.
2. **Deduplication**: Removing identical pairs to ensure training diversity.
3. **Semantic Scoring**:
   * Utilized **LaBSE** (Language-Agnostic BERT Sentence Embedding).
   * To achieve high performance, we used the **Model2Vec** version of LaBSE (`labse_m2v_300`), which utilizes static embeddings and PCA (300 dims) for extremely fast inference without significant loss in alignment quality.
   * Cosine similarity scores were calculated for every pair.
4. **Distribution-based Filtering**:
   * The entire corpus was sorted by the LaBSE score.
   * Instead of taking the top 10M (which often over-represents identical "perfect" matches or very short sentences), we **trimmed the top and bottom tails** of the distribution.
   * The final 10,000,000 pairs represent the **middle slice**, providing stable, high-quality translations ranging from approximately **0.8648** to **0.5197** similarity scores.

## Languages

- **English** (`en`)
- **Russian** (`ru`)

## Data Fields

| Column    | Type    | Description                                                   |
|-----------|---------|---------------------------------------------------------------|
| `text_en` | string  | English sentence.                                             |
| `text_ru` | string  | Russian sentence.                                             |
| `score`   | float32 | LaBSE cosine similarity score. Truncated to 4 decimal places. |

## Data Splits

| Split   | Number of examples |
|---------|--------------------|
| `train` | 10,000,000         |

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("KvaytG/en-ru-parallel-10m", split="train")
```

## License

This dataset is released under the **Apache License 2.0**.

## Citation

```bibtex
@misc{kvaytg_en_ru_parallel_10m,
  author    = {KvaytG},
  title     = {10M Balanced English-Russian Parallel Corpus (Middle Slice)},
  year      = {2026},
  publisher = {Hugging Face},
  journal   = {Hugging Face Datasets},
  url       = {https://huggingface.co/datasets/KvaytG/en-ru-parallel-10m},
  note      = {10M parallel pairs representing the middle distribution of OPUS en-ru data. Filtered via model2vec LaBSE (300 dims) and ranked by cosine similarity.}
}
```
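The scoring and distribution-based filtering steps described in the Dataset Summary can be illustrated with a small sketch. This is not the author's actual pipeline: the helper names `cosine_similarity` and `middle_slice`, the `top_trim` and `keep` parameters, and the toy vectors and scores are all assumptions made for illustration. The card does not state how many pairs were trimmed from each tail, and real embeddings would come from the Model2Vec LaBSE model rather than hand-written vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def middle_slice(scored_pairs, keep, top_trim):
    """Sort pairs by score (highest first), drop the `top_trim` best-scoring
    pairs, keep the next `keep` pairs, and discard everything below them.

    `keep` and `top_trim` are hypothetical parameters for this sketch.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair["score"], reverse=True)
    return ranked[top_trim:top_trim + keep]


# Toy rows in the dataset's column layout (text_en, text_ru, score).
# The scores are made up; they are not taken from the real corpus.
toy_pairs = [
    {"text_en": "Yes.", "text_ru": "Да.", "score": 0.99},
    {"text_en": "No.", "text_ru": "Нет.", "score": 0.95},
    {"text_en": "The committee adopted the report.", "text_ru": "Комитет утвердил доклад.", "score": 0.80},
    {"text_en": "She reads a book every week.", "text_ru": "Она читает книгу каждую неделю.", "score": 0.70},
    {"text_en": "The meeting was postponed.", "text_ru": "Заседание было отложено.", "score": 0.60},
    {"text_en": "See annex II.", "text_ru": "Спасибо за внимание.", "score": 0.30},
    {"text_en": "Page 4", "text_ru": "Добро пожаловать", "score": 0.10},
]

# Drop the 2 highest-scoring pairs, keep the next 3, discard the low tail.
kept = middle_slice(toy_pairs, keep=3, top_trim=2)
for pair in kept:
    print(pair["score"], pair["text_en"], "|", pair["text_ru"])
```

With the real dataset, the same idea could be applied after loading to narrow the score range further, for example with `dataset.filter(lambda row: row["score"] >= 0.7)`; the 0.7 threshold here is an arbitrary example, not a recommendation from the card.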
Provider:
KvaytG