aimamba/WikiMatrix-en-lv

Name: aimamba/WikiMatrix-en-lv
Creator: aimamba
Published: 2026-04-13 13:19:53
License: not stated on this listing (the dataset card below specifies CC-BY-SA 4.0)

Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/aimamba/WikiMatrix-en-lv

Resource description:

---
language:
- en
- lv
license: cc-by-sa-4.0
task_categories:
- translation
tags:
- parallel-corpus
- wikipedia
- wikimatrix
- en-lv
- latvian
- machine-translation
- labse
- faiss
size_categories:
- 100K<n<1M
source_datasets:
- wikipedia
pretty_name: WikiMatrix EN-LV
dataset_info:
  features:
  - name: en
    dtype: string
  - name: lv
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: train
    num_examples: 537732
  - name: validation
    num_examples: 29874
  - name: test
    num_examples: 29874
---

# WikiMatrix EN-LV

## Dataset Description

**597,480** English-Latvian parallel sentence pairs mined from Wikipedia using the WikiMatrix methodology.

### Method

1. **Source**: English and Latvian Wikipedia article dumps (April 2026)
2. **Embeddings**: [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) (Language-agnostic BERT Sentence Embeddings)
3. **Retrieval**: FAISS nearest-neighbor search with `IndexFlatIP` (a flat inner-product index, which performs exact rather than approximate search)
4. **Scoring**: Margin-based scoring: `margin(x, y) = cos(x, y) / [(Σ cos(x, nn_y) + Σ cos(y, nn_x)) / (2k)]`
5. **Filtering**: Pairs with margin score ≥ 1.04 retained

### Columns

| Column | Type | Description |
|--------|------|-------------|
| `en` | string | English sentence |
| `lv` | string | Latvian sentence |
| `score` | float64 | Margin similarity score (higher = more confident alignment) |

### Splits

| Split | Examples |
|-------|----------|
| train | 537,732 |
| validation | 29,874 |
| test | 29,874 |

### Quality Thresholds

The `score` column can be used to filter for higher-quality pairs:

| Threshold | Approx. Pairs | Quality |
|-----------|---------------|---------|
| ≥ 1.04 | 597,480 | All pairs (full dataset) |
| ≥ 1.06 | ~500,000 | **Recommended for training** |
| ≥ 1.10 | ~350,000 | High confidence |
| ≥ 1.20 | ~150,000 | Very high confidence |

```python
from datasets import load_dataset

ds = load_dataset("aimamba/WikiMatrix-en-lv")

# Filter the train split for high-quality pairs
high_quality = ds["train"].filter(lambda x: x["score"] >= 1.06)
print(f"High-quality pairs: {len(high_quality)}")
```

### License

CC-BY-SA 4.0 (inherited from Wikipedia)

### Citation

```bibtex
@misc{wikimatrix-en-lv-2026,
  title={WikiMatrix EN-LV: English-Latvian Parallel Corpus from Wikipedia},
  author={aimamba},
  year={2026},
  howpublished={\url{https://huggingface.co/datasets/aimamba/WikiMatrix-en-lv}},
  note={597,480 sentence pairs mined using LaBSE + FAISS}
}
```

### Acknowledgments

- [LaBSE](https://arxiv.org/abs/2007.01852): Feng et al., 2022
- [WikiMatrix](https://arxiv.org/abs/1907.05791): Schwenk et al., 2019 (methodology inspiration)
- [FAISS](https://github.com/facebookresearch/faiss): Facebook AI Research
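
### Illustrative sketch: margin scoring

The margin formula in the Method section can be hard to read in inline form, so here is a minimal NumPy sketch of it on toy vectors. This is not the author's pipeline: the hand-written 3-dimensional vectors stand in for LaBSE embeddings, a dense matrix product stands in for the FAISS search, and the names `l2_normalize`, `margin_scores`, `K`, and `THRESHOLD` are hypothetical. The card does not state the value of `k`, so `K = 2` is an assumption chosen to fit the toy data; only the 1.04 cutoff comes from the card.

```python
import numpy as np

K = 2             # neighbourhood size; assumed, not stated in the card
THRESHOLD = 1.04  # minimum margin score retained, as stated in the card


def l2_normalize(vectors):
    """Scale each row to unit length so that dot products are cosines."""
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


def margin_scores(en_emb, lv_emb, k):
    """Ratio-margin score for every (en, lv) pair of unit-length embeddings."""
    sim = en_emb @ lv_emb.T  # cosine similarities, shape (n_en, n_lv)
    # Mean similarity of each sentence to its k nearest neighbours
    # in the other language.
    en_nn_mean = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # shape (n_en,)
    lv_nn_mean = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # shape (n_lv,)
    # (sum_k cos(x, nn) + sum_k cos(y, nn)) / (2k) is the average of the two means.
    denom = (en_nn_mean[:, None] + lv_nn_mean[None, :]) / 2.0
    return sim / denom


# Toy stand-ins for sentence embeddings: row i of each matrix is a "translation" pair.
en_emb = l2_normalize(np.array([[1.0, 0.0, 0.0],
                                [0.0, 1.0, 0.0],
                                [0.0, 0.0, 1.0]]))
lv_emb = l2_normalize(np.array([[0.9, 0.1, 0.0],
                                [0.1, 0.9, 0.0],
                                [0.1, 0.1, 0.9]]))

scores = margin_scores(en_emb, lv_emb, K)

# Keep only the candidate pairs whose margin score clears the threshold.
kept = [(int(i), int(j)) for i, j in zip(*np.where(scores >= THRESHOLD))]
print(kept)
```

Dividing by the neighbourhood average is what separates the margin score from raw cosine similarity: a pair only scores well if it is noticeably closer than each sentence's other near neighbours, which penalises sentences that are similar to almost everything.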

Provider:

aimamba
