aimamba/WikiMatrix-en-lv
Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/aimamba/WikiMatrix-en-lv
Description:
---
language:
- en
- lv
license: cc-by-sa-4.0
task_categories:
- translation
tags:
- parallel-corpus
- wikipedia
- wikimatrix
- en-lv
- latvian
- machine-translation
- labse
- faiss
size_categories:
- 100K<n<1M
source_datasets:
- wikipedia
pretty_name: WikiMatrix EN-LV
dataset_info:
  features:
  - name: en
    dtype: string
  - name: lv
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: train
    num_examples: 537732
  - name: validation
    num_examples: 29874
  - name: test
    num_examples: 29874
---
# WikiMatrix EN-LV
## Dataset Description
This dataset contains **597,480** English-Latvian parallel sentence pairs mined from Wikipedia with a pipeline modeled on the WikiMatrix methodology.
### Method
1. **Source**: English and Latvian Wikipedia article dumps (April 2026)
2. **Embeddings**: [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) (Language-agnostic BERT Sentence Embeddings)
3. **Retrieval**: FAISS nearest-neighbor search with `IndexFlatIP`, an exact (brute-force) inner-product index
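4. **Scoring**: Margin-based scoring (ratio form): `margin(x, y) = cos(x, y) / [(Σ_{z ∈ NN_k(x)} cos(x, z) + Σ_{z ∈ NN_k(y)} cos(y, z)) / (2k)]`, where `NN_k(x)` is the set of the `k` nearest neighbors of `x` among the sentences of the other language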
5. **Filtering**: Pairs with margin score ≥ 1.04 retained
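
The scoring and filtering steps can be illustrated without LaBSE or FAISS. Below is a minimal pure-Python sketch of the ratio-margin formula on hand-made 2-D vectors. The vectors, the helper names (`cosine`, `mean_knn_similarity`, `ratio_margin`) and the choice `k = 2` are assumptions made for illustration; this card does not state the value of `k`, and this is not the code that produced the dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_knn_similarity(query, candidates, k):
    """Average cosine similarity between `query` and its k most similar candidates."""
    top = sorted((cosine(query, c) for c in candidates), reverse=True)[:k]
    return sum(top) / len(top)


def ratio_margin(x, y, src_vectors, tgt_vectors, k):
    """cos(x, y) divided by the mean neighborhood similarity of x and y.

    Neighbors of the source vector x are searched among the target vectors,
    and neighbors of the target vector y among the source vectors.
    """
    denominator = (
        mean_knn_similarity(x, tgt_vectors, k)
        + mean_knn_similarity(y, src_vectors, k)
    ) / 2
    return cosine(x, y) / denominator


# Toy "embeddings" (made up): en_vecs[i] is meant to align with lv_vecs[i].
en_vecs = [[1.0, 0.0], [0.0, 1.0]]
lv_vecs = [[0.9, 0.1], [0.1, 0.9]]

good = ratio_margin(en_vecs[0], lv_vecs[0], en_vecs, lv_vecs, k=2)
bad = ratio_margin(en_vecs[0], lv_vecs[1], en_vecs, lv_vecs, k=2)

THRESHOLD = 1.04  # the filtering threshold stated in step 5
print(round(good, 3), good >= THRESHOLD)  # about 1.8, kept
print(round(bad, 3), bad >= THRESHOLD)    # about 0.2, discarded
```

A score above 1 means the pair is more similar than the average of its neighborhood, which is why the threshold sits just above 1 rather than near the raw cosine range.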
### Columns
| Column | Type | Description |
|--------|------|-------------|
| `en` | string | English sentence |
| `lv` | string | Latvian sentence |
| `score` | float64 | Margin similarity score (higher = more confident alignment) |
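
Some translation training scripts expect each example as a nested `{"translation": {"en": ..., "lv": ...}}` record rather than flat columns. The sketch below shows one way to reshape a row; the `PairRow` type, the `to_translation_record` helper and the example sentence pair are hypothetical and are not taken from the dataset.

```python
from typing import TypedDict


class PairRow(TypedDict):
    """Shape of one row, mirroring the three columns in the table above."""
    en: str
    lv: str
    score: float


def to_translation_record(row: PairRow) -> dict:
    """Wrap a flat row in a nested 'translation' dict, dropping the score."""
    return {"translation": {"en": row["en"], "lv": row["lv"]}}


# Illustrative row only; not an actual entry from the dataset.
example_row: PairRow = {
    "en": "Riga is the capital of Latvia.",
    "lv": "Rīga ir Latvijas galvaspilsēta.",
    "score": 1.12,
}

record = to_translation_record(example_row)
print(record["translation"]["lv"])
```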
### Splits
| Split | Examples |
|-------|----------|
| train | 537,732 |
| validation | 29,874 |
| test | 29,874 |
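
As a quick arithmetic check, the split sizes above add up to the stated total and correspond to a 90/5/5 partition. The snippet uses only the numbers from the table; the variable names are illustrative.

```python
# Split sizes copied from the table above.
split_sizes = {"train": 537_732, "validation": 29_874, "test": 29_874}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

print(f"total: {total}")
for name, size in split_sizes.items():
    print(f"{name}: {size} ({fractions[name]:.1%})")
```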
### Quality Thresholds
The `score` column can be used to filter for higher-quality pairs:
| Threshold | Approx. Pairs | Quality |
|-----------|---------------|---------|
| ≥ 1.04 | 597,480 | All pairs (full dataset) |
| ≥ 1.06 | ~500,000 | **Recommended for training** |
| ≥ 1.10 | ~350,000 | High confidence |
| ≥ 1.20 | ~150,000 | Very high confidence |
```python
from datasets import load_dataset

# Loads all three splits (train, validation, test)
ds = load_dataset("aimamba/WikiMatrix-en-lv")

# Keep only training pairs at or above the recommended threshold
high_quality = ds["train"].filter(lambda x: x["score"] >= 1.06)
print(f"High-quality training pairs: {len(high_quality)}")
```
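
The pair counts in the threshold table are approximate. To get exact counts for a given split, the scores can be tallied directly. The following sketch uses a hypothetical helper, `count_at_thresholds`, and a handful of made-up scores so that it runs without downloading anything.

```python
from typing import Dict, Iterable


def count_at_thresholds(
    scores: Iterable[float], thresholds: Iterable[float]
) -> Dict[float, int]:
    """Return, for each threshold, how many scores are at or above it."""
    scores = list(scores)  # materialize once so every threshold sees all scores
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Made-up scores for illustration; real values come from the `score` column.
toy_scores = [1.04, 1.05, 1.07, 1.11, 1.25, 1.30]
counts = count_at_thresholds(toy_scores, [1.04, 1.06, 1.10, 1.20])

for threshold, n in counts.items():
    print(f">= {threshold:.2f}: {n} pairs")
```

With the real data, passing the `score` column of a loaded split (for example `ds["train"]["score"]`) in place of `toy_scores` should give that split's exact counts.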
### License
CC-BY-SA 4.0 (inherited from Wikipedia)
### Citation
```bibtex
@misc{wikimatrix-en-lv-2026,
title={WikiMatrix EN-LV: English-Latvian Parallel Corpus from Wikipedia},
author={aimamba},
year={2026},
howpublished={\url{https://huggingface.co/datasets/aimamba/WikiMatrix-en-lv}},
note={597,480 sentence pairs mined using LaBSE + FAISS}
}
```
### Acknowledgments
- [LaBSE](https://arxiv.org/abs/2007.01852) — Feng et al., 2022
- [WikiMatrix](https://arxiv.org/abs/1907.05791) — Schwenk et al., 2019 (methodology inspiration)
- [FAISS](https://github.com/facebookresearch/faiss) — Facebook AI Research
Provider:
aimamba



