aimamba/latvian-english-atomic-translation
Source: Hugging Face. Updated: 2026-04-16. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/aimamba/latvian-english-atomic-translation
Resource description:
---
language:
- en
- lv
license: cc-by-4.0
task_categories:
- translation
tags:
- latvian
- english
- translation
- atomic-template
- distillation
size_categories:
- 1M<n<10M
---
# Latvian-English ATOMIC Translation Dataset
Private dataset for distilling TildeOpen-30B into Qwen2 1.5B.
## Dataset Description
5,236,232 bidirectional Latvian↔English translation examples in ATOMIC chat JSONL format.
### Sources
- OpenSubtitles (casual): 51.4%
- Europarl (formal): 23.6%
- WikiMatrix (encyclopedic): 18.5%
- MUSE Dictionary: 3.5%
- KDE4+GNOME+Ubuntu (technical): 2.9%
- Tatoeba (short): 0.1%
### Format
Each example is a chat-format JSONL entry:
```json
{"messages": [{"role": "system", "content": "You are a Latvian-English translation assistant..."}, {"role": "user", "content": "Ko tu dari?"}, {"role": "assistant", "content": "What are you doing?"}]}
```
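To make the layout concrete, here is a minimal sketch that parses one JSONL line and pulls out the source and target text. The helper name `extract_pair` is hypothetical and not part of the dataset. It assumes every record holds exactly one `user` turn and one `assistant` turn, as in the example above.

```python
import json


def extract_pair(line):
    """Return (source_text, target_text) from one chat-format JSONL line.

    Hypothetical helper: assumes one user turn and one assistant turn.
    """
    record = json.loads(line)
    # Index message contents by role; the system prompt is ignored here.
    by_role = {m["role"]: m["content"] for m in record["messages"]}
    return by_role["user"], by_role["assistant"]


sample = (
    '{"messages": ['
    '{"role": "system", "content": "You are a Latvian-English translation assistant..."}, '
    '{"role": "user", "content": "Ko tu dari?"}, '
    '{"role": "assistant", "content": "What are you doing?"}]}'
)
source, target = extract_pair(sample)
print(source, "->", target)  # Ko tu dari? -> What are you doing?
```

Because the dataset is bidirectional, the `user` turn may be in either Latvian or English; the sketch makes no attempt to detect the direction.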
### Splits
- Train: 4,712,608 examples
- Validation: 261,811 examples
- Test: 261,813 examples
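As a quick consistency check, the split sizes above add up to the 5,236,232 total and work out to roughly a 90/5/5 split. A minimal sketch of that arithmetic, using only the counts listed on this card:

```python
# Split sizes as listed on this card.
splits = {"train": 4_712_608, "validation": 261_811, "test": 261_813}

total = sum(splits.values())
print(total)  # 5236232

# Share of each split, rounded to one decimal place.
for name, count in splits.items():
    print(f"{name}: {count / total:.1%}")
# train: 90.0%
# validation: 5.0%
# test: 5.0%
```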
### Usage
```python
from datasets import load_dataset

data_files = {"train": "data/train.jsonl", "validation": "data/validation.jsonl", "test": "data/test.jsonl"}
ds = load_dataset("aimamba/latvian-english-atomic-translation", data_files=data_files)
```
Provider:
aimamba



