aimamba/latvian-english-atomic-translation

Source: Hugging Face | Updated: 2026-04-16 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/aimamba/latvian-english-atomic-translation
Description:
---
language:
- en
- lv
license: cc-by-4.0
task_categories:
- translation
tags:
- latvian
- english
- translation
- atomic-template
- distillation
size_categories:
- 1M<n<10M
---

# Latvian-English ATOMIC Translation Dataset

Private dataset for distilling TildeOpen-30B into Qwen2 1.5B.

## Dataset Description

5,236,232 bidirectional Latvian↔English translation examples in ATOMIC chat JSONL format.

### Sources

- OpenSubtitles (casual): 51.4%
- Europarl (formal): 23.6%
- WikiMatrix (encyclopedic): 18.5%
- MUSE Dictionary: 3.5%
- KDE4+GNOME+Ubuntu (technical): 2.9%
- Tatoeba (short): 0.1%

### Format

Each example is a chat-format JSONL entry:

```json
{"messages": [{"role": "system", "content": "You are a Latvian-English translation assistant..."}, {"role": "user", "content": "Ko tu dari?"}, {"role": "assistant", "content": "What are you doing?"}]}
```

### Splits

- Train: 4,712,608 examples
- Validation: 261,811 examples
- Test: 261,813 examples

### Usage

```python
from datasets import load_dataset

ds = load_dataset(
    "aimamba/latvian-english-atomic-translation",
    data_files={
        "train": "data/train.jsonl",
        "validation": "data/validation.jsonl",
        "test": "data/test.jsonl",
    },
)
```
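
As an illustration of the chat format described in the card, here is a minimal sketch that pulls the source and target sentences out of a single JSONL line. It uses only the standard library and the sample record from the card. The helper name `extract_pair` is hypothetical (not part of the dataset or any library), and the sketch assumes each record contains exactly one `user` and one `assistant` message.

```python
import json
from typing import Tuple

# Sample record copied from the dataset card's Format section.
line = (
    '{"messages": ['
    '{"role": "system", "content": "You are a Latvian-English translation assistant..."}, '
    '{"role": "user", "content": "Ko tu dari?"}, '
    '{"role": "assistant", "content": "What are you doing?"}]}'
)


def extract_pair(jsonl_line: str) -> Tuple[str, str]:
    """Return (source_text, target_text) from one chat-format JSONL line.

    Hypothetical helper: assumes one user and one assistant message per record.
    """
    record = json.loads(jsonl_line)
    # Index message contents by role; the system prompt is ignored here.
    by_role = {m["role"]: m["content"] for m in record["messages"]}
    return by_role["user"], by_role["assistant"]


source, target = extract_pair(line)
print(source, "->", target)  # prints: Ko tu dari? -> What are you doing?
```

Because the dataset is bidirectional, the `user` message may be in either Latvian or English, so the returned pair should be treated as (source, target) rather than (Latvian, English).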
Provided by:
aimamba