parallel-sentences
Source: ModelScope community. Updated 2025-11-06; indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/sentence-transformers/parallel-sentences
Description:
# Parallel Sentences for 50+ languages
> [!NOTE]
> This repository contains raw datasets, all of which have also been formatted for easy training in the [Parallel Sentences Datasets](https://huggingface.co/collections/sentence-transformers/parallel-sentences-datasets-6644d644123d31ba5b1c8785) collection. We recommend looking there first.
This repository contains parallel sentences (i.e., an English sentence paired with the same sentence in another language) for 50+ languages. Each file is a gzip-compressed TSV file (`tsv.gz`) with one sentence pair per line:
```
english_sentences\tsentence_in_other_language
```
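As a minimal sketch of how a file in this layout could be read, the snippet below streams sentence pairs from a `tsv.gz` file using only the Python standard library. The function name `read_parallel_pairs`, the `max_pairs` parameter, and the sample file name are illustrative assumptions, not part of this repository; lines that do not contain exactly two non-empty columns are skipped.

```python
import gzip
import os
import tempfile


def read_parallel_pairs(path, max_pairs=None):
    """Yield (english, other_language) tuples from a tab-separated .tsv.gz file.

    Hypothetical helper: assumes one pair per line, English in the first column.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as fin:
        for line in fin:
            parts = line.rstrip("\n").split("\t")
            # Skip blank or malformed lines instead of failing on them.
            if len(parts) != 2:
                continue
            english, other = parts[0].strip(), parts[1].strip()
            if not english or not other:
                continue
            yield english, other
            count += 1
            if max_pairs is not None and count >= max_pairs:
                break


if __name__ == "__main__":
    # Build a tiny stand-in file so the sketch runs without any download.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample-en-de.tsv.gz")  # hypothetical file name
        with gzip.open(sample, "wt", encoding="utf-8") as fout:
            fout.write("Hello world\tHallo Welt\n")
            fout.write("Good morning\tGuten Morgen\n")
        for english, other in read_parallel_pairs(sample):
            print(english, "|", other)
```

Streaming with a generator keeps memory use flat, which may matter for the larger files; pass `max_pairs` to inspect only the first few pairs.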
Sentences stem from the [OPUS website](https://opus.nlpl.eu/).
The following datasets are included:
- [Europarl](https://opus.nlpl.eu/Europarl.php)
- [GlobalVoices](https://opus.nlpl.eu/GlobalVoices.php)
- [JW300](https://opus.nlpl.eu/JW300.php)
- [MUSE](https://github.com/facebookresearch/MUSE)
- [News-Commentary](https://opus.nlpl.eu/News-Commentary.php)
- [OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles.php)
- [Tatoeba](https://tatoeba.org/)
- Talks - Custom translated transcripts of talks
- [WikiMatrix](https://opus.nlpl.eu/WikiMatrix.php)
- WikiTitles - Custom dataset with parallel Wikipedia titles
## Usage
These sentences can be used to train multilingual sentence embedding models. For more details, see [SBERT.net - Multilingual-Model](https://www.sbert.net/examples/training/multilingual/README.html).
**This dataset cannot yet be used with the Hugging Face `datasets` library. You must download the individual TSV files.**
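Once the individual files have been downloaded by hand, they can be gathered into a single in-memory list without any dataset library. The following is a rough sketch under stated assumptions: the function name `load_downloaded_files`, the dictionary keys (`source_file`, `english`, `other`), and the idea that all files sit in one local directory are illustrative choices, not something this repository prescribes.

```python
import csv
import gzip
from pathlib import Path


def load_downloaded_files(directory):
    """Collect sentence pairs from every *.tsv.gz file in a local directory.

    Hypothetical helper: returns a list of dicts, one per sentence pair.
    """
    rows = []
    for path in sorted(Path(directory).glob("*.tsv.gz")):
        with gzip.open(path, "rt", encoding="utf-8", newline="") as fin:
            # QUOTE_NONE treats quotation marks as ordinary text, so sentences
            # that contain quotes are not merged or split unexpectedly.
            reader = csv.reader(fin, delimiter="\t", quoting=csv.QUOTE_NONE)
            for fields in reader:
                if len(fields) != 2:
                    continue  # skip blank or malformed lines
                rows.append(
                    {
                        "source_file": path.name,
                        "english": fields[0],
                        "other": fields[1],
                    }
                )
    return rows
```

Keeping the file name in each row makes it possible to filter by corpus or language pair later, assuming the file names encode that information.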
Provider: maas
Created: 2025-01-06



