parallel-sentences-opus-100
Source: ModelScope community (魔搭社区). Updated 2025-11-12; indexed 2025-01-11.
Download link:
https://modelscope.cn/datasets/sentence-transformers/parallel-sentences-opus-100
# Dataset Card for Parallel Sentences - OPUS-100
This dataset contains parallel sentences (i.e. an English sentence paired with the same sentence in another language) for numerous languages. The sentences originate from the [OPUS-100 website](https://opus.nlpl.eu/opus-100.php).
In particular, this dataset is a reformatting of the [OPUS-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100) dataset.
## Related Datasets
The following datasets are also a part of the Parallel Sentences collection:
* [parallel-sentences-europarl](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-europarl)
* [parallel-sentences-global-voices](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-global-voices)
* [parallel-sentences-muse](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-muse)
* [parallel-sentences-jw300](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-jw300)
* [parallel-sentences-news-commentary](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-news-commentary)
* [parallel-sentences-opensubtitles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-opensubtitles)
* [parallel-sentences-talks](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-talks)
* [parallel-sentences-tatoeba](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-tatoeba)
* [parallel-sentences-wikimatrix](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikimatrix)
* [parallel-sentences-wikititles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikititles)
Recent additions (May 2024):
* [parallel-sentences-opus-100](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-opus-100)
These datasets can be used to train multilingual sentence embedding models. For more information, see [sbert.net - Multilingual Models](https://www.sbert.net/examples/training/multilingual/README.html).
## Dataset Stats
* Columns: "english", "non_english"
* Column types: `str`, `str`
* Examples:
```python
{
    "english": "Run Program",
    "non_english": "Rith Ríomhchlár"
}
```
* Collection strategy: Processing the raw data from [OPUS-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100) and restructuring it into 2 columns: "english" and "non_english".
* Deduplicated: No
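
The collection strategy above can be illustrated with a minimal sketch. It assumes the raw OPUS-100 records look like `{"translation": {"en": ..., "<lang>": ...}}`, which is how the Hugging Face version of OPUS-100 is commonly laid out; that layout is an assumption here, not something this card states. The helpers `to_pair` and `deduplicate` are hypothetical names for illustration, not part of any library or of the authors' actual processing script. Since the card notes that the data is not deduplicated, the sketch also shows one way to drop exact duplicate pairs if needed.

```python
from typing import Iterable, Iterator


def to_pair(record: dict, source_lang: str = "en") -> dict:
    """Flatten one OPUS-100-style record into the two-column layout.

    Assumption: the record holds exactly one English entry and exactly one
    entry in another language under the "translation" key.
    """
    translation = record["translation"]
    # Unpacking into a one-element tuple raises if the assumption is violated.
    (other_lang,) = [lang for lang in translation if lang != source_lang]
    return {
        "english": translation[source_lang],
        "non_english": translation[other_lang],
    }


def deduplicate(pairs: Iterable[dict]) -> Iterator[dict]:
    """Yield each exact (english, non_english) pair once, keeping first-seen order."""
    seen = set()
    for pair in pairs:
        key = (pair["english"], pair["non_english"])
        if key not in seen:
            seen.add(key)
            yield pair


# Illustrative input: the card's example pair, repeated to show deduplication.
raw = [
    {"translation": {"en": "Run Program", "ga": "Rith Ríomhchlár"}},
    {"translation": {"en": "Run Program", "ga": "Rith Ríomhchlár"}},
]

pairs = [to_pair(record) for record in raw]
unique_pairs = list(deduplicate(pairs))

print(pairs[0])
print(len(pairs), len(unique_pairs))
```

Whether to deduplicate depends on the use case: exact duplicates are harmless for many training setups, but they can skew evaluation splits.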
Provider: maas
Created: 2025-01-06



