parallel-sentences-opensubtitles
Source: ModelScope community. Updated 2025-11-12; indexed 2025-01-11.
Download link:
https://modelscope.cn/datasets/sentence-transformers/parallel-sentences-opensubtitles
# Dataset Card for Parallel Sentences - OpenSubtitles
This dataset contains parallel sentences (i.e. an English sentence paired with the same sentence in another language) for numerous languages. Most of the sentences originate from the [OPUS website](https://opus.nlpl.eu/).
In particular, this dataset contains the [OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles/corpus/version/OpenSubtitles) dataset.
Warning! The quality of this dataset is not great; many of the English and non-English texts don't match well, or are completely empty.
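
Given the warning above, it may be worth filtering rows before training. The following is a minimal sketch, not part of the dataset's own tooling: the function name `is_usable_pair` and the `max_length_ratio` threshold are hypothetical, the rows are made-up toy data, and the character-length ratio is only a crude proxy for "doesn't match well".

```python
def is_usable_pair(row, max_length_ratio=3.0):
    """Return True if both sides are non-empty and roughly similar in length.

    `max_length_ratio` is an illustrative threshold, not a value taken from
    the dataset card.
    """
    english = (row.get("english") or "").strip()
    non_english = (row.get("non_english") or "").strip()
    # Drop rows where either side is missing or whitespace-only.
    if not english or not non_english:
        return False
    # Crude mismatch heuristic: reject pairs with very different lengths.
    longer = max(len(english), len(non_english))
    shorter = min(len(english), len(non_english))
    return longer / shorter <= max_length_ratio


# Toy rows invented for illustration; they are not taken from the dataset.
rows = [
    {"english": "Good morning.", "non_english": "Guten Morgen."},
    {"english": "Hello.", "non_english": ""},
    {"english": "", "non_english": "Bonjour."},
    {"english": "Yes.", "non_english": "Une très longue phrase qui ne correspond pas du tout."},
]
kept = [row for row in rows if is_usable_pair(row)]
```

Character counts differ a lot between scripts, so a single ratio threshold is unlikely to suit every language pair; treat it as a starting point to tune per subset.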
## Related Datasets
The following datasets are also a part of the Parallel Sentences collection:
* [parallel-sentences-europarl](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-europarl)
* [parallel-sentences-global-voices](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-global-voices)
* [parallel-sentences-muse](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-muse)
* [parallel-sentences-jw300](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-jw300)
* [parallel-sentences-news-commentary](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-news-commentary)
* [parallel-sentences-opensubtitles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-opensubtitles)
* [parallel-sentences-talks](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-talks)
* [parallel-sentences-tatoeba](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-tatoeba)
* [parallel-sentences-wikimatrix](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikimatrix)
* [parallel-sentences-wikititles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikititles)
* [parallel-sentences-ccmatrix](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-ccmatrix)
These datasets can be used to train multilingual sentence embedding models. For more information, see [sbert.net - Multilingual Models](https://www.sbert.net/examples/training/multilingual/README.html).
## Dataset Subsets
### `all` subset
* Columns: "english", "non_english"
* Column types: `str`, `str`
* Examples:
```python
{
"english": "We can't predict it and we can't control it.",
"non_english": "نحن لا نَستطيعُ تَوَقُّعه ونحن لا نَستطيعُ السَيْطَرَة عليه."
}
```
* Collection strategy: Combining all other subsets from this dataset.
* Deduplicated: No
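
Because the `all` subset is not deduplicated, repeated pairs can be dropped while iterating. This is a minimal sketch that assumes exact string matching on both columns; `deduplicate_pairs` is a hypothetical helper, the rows are toy data, and the card does not say how the per-language subsets were deduplicated.

```python
def deduplicate_pairs(rows):
    """Yield each (english, non_english) pair once, in first-seen order."""
    seen = set()
    for row in rows:
        key = (row["english"], row["non_english"])
        if key not in seen:
            seen.add(key)
            yield row


# Toy rows invented for illustration; they are not taken from the dataset.
rows = [
    {"english": "Thank you.", "non_english": "Danke."},
    {"english": "Thank you.", "non_english": "Merci."},
    {"english": "Thank you.", "non_english": "Danke."},
]
unique_rows = list(deduplicate_pairs(rows))
```

The `seen` set grows with the number of distinct pairs, so for a very large subset it may be preferable to store hashes of the pairs instead of the strings themselves.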
### `en-...` subsets
* Columns: "english", "non_english"
* Column types: `str`, `str`
* Examples:
```python
{
"english": "ever faithful, ever true, nothing stops him, he'll get through.",
"non_english": "우리의 한결같은 심부름꾼 황새 아저씨 가는 길을 그 누가 막으랴!"
}
```
* Collection strategy: Processing the raw data from [parallel-sentences](https://huggingface.co/datasets/sentence-transformers/parallel-sentences) and formatting it in Parquet, followed by deduplication.
* Deduplicated: Yes
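
The subsets described above could be read with the Hugging Face `datasets` library roughly as follows. This is a sketch under assumptions: the repository ID is the Hugging Face one from the Related Datasets list, the split name `"train"` is a guess because the card does not list splits, and `has_expected_schema` and `stream_subset` are hypothetical helper names.

```python
from collections.abc import Mapping

REPO_ID = "sentence-transformers/parallel-sentences-opensubtitles"
EXPECTED_COLUMNS = ("english", "non_english")


def has_expected_schema(row):
    """Check that a row has both documented columns, each holding a str."""
    return isinstance(row, Mapping) and all(
        isinstance(row.get(column), str) for column in EXPECTED_COLUMNS
    )


def stream_subset(subset="all", split="train"):
    """Lazily iterate over one subset without downloading it in full.

    Requires the `datasets` package and network access. The split name
    "train" is an assumption; the card does not list splits.
    """
    # Imported here so the schema check above works without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, subset, split=split, streaming=True)
```

A typical use would be `for row in stream_subset("all"):` followed by a `has_expected_schema(row)` check. Streaming avoids materializing the whole subset locally, which can matter for the combined `all` subset.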

Provider: maas
Created: 2025-01-06



