parallel-sentences-ccmatrix
Source: ModelScope community (魔搭社区). Updated 2025-11-12; indexed 2025-01-11.
Download link:
https://modelscope.cn/datasets/sentence-transformers/parallel-sentences-ccmatrix
Description:
# Dataset Card for Parallel Sentences - CCMatrix
This dataset contains parallel sentences (i.e., an English sentence paired with the same sentence in another language) for numerous languages. The texts originate from the [CCMatrix](https://ai.meta.com/blog/ccmatrix-a-billion-scale-bitext-data-set-for-training-translation-models/) dataset.
## Related Datasets
The following datasets are also a part of the Parallel Sentences collection:
* [parallel-sentences-europarl](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-europarl)
* [parallel-sentences-global-voices](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-global-voices)
* [parallel-sentences-muse](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-muse)
* [parallel-sentences-jw300](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-jw300)
* [parallel-sentences-news-commentary](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-news-commentary)
* [parallel-sentences-opensubtitles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-opensubtitles)
* [parallel-sentences-talks](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-talks)
* [parallel-sentences-tatoeba](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-tatoeba)
* [parallel-sentences-wikimatrix](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikimatrix)
* [parallel-sentences-wikititles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikititles)
* [parallel-sentences-ccmatrix](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-ccmatrix)
These datasets can be used to train multilingual sentence embedding models. For more information, see [sbert.net - Multilingual Models](https://www.sbert.net/examples/training/multilingual/README.html).
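To make the role of parallel pairs in training concrete, the following is a minimal sketch of a teacher-student distillation objective, assuming the setup described in the linked sbert.net guide: a student model is trained so that its embeddings of both the English sentence and its translation match a teacher's embedding of the English sentence. The names `distillation_loss`, `toy_teacher`, and `toy_student` are hypothetical and exist only for illustration; real training would use actual encoder models and batched tensors.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Encoder = Callable[[str], Vector]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(pair: Dict[str, str], teacher: Encoder, student: Encoder) -> float:
    """Loss for one parallel pair.

    Both student embeddings (English and non-English) are pulled
    toward the teacher's embedding of the English sentence.
    """
    target = teacher(pair["english"])
    return (
        mse(student(pair["english"]), target)
        + mse(student(pair["non_english"]), target)
    )


# Toy stand-ins (hypothetical) so the sketch runs without downloading a model.
def toy_teacher(text: str) -> Vector:
    return [float(len(text)), float(text.count(" "))]


def toy_student(text: str) -> Vector:
    return [float(len(text)), 0.0]
```

The loss is zero only when the student reproduces the teacher's English embedding for both sides of the pair, which is what pushes translations of the same sentence to the same point in the embedding space.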
## Dataset Subsets
### `en-...` subsets
* Columns: "english", "non_english"
* Column types: `str`, `str`
* Examples:
```python
{
"english": "He and his mother will be standing vigil there.”",
"non_english": "Él y su madre estarán de vigilia allí”.",
}
```
* Collection strategy: The data from [yhavinga/ccmatrix](https://huggingface.co/datasets/yhavinga/ccmatrix) was processed and reformatted as Parquet with "english" and "non_english" columns.
* Deduplicated: No
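Because the card marks the subsets as not deduplicated, a consumer may want to drop repeated pairs before training. The following is a minimal sketch under a few assumptions: the subset name `"en-es"` and the `"train"` split are guesses based on the `en-...` naming pattern and the Spanish example above, not confirmed by this card, and `load_subset` and `deduplicate` are hypothetical helper names. Loading requires the `datasets` package and network access; the deduplication helper runs on any iterable of rows.

```python
from typing import Dict, Iterable, Iterator

DATASET_ID = "sentence-transformers/parallel-sentences-ccmatrix"


def load_subset(subset: str = "en-es"):
    """Stream one subset from the Hugging Face Hub.

    Assumptions: the subset name and the "train" split exist.
    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset  # lazy import; the rest runs without it

    return load_dataset(DATASET_ID, subset, split="train", streaming=True)


def deduplicate(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each (english, non_english) pair once, keeping the first occurrence."""
    seen = set()
    for row in rows:
        key = (row["english"], row["non_english"])
        if key not in seen:
            seen.add(key)
            yield row
```

A typical call would be `deduplicate(load_subset())`. The `seen` set grows with the number of unique pairs, so for very large subsets it may be preferable to store hashes of the pairs instead of the strings themselves.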
Provider:
maas
Created:
2025-01-06



