parallel-sentences-news-commentary
ModelScope community · Updated 2025-11-07 · Indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/parallel-sentences-news-commentary
Resource description:
# Dataset Card for Parallel Sentences - News Commentary
This dataset contains parallel sentences (i.e. an English sentence paired with the same sentence in another language) for numerous languages. Most of the sentences originate from the [OPUS website](https://opus.nlpl.eu/).
In particular, this dataset contains the [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) dataset.
## Related Datasets
The following datasets are also a part of the Parallel Sentences collection:
* [parallel-sentences-europarl](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-europarl)
* [parallel-sentences-global-voices](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-global-voices)
* [parallel-sentences-muse](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-muse)
* [parallel-sentences-jw300](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-jw300)
* [parallel-sentences-news-commentary](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-news-commentary)
* [parallel-sentences-opensubtitles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-opensubtitles)
* [parallel-sentences-talks](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-talks)
* [parallel-sentences-tatoeba](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-tatoeba)
* [parallel-sentences-wikimatrix](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikimatrix)
* [parallel-sentences-wikititles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikititles)
* [parallel-sentences-ccmatrix](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-ccmatrix)
These datasets can be used to train multilingual sentence embedding models. For more information, see [sbert.net - Multilingual Models](https://www.sbert.net/examples/training/multilingual/README.html).
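As a rough illustration of how parallel pairs can drive such training, here is a minimal sketch of a distillation-style objective: a student model is pushed to map both the English sentence and its translation onto a teacher model's embedding of the English sentence. This recipe is an assumption about one common approach, not something this card specifies; the toy vectors stand in for real model outputs, and the function names `mse` and `distillation_loss` are hypothetical.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher_english: Sequence[float],
    student_english: Sequence[float],
    student_non_english: Sequence[float],
) -> float:
    """Pull both student embeddings toward the teacher's English embedding."""
    return mse(student_english, teacher_english) + mse(
        student_non_english, teacher_english
    )


# Toy embeddings (assumed values, not produced by any real model).
teacher_en = [1.0, 0.0]
student_en = [1.0, 0.0]      # already matches the teacher
student_other = [0.0, 0.0]   # translation embedding is still off

loss = distillation_loss(teacher_en, student_en, student_other)
print(loss)
```

In a real setup the vectors would come from the "english" and "non_english" columns passed through actual encoder models.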
## Dataset Subsets
### `all` subset
* Columns: "english", "non_english"
* Column types: `str`, `str`
* Examples:
```python
{
"english": "Pure interests – expressed through lobbying power – were undoubtedly important to several key deregulation measures in the US, whose political system and campaign-finance rules are peculiarly conducive to the power of specific lobbies.",
"non_english": "Заинтересованные группы, действующие посредством лоббирования власти, явились важными действующими лицами при принятии нескольких ключевых мер по отмене регулирующих норм в США, чья политическая система и правила финансирования кампаний особенно поддаются власти отдельных лобби."
}
```
* Collection strategy: Combining all other subsets from this dataset.
* Deduplicated: No
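The collection strategy for the `all` subset can be pictured as plain concatenation. The following is a minimal sketch, not the maintainers' actual pipeline: the subset names `en-xx` and `en-yy`, the toy rows, and the helper `combine_subsets` are hypothetical placeholders.

```python
from typing import Dict, List

Row = Dict[str, str]  # one record: {"english": ..., "non_english": ...}


def combine_subsets(subsets: Dict[str, List[Row]]) -> List[Row]:
    """Concatenate every per-language subset into one list, without deduplicating."""
    combined: List[Row] = []
    for name in sorted(subsets):  # sorted only to make the order reproducible
        combined.extend(subsets[name])
    return combined


# Hypothetical toy data; real subset names and rows will differ.
toy_subsets = {
    "en-xx": [{"english": "Hello.", "non_english": "xx: Hello."}],
    "en-yy": [
        {"english": "Hello.", "non_english": "yy: Hello."},
        {"english": "Hello.", "non_english": "yy: Hello."},  # duplicate is kept
    ],
}

all_rows = combine_subsets(toy_subsets)
print(len(all_rows))
```

Because nothing is filtered at this stage, repeated pairs survive the merge, which is consistent with the subset being marked as not deduplicated.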
### `en-...` subsets
* Columns: "english", "non_english"
* Column types: `str`, `str`
* Examples:
```python
{
"english": "Last December, many gold bugs were arguing that the price was inevitably headed for $2,000.",
"non_english": "Lo scorso dicembre, molti fanatici dell’oro sostenevano che il suo prezzo era inevitabilmente destinato a raggiungere i 2000 dollari."
}
```
* Collection strategy: Processing the raw data from [parallel-sentences](https://huggingface.co/datasets/sentence-transformers/parallel-sentences) and formatting it in Parquet, followed by deduplication.
* Deduplicated: Yes
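The card does not say how duplicates are detected. As a minimal sketch, assuming a duplicate means an exact match on the ("english", "non_english") pair and that the first occurrence is kept, the deduplication step could look like this; `deduplicate_pairs` is a hypothetical helper, not part of any library.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, str]  # one record: {"english": ..., "non_english": ...}


def deduplicate_pairs(rows: Iterable[Row]) -> List[Row]:
    """Drop rows whose (english, non_english) pair was already seen."""
    seen: Set[Tuple[str, str]] = set()
    unique: List[Row] = []
    for row in rows:
        key = (row["english"], row["non_english"])
        if key not in seen:  # assumption: exact string match defines a duplicate
            seen.add(key)
            unique.append(row)
    return unique


# Toy rows for illustration only.
rows = [
    {"english": "Good morning.", "non_english": "Buongiorno."},
    {"english": "Good morning.", "non_english": "Buongiorno."},
    {"english": "Good morning.", "non_english": "Buon giorno."},
]

unique_rows = deduplicate_pairs(rows)
print(len(unique_rows))
```

Under this assumption, two rows that share an English sentence but differ in translation are both retained.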
Provider:
maas
Created:
2025-01-06



