
parallel-sentences-talks

ModelScope Community · Updated 2025-11-12 · Indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/parallel-sentences-talks
Resource description:
# Dataset Card for Parallel Sentences - Talks

This dataset contains parallel sentences (i.e. English sentence + the same sentences in another language) for numerous other languages. Most of the sentences originate from the [OPUS website](https://opus.nlpl.eu/). In particular, this dataset contains the [Talks](https://huggingface.co/datasets/sentence-transformers/parallel-sentences) dataset.

## Related Datasets

The following datasets are also a part of the Parallel Sentences collection:

* [parallel-sentences-europarl](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-europarl)
* [parallel-sentences-global-voices](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-global-voices)
* [parallel-sentences-muse](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-muse)
* [parallel-sentences-jw300](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-jw300)
* [parallel-sentences-news-commentary](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-news-commentary)
* [parallel-sentences-opensubtitles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-opensubtitles)
* [parallel-sentences-talks](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-talks)
* [parallel-sentences-tatoeba](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-tatoeba)
* [parallel-sentences-wikimatrix](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikimatrix)
* [parallel-sentences-wikititles](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-wikititles)
* [parallel-sentences-ccmatrix](https://huggingface.co/datasets/sentence-transformers/parallel-sentences-ccmatrix)

These datasets can be used to train multilingual sentence embedding models. For more information, see [sbert.net - Multilingual Models](https://www.sbert.net/examples/training/multilingual/README.html).
## Dataset Subsets

### `all` subset

* Columns: "english", "non_english"
* Column types: `str`, `str`
* Examples:
    ```python
    {
      'english': "See, the thing we're doing right now is we're forcing people to learn mathematics.",
      'non_english': 'حسناً ان ما نقوم به اليوم .. هو ان نجبر الطلاب لتعلم الرياضيات',
    }
    ```
* Collection strategy: Combining all other subsets from this dataset.
* Deduplicated: No

### `en-...` subsets

* Columns: "english", "non_english"
* Column types: `str`, `str`
* Examples:
    ```python
    {
      'english': "So I think practicality is one case where it's worth teaching people by hand.",
      'non_english': 'Ich denke, dass es sich aus diesem Grund lohnt, den Leuten das Rechnen von Hand beizubringen.',
    }
    ```
* Collection strategy: Processing the raw data from [parallel-sentences](https://huggingface.co/datasets/sentence-transformers/parallel-sentences) and formatting it in Parquet, followed by deduplication.
* Deduplicated: Yes
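The two collection strategies above can be illustrated with a small, self-contained sketch. It is not the pipeline used to build this dataset: the card does not say how duplicates are detected, so exact matching on the `(english, non_english)` pair is an assumption, and the helper names `deduplicate_pairs` and `build_all_subset`, the subset names `en-de` and `en-fr`, and the toy rows are all hypothetical.

```python
from typing import Dict, Iterable, List

# One row of any subset: {"english": ..., "non_english": ...}
Pair = Dict[str, str]


def deduplicate_pairs(rows: Iterable[Pair]) -> List[Pair]:
    """Drop exact duplicate (english, non_english) pairs, keeping the first occurrence.

    Assumption: "deduplication" means exact string equality on both columns.
    """
    seen = set()
    unique: List[Pair] = []
    for row in rows:
        key = (row["english"], row["non_english"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def build_all_subset(subsets: Dict[str, List[Pair]]) -> List[Pair]:
    """Concatenate every language-pair subset with no further deduplication."""
    combined: List[Pair] = []
    for name in sorted(subsets):  # sorted only to make the order deterministic
        combined.extend(subsets[name])
    return combined


# Toy rows, invented for illustration (not taken from the dataset).
raw_en_de = [
    {"english": "Thank you.", "non_english": "Danke."},
    {"english": "Thank you.", "non_english": "Danke."},
    {"english": "Good morning.", "non_english": "Guten Morgen."},
]
raw_en_fr = [{"english": "Thank you.", "non_english": "Merci."}]

# Each "en-..." subset is deduplicated on its own.
subsets = {
    "en-de": deduplicate_pairs(raw_en_de),
    "en-fr": deduplicate_pairs(raw_en_fr),
}

# The "all" subset is the plain concatenation of the others.
all_rows = build_all_subset(subsets)
```

In this sketch the same English sentence can still appear several times in `all_rows`, once per language pair, which is one way to read the card's note that the `all` subset is not deduplicated.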

Provider:
maas
Created:
2025-01-06