nuvocare/Ted2020_en_es_fr_de_it_ca_pl_ru_nl

Name: nuvocare/Ted2020_en_es_fr_de_it_ca_pl_ru_nl
Creator: nuvocare
Published: 2025-02-16 00:04:59
License: No description available

Hugging Face, updated 2025-02-16, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/nuvocare/Ted2020_en_es_fr_de_it_ca_pl_ru_nl

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
dataset_info:
  features:
  - name: de
    dtype: string
  - name: en
    dtype: string
  - name: es
    dtype: string
  - name: fr
    dtype: string
  - name: it
    dtype: string
  - name: nl
    dtype: string
  - name: pl
    dtype: string
  - name: ru
    dtype: string
  splits:
  - name: train
    num_bytes: 191053803
    num_examples: 258098
  - name: test
    num_bytes: 4930156
    num_examples: 7213
  - name: validation
    num_bytes: 4326695
    num_examples: 6049
  download_size: 116856833
  dataset_size: 200310654
language:
- en
- es
- fr
- de
- it
- ca
- pl
- ru
- nl
---

# Dataset Card for "Ted2020_en_es_fr_de_it_ca_pl_ru_nl"

This dataset is an extract of the TED2020 corpora focusing only on English, French, German, Italian, Polish, Russian and Dutch. It is used for building multilingual biomedical language models. The teacher model encodes the English sentence. The student model encodes the other sentences by minimizing the Euclidean distance to the teacher encoding.

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

The Ted2020_en_es_fr_de_it_ca_pl_ru_nl dataset is an extract of the TED2020 corpora focusing on English, French, German, Italian, Polish, Russian, and Dutch. It is used for building multilingual biomedical language models, where the teacher model encodes English sentences and the student model encodes other sentences by minimizing the Euclidean distance with the teacher encoding.
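The teacher-student objective described above can be illustrated with a minimal sketch. It assumes the embeddings are plain lists of floats and applies the gradient step directly to the student vector; in an actual model the gradient would flow into the student's parameters instead. The function names and the three-dimensional vectors are hypothetical and do not come from the dataset card.

```python
import math


def euclidean_distance(teacher_vec, student_vec):
    """Euclidean (L2) distance between two equal-length embedding vectors."""
    if len(teacher_vec) != len(student_vec):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((t - s) ** 2 for t, s in zip(teacher_vec, student_vec)))


def gradient_step(teacher_vec, student_vec, lr=0.25):
    """One gradient-descent step on the squared distance w.r.t. the student vector.

    d/ds ||s - t||^2 = 2 (s - t), so the update is s <- s - lr * 2 (s - t).
    """
    return [s - lr * 2.0 * (s - t) for t, s in zip(teacher_vec, student_vec)]


# Hypothetical embeddings (made-up numbers, not model output).
teacher_en = [1.0, 0.0, 2.0]  # teacher encoding of an English sentence
student_fr = [0.0, 0.0, 0.0]  # student encoding of its French translation

before = euclidean_distance(teacher_en, student_fr)
student_fr = gradient_step(teacher_en, student_fr)
after = euclidean_distance(teacher_en, student_fr)
print(f"distance before: {before:.3f}, after one step: {after:.3f}")
```

With a learning rate of 0.25 the student vector moves halfway toward the teacher vector, so the distance is halved after one step.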

Provider:

nuvocare

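Summary of original information

Dataset card "Ted2020_en_es_fr_de_it_ca_pl_ru_nl"

Dataset overview

This dataset is extracted from the TED2020 corpus and focuses on English, French, German, Italian, Polish, Russian, and Dutch.

Dataset purpose

Used for building multilingual biomedical language models.

Dataset configuration

Default configuration:
- Training set: path data/train-*
- Test set: path data/test-*
- Validation set: path data/validation-*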
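The three splits of the default configuration could be loaded roughly as follows. This is a sketch that assumes the Hugging Face `datasets` library is installed and the repository is reachable over the network; the helper name `load_split` is hypothetical, while the repository id and split names are taken from the listing.

```python
# Repository id and split names as given in the listing.
REPO_ID = "nuvocare/Ted2020_en_es_fr_de_it_ca_pl_ru_nl"
SPLITS = ("train", "test", "validation")


def load_split(split):
    """Load one split from the Hugging Face Hub (needs network access)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

Calling `load_split("validation")` would fetch the smallest split, which may be a convenient first check before downloading the training data.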

Dataset features

Feature names and data types:
- de: string
- en: string
- es: string
- fr: string
- it: string
- nl: string
- pl: string
- ru: string
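A row can be checked against the eight string columns listed above with a small validator. This is only a sketch: the function name `schema_problems` and the placeholder row are hypothetical, and the check assumes each row is available as a plain dictionary.

```python
# Column names taken from the feature list above.
EXPECTED_COLUMNS = ("de", "en", "es", "fr", "it", "nl", "pl", "ru")


def schema_problems(row):
    """Return a list of human-readable problems; an empty list means the row matches."""
    problems = []
    for col in EXPECTED_COLUMNS:
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], str):
            problems.append(f"column {col} is {type(row[col]).__name__}, expected str")
    for col in row:
        if col not in EXPECTED_COLUMNS:
            problems.append(f"unexpected column: {col}")
    return problems


# Hypothetical row with placeholder text, not taken from the dataset.
example_row = {col: f"placeholder sentence ({col})" for col in EXPECTED_COLUMNS}
print(schema_problems(example_row))
```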

Dataset splits

Training set:
- Bytes: 191053803
- Examples: 258098
Test set:
- Bytes: 4930156
- Examples: 7213
Validation set:
- Bytes: 4326695
- Examples: 6049
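The split figures above can be cross-checked with a few lines of arithmetic. The sketch below only uses the numbers from the listing; the dictionary layout and the `share` helper are illustrative choices.

```python
# Byte and example counts copied from the split list above.
SPLIT_STATS = {
    "train": {"num_bytes": 191053803, "num_examples": 258098},
    "test": {"num_bytes": 4930156, "num_examples": 7213},
    "validation": {"num_bytes": 4326695, "num_examples": 6049},
}

total_examples = sum(s["num_examples"] for s in SPLIT_STATS.values())
total_bytes = sum(s["num_bytes"] for s in SPLIT_STATS.values())


def share(split):
    """Fraction of all examples that belong to the given split."""
    return SPLIT_STATS[split]["num_examples"] / total_examples


for name in SPLIT_STATS:
    print(f"{name:<10} {SPLIT_STATS[name]['num_examples']:>7} rows  {share(name):6.2%}")
print(f"total      {total_examples:>7} rows  {total_bytes} bytes")
```

The three splits sum to 271,360 examples and 200,310,654 bytes, which matches the reported dataset size.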

Dataset size

Download size: 116856833 bytes
Dataset size: 200310654 bytes
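For readability, the two byte counts can be converted to MiB and compared. This is a small sketch using only the figures above; the `to_mib` helper is a hypothetical name.

```python
# Sizes in bytes, copied from the listing.
DOWNLOAD_SIZE = 116856833
DATASET_SIZE = 200310654


def to_mib(num_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 ** 2)


ratio = DOWNLOAD_SIZE / DATASET_SIZE
print(f"download size: {to_mib(DOWNLOAD_SIZE):.1f} MiB")
print(f"dataset size:  {to_mib(DATASET_SIZE):.1f} MiB")
print(f"download / dataset ratio: {ratio:.2f}")
```

The download is a little under 60% of the in-memory dataset size.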

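Collected summary

Dataset introduction

Background and challenges

Background overview

This dataset is a multilingual parallel corpus extracted from the TED2020 corpus. Its language metadata lists nine languages: English, Spanish, French, German, Italian, Catalan, Polish, Russian, and Dutch. The feature schema, however, defines eight aligned text columns and has no `ca` column. The dataset contains about 271,000 rows (271,360 across the three splits) and is mainly used for building multilingual biomedical language models. Training follows a teacher-student approach: the teacher model encodes the English sentences, and the student model learns representations for the other languages by minimizing the Euclidean distance to the teacher encoding.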
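One way to turn an aligned row into teacher-student training pairs is to match the English column with each of the other language columns. The sketch below assumes rows are dictionaries keyed by the eight feature names and that a translation may be missing or empty; the function name and the example sentences are hand-written illustrations, not dataset content.

```python
SOURCE_LANG = "en"
TARGET_LANGS = ("de", "es", "fr", "it", "nl", "pl", "ru")


def training_pairs(row):
    """Yield (language, english_sentence, translated_sentence) for non-empty translations."""
    source = row[SOURCE_LANG]
    for lang in TARGET_LANGS:
        target = row.get(lang)
        if target:  # skip missing or empty translations (assumed to be possible)
            yield lang, source, target


# Hand-written example row; the Italian field is left empty on purpose.
example_row = {
    "en": "Thank you.",
    "de": "Danke.",
    "es": "Gracias.",
    "fr": "Merci.",
    "it": "",
    "nl": "Dank u.",
    "pl": "Dziękuję.",
    "ru": "Спасибо.",
}

pairs = list(training_pairs(example_row))
for lang, english, translated in pairs:
    print(lang, english, translated)
```

Each pair would then be fed to the teacher (English side) and the student (translated side) before computing the distance between the two encodings.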

The content above was collected and summarized by 遇见数据集.
