nuvocare/Ted2020_en_es_fr_de_it_ca_pl_ru_nl
Source: Hugging Face. Updated 2025-02-16; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/nuvocare/Ted2020_en_es_fr_de_it_ca_pl_ru_nl
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
dataset_info:
  features:
  - name: de
    dtype: string
  - name: en
    dtype: string
  - name: es
    dtype: string
  - name: fr
    dtype: string
  - name: it
    dtype: string
  - name: nl
    dtype: string
  - name: pl
    dtype: string
  - name: ru
    dtype: string
  splits:
  - name: train
    num_bytes: 191053803
    num_examples: 258098
  - name: test
    num_bytes: 4930156
    num_examples: 7213
  - name: validation
    num_bytes: 4326695
    num_examples: 6049
  download_size: 116856833
  dataset_size: 200310654
language:
- en
- es
- fr
- de
- it
- ca
- pl
- ru
- nl
---
# Dataset Card for "Ted2020_en_es_fr_de_it_ca_pl_ru_nl"
This dataset is an extract of the TED2020 corpus focusing only on English, French, German, Italian, Polish, Russian and Dutch.
It is used to build multilingual biomedical language models.
A teacher model encodes the English sentence.
A student model encodes the sentences in the other languages and is trained by minimizing the Euclidean distance to the teacher encoding.
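
The teacher-student objective described above can be illustrated with a minimal sketch. It uses plain Python lists in place of real model outputs, and the function names `euclidean_distance` and `distillation_loss` are hypothetical. The card does not say whether the distance is squared or how it is averaged over a batch, so the batch mean of the unsquared distance used here is an assumption.

```python
import math


def euclidean_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distillation_loss(teacher_vecs, student_vecs):
    """Mean Euclidean distance between paired teacher and student encodings.

    Assumption: the batch mean of the unsquared distance; the dataset card
    only states that the Euclidean distance is minimized.
    """
    if len(teacher_vecs) != len(student_vecs):
        raise ValueError("teacher and student batches must have the same size")
    distances = [euclidean_distance(t, s) for t, s in zip(teacher_vecs, student_vecs)]
    return sum(distances) / len(distances)


# Toy 2-dimensional encodings (made-up numbers, not real model outputs).
teacher = [[0.0, 0.0], [1.0, 1.0]]  # encodings of two English sentences
student = [[3.0, 4.0], [1.0, 1.0]]  # encodings of their translations
loss = distillation_loss(teacher, student)
```

In an actual training loop the student encodings would come from a trainable model and the loss would be minimized with a gradient-based optimizer; the sketch only shows the quantity being minimized.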
The Ted2020_en_es_fr_de_it_ca_pl_ru_nl dataset is an extract of the TED2020 corpora focusing on English, French, German, Italian, Polish, Russian, and Dutch. It is used for building multilingual biomedical language models, where the teacher model encodes English sentences and the student model encodes other sentences by minimizing the Euclidean distance with the teacher encoding.
Provider:
nuvocare
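Summary of original information
Dataset card "Ted2020_en_es_fr_de_it_ca_pl_ru_nl"
Dataset overview
This dataset is extracted from the TED2020 corpus and focuses on English, French, German, Italian, Polish, Russian, and Dutch.
Dataset purpose
It is used to build multilingual biomedical language models.
Dataset configuration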
- Default configuration:
  - Train split: path data/train-*
  - Test split: path data/test-*
  - Validation split: path data/validation-*
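
As a hedged illustration of how the default configuration might be consumed, the sketch below wraps `load_dataset` from the Hugging Face `datasets` library. The repository ID is taken from the page title; the wrapper name `load_split` is hypothetical, and the sketch assumes the repository is publicly reachable and that `datasets` is installed.

```python
REPO_ID = "nuvocare/Ted2020_en_es_fr_de_it_ca_pl_ru_nl"
SPLITS = ("train", "test", "validation")


def load_split(split="train"):
    """Load one split of the default configuration from the Hugging Face Hub.

    Hypothetical helper: it only validates the split name and delegates to
    `datasets.load_dataset`. Calling it needs network access.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)


# Example usage (downloads data on first call):
#   validation = load_split("validation")
#   print(validation.num_rows)
#   print(validation[0]["en"])
```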
Dataset features
- Feature names and data types:
  - de: string
  - en: string
  - es: string
  - fr: string
  - it: string
  - nl: string
  - pl: string
  - ru: string
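
Given the eight string columns listed above, one row can be unpacked into English-to-target sentence pairs for teacher-student training. The sketch below is an assumption-laden illustration: the helper name `teacher_student_pairs` is hypothetical, and it assumes that a missing translation appears as an empty string or `None`, which the source does not confirm.

```python
LANGUAGE_COLUMNS = ["de", "en", "es", "fr", "it", "nl", "pl", "ru"]


def teacher_student_pairs(row, source="en"):
    """Return (source_sentence, target_language, target_sentence) triples for one row.

    Assumption: missing translations are empty strings or None and are skipped.
    """
    src = row.get(source)
    if not src:
        return []
    pairs = []
    for lang in LANGUAGE_COLUMNS:
        if lang == source:
            continue
        text = row.get(lang)
        if text:  # skip missing or empty translations
            pairs.append((src, lang, text))
    return pairs


# Made-up row for illustration; real rows come from the dataset.
example_row = {"en": "Thank you.", "fr": "Merci.", "de": "Danke.", "es": "", "it": None}
pairs = teacher_student_pairs(example_row)
```

Each triple pairs the sentence the teacher would encode with one sentence the student would encode.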
Dataset splits
- Train split:
  - Bytes: 191053803
  - Examples: 258098
- Test split:
  - Bytes: 4930156
  - Examples: 7213
- Validation split:
  - Bytes: 4326695
  - Examples: 6049
Dataset size
- Download size: 116856833 bytes
- Dataset size: 200310654 bytes
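
The split figures above can be cross-checked with a little arithmetic. The sketch below copies the numbers from the metadata, sums them, and computes each split's share of the examples; the variable names are illustrative only.

```python
# Figures copied from the dataset metadata.
SPLIT_INFO = {
    "train": {"num_bytes": 191053803, "num_examples": 258098},
    "test": {"num_bytes": 4930156, "num_examples": 7213},
    "validation": {"num_bytes": 4326695, "num_examples": 6049},
}
DATASET_SIZE = 200310654  # reported dataset size in bytes

total_examples = sum(info["num_examples"] for info in SPLIT_INFO.values())
total_bytes = sum(info["num_bytes"] for info in SPLIT_INFO.values())

# Fraction of all examples that falls into each split.
shares = {name: info["num_examples"] / total_examples for name, info in SPLIT_INFO.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The per-split byte counts add up exactly to the reported dataset size, and the three splits together hold 271360 examples.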
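Collected summary
Dataset introduction

Background and challenges
Background overview
This dataset is a multilingual parallel corpus extracted from the TED2020 corpus, with aligned text in nine languages: English, Spanish, French, German, Italian, Catalan, Polish, Russian, and Dutch. It contains about 271,000 rows and is mainly used to build multilingual biomedical language models with a teacher-student training approach, in which the teacher model encodes English sentences and the student model learns representations for the other languages by minimizing the Euclidean distance to the teacher encoding.
The content above was collected and summarized by 遇见数据集.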



