ted_multi
收藏魔搭社区2025-12-05 更新2025-12-06 收录
下载链接:
https://modelscope.cn/datasets/neulab/ted_multi
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for "ted_multi"
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://github.com/neulab/word-embeddings-for-nmt](https://github.com/neulab/word-embeddings-for-nmt)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 352.23 MB
- **Size of the generated dataset:** 791.01 MB
- **Total amount of disk used:** 1.14 GB
### Dataset Summary
Massively multilingual (60 language) data set derived from TED Talk transcripts.
Each record consists of parallel arrays of language and text. Missing and
incomplete translations will be filtered out.
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### plain_text
- **Size of downloaded dataset files:** 352.23 MB
- **Size of the generated dataset:** 791.01 MB
- **Total amount of disk used:** 1.14 GB
An example of 'validation' looks as follows.
```
This example was too long and was cropped:
{
"talk_name": "shabana_basij_rasikh_dare_to_educate_afghan_girls",
"translations": "{\"language\": [\"ar\", \"az\", \"bg\", \"bn\", \"cs\", \"da\", \"de\", \"el\", \"en\", \"es\", \"fa\", \"fr\", \"he\", \"hi\", \"hr\", \"hu\", \"hy\", \"id\", \"it\", ..."
}
```
### Data Fields
The data fields are the same among all splits.
#### plain_text
- `translations`: a multilingual `string` variable, with possible languages including `ar`, `az`, `be`, `bg`, `bn`.
- `talk_name`: a `string` feature.
### Data Splits
| name |train |validation|test|
|----------|-----:|---------:|---:|
|plain_text|258098| 6049|7213|
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Citation Information
```
@InProceedings{qi-EtAl:2018:N18-2,
author = {Qi, Ye and Sachan, Devendra and Felix, Matthieu and Padmanabhan, Sarguna and Neubig, Graham},
title = {When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?},
booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)},
month = {June},
year = {2018},
address = {New Orleans, Louisiana},
publisher = {Association for Computational Linguistics},
pages = {529--535},
abstract = {The performance of Neural Machine Translation (NMT) systems often suffers in low-resource scenarios where sufficiently large-scale parallel corpora cannot be obtained. Pre-trained word embeddings have proven to be invaluable for improving performance in natural language analysis tasks, which often suffer from paucity of data. However, their utility for NMT has not been extensively explored. In this work, we perform five sets of experiments that analyze when we can expect pre-trained word embeddings to help in NMT tasks. We show that such embeddings can be surprisingly effective in some cases -- providing gains of up to 20 BLEU points in the most favorable setting.},
url = {http://www.aclweb.org/anthology/N18-2084}
}
```
### Contributions
Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten) for adding this dataset.
# 数据集卡片:“ted_multi”
## 目录
- [数据集描述](#dataset-description)
- [数据集摘要](#dataset-summary)
- [支持任务与评测基准](#supported-tasks-and-leaderboards)
- [语言覆盖](#languages)
- [数据集结构](#dataset-structure)
- [数据实例](#data-instances)
- [数据字段](#data-fields)
- [数据拆分](#data-splits)
- [数据集构建](#dataset-creation)
- [构建初衷](#curation-rationale)
- [源数据](#source-data)
- [标注信息](#annotations)
- [个人与敏感信息](#personal-and-sensitive-information)
- [数据使用注意事项](#considerations-for-using-the-data)
- [数据集的社会影响](#social-impact-of-dataset)
- [偏差讨论](#discussion-of-biases)
- [其他已知局限性](#other-known-limitations)
- [附加信息](#additional-information)
- [数据集维护者](#dataset-curators)
- [授权信息](#licensing-information)
- [引用信息](#citation-information)
- [贡献者](#contributions)
## 数据集描述
- **主页:** [https://github.com/neulab/word-embeddings-for-nmt](https://github.com/neulab/word-embeddings-for-nmt)
- **代码仓库:** [需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **相关论文:** [需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **联系人:** [需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **下载的数据集文件大小:** 352.23 MB
- **生成的数据集大小:** 791.01 MB
- **总磁盘占用:** 1.14 GB
### 数据集摘要
该数据集为基于TED演讲文稿构建的超大规模多语言(60种语言)数据集。每条数据由语言与文本的并行数组构成,已过滤掉缺失或不完整的翻译内容。
### 支持任务与评测基准
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 语言覆盖
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## 数据集结构
### 数据实例
#### 纯文本
- **下载的数据集文件大小:** 352.23 MB
- **生成的数据集大小:** 791.01 MB
- **总磁盘占用:** 1.14 GB
以下是`validation`拆分的一个示例:
This example was too long and was cropped:
{
"talk_name": "shabana_basij_rasikh_dare_to_educate_afghan_girls",
"translations": "{"language": ["ar", "az", "bg", "bn", "cs", "da", "de", "el", "en", "es", "fa", "fr", "he", "hi", "hr", "hu", "hy", "id", "it", ..."
}
### 数据字段
所有数据拆分的字段格式保持一致。
#### 纯文本
- `translations`:多语言字符串变量,涵盖的语言包括`ar`、`az`、`be`、`bg`、`bn`等。
- `talk_name`:字符串特征。
### 数据拆分
| 拆分名称 | 训练集 | 验证集 | 测试集 |
|----------|--------:|-------:|-------:|
| plain_text | 258098 | 6049 | 7213 |
## 数据集构建
### 构建初衷
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 源数据
#### 初始数据收集与标准化
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### 源语言生产者是谁?
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 标注信息
#### 标注流程
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### 标注者是谁?
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 个人与敏感信息
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## 数据使用注意事项
### 数据集的社会影响
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 偏差讨论
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 其他已知局限性
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## 附加信息
### 数据集维护者
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 授权信息
[需补充更多信息](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### 引用信息
@InProceedings{qi-EtAl:2018:N18-2,
author = {Qi, Ye and Sachan, Devendra and Felix, Matthieu and Padmanabhan, Sarguna and Neubig, Graham},
title = {When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?},
booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)},
month = {June},
year = {2018},
address = {New Orleans, Louisiana},
publisher = {Association for Computational Linguistics},
pages = {529--535},
abstract = {神经机器翻译(Neural Machine Translation,简称NMT)系统的性能在低资源场景下往往表现不佳,这类场景中无法获取足够大规模的平行语料库。预训练词嵌入已被证明对提升自然语言分析任务的性能极具价值,而这类任务常面临数据匮乏的问题。但目前针对预训练词嵌入在NMT中的应用价值,尚未得到充分探索。本研究开展了五组实验,分析预训练词嵌入可在何时对NMT任务提供帮助。实验结果表明,在部分场景下这类词嵌入的效果出人意料——在最优配置下可带来最高20个BLEU评分的性能提升。},
url = {http://www.aclweb.org/anthology/N18-2084}
}
### 贡献者
感谢[@thomwolf](https://github.com/thomwolf)、[@lewtun](https://github.com/lewtun)、[@patrickvonplaten](https://github.com/patrickvonplaten)为本数据集的收录提供帮助。
提供机构:
maas
创建时间:
2025-10-10



