
ted_talks_iwslt

ModelScope community. Updated 2025-12-05; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/IWSLT/ted_talks_iwslt
# Dataset Card for Web Inventory of Transcribed & Translated (WIT) TED Talks

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://wit3.fbk.eu/home
- **Repository:** https://drive.google.com/file/d/1Cz1Un9p8Xn9IpEMMrg2kXSDt0dnjxc4z/view?usp=sharing
- **Paper:** https://www.aclweb.org/anthology/2012.eamt-1.60.pdf
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Mauro Cettolo](mailto:cettolo@fbk.eu), [Roldano Cattoni](mailto:cattoni@fbk.eu)

### Dataset Summary

The Web Inventory of Transcribed and Translated Talks (WIT3) is a collection of original TED talks and their translations. Translations are available in more than 100 languages (109 language codes are listed below), though the distribution across languages is not uniform.

To load a language pair that is not among the predefined configurations, pass the two language codes as a pair together with a year. For example:

`dataset = load_dataset("ted_talks_iwslt", language_pair=("it", "pl"), year="2014")`

The full list of languages is: 'af', 'am', 'ar', 'arq', 'art-x-bork', 'as', 'ast', 'az', 'be', 'bg', 'bi', 'bn', 'bo', 'bs', 'ca', 'ceb', 'cnh', 'cs', 'da', 'de', 'el', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fr-ca', 'ga', 'gl', 'gu', 'ha', 'he', 'hi', 'hr', 'ht', 'hu', 'hup', 'hy', 'id', 'ig', 'inh', 'is', 'it', 'ja', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'ltg', 'lv', 'mg', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'nb', 'ne', 'nl', 'nn', 'oc', 'pa', 'pl', 'ps', 'pt', 'pt-br', 'ro', 'ru', 'rup', 'sh', 'si', 'sk', 'sl', 'so', 'sq', 'sr', 'srp', 'sv', 'sw', 'szl', 'ta', 'te', 'tg', 'th', 'tl', 'tlh', 'tr', 'tt', 'ug', 'uk', 'ur', 'uz', 'vi', 'zh', 'zh-cn', 'zh-tw'.

The full list of years is: '2014', '2015', '2016'.

### Supported Tasks and Leaderboards

Machine translation, language modeling, and text generation.

### Languages

TED talks are mostly held in English (`en`). Almost all of the talks have been translated by volunteers into Arabic, Bulgarian, Chinese (simplified), French, Italian, Korean, Portuguese (Brazil) and Spanish. For about 70 other languages, the number of translated talks ranges from several hundred (e.g. Dutch, German, Hebrew, Romanian) to one (e.g. Hausa, Hupa, Bislama, Ingush, Maltese).
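The `language_pair` and `year` arguments from the Dataset Summary can be sanity-checked before a download is attempted. The sketch below is only an illustration and is not part of the dataset loader: `check_pair` is a hypothetical helper name, `KNOWN_CODES` is a hand-copied subset of the full language list, and `KNOWN_YEARS` mirrors the years listed in this card.

```python
# Hypothetical helper: validate arguments before calling load_dataset.
KNOWN_YEARS = ("2014", "2015", "2016")
# Subset of the language codes listed in this card (copy the full list for real use).
KNOWN_CODES = {"de", "en", "fr", "fr-ca", "hi", "it", "nl", "pl", "pt-br", "zh-cn"}


def check_pair(language_pair, year):
    """Return (language_pair, year) unchanged, or raise ValueError if invalid."""
    if len(language_pair) != 2 or language_pair[0] == language_pair[1]:
        raise ValueError("language_pair must hold two different codes")
    unknown = [code for code in language_pair if code not in KNOWN_CODES]
    if unknown:
        raise ValueError(f"unknown language code(s): {unknown}")
    if year not in KNOWN_YEARS:
        raise ValueError(f"year must be one of {KNOWN_YEARS}, got {year!r}")
    return tuple(language_pair), year


pair, year = check_pair(("it", "pl"), "2014")
print(pair, year)
# The validated values can then be passed on, as in the card's own example:
# dataset = load_dataset("ted_talks_iwslt", language_pair=pair, year=year)
```

Note that the card passes the year as a string (`"2014"`), so the sketch rejects the integer `2014` rather than converting it silently.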

The language codes in the dataset are: af, am, ar, arq, art, as, ast, az, be, bg, bi, bn, bo, bs, ca, ceb, cnh, cs, da, de, el, en, eo, es, et, eu, fa, fi, fil, fr, ga, gl, gu, ha, he, hi, hr, ht, hu, hup, hy, id, ig, inh, is, it, ja, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, ltg, lv, mg, mk, ml, mn, mr, ms, mt, my, nb, ne, nl, nn, oc, pa, pl, ps, pt, ro, ru, rup, sh, si, sk, sl, so, sq, sr, srp (Serbian, `sr`), sv, sw, szl, ta, te, tg, th, tl, tlh, tr, tt, ug, uk, ur, uz, vi, zh.

## Dataset Structure

### Data Instances

One example from the dataset is:

```
{'translation': {'hi': 'जब मार्च २०१४ में इबोला का प्रकोप छाया, पर्डिस सबेटी और उनकी टीम को वाइरस के जीनोम का अनुक्रमण करना था, सीखना था कि यह कैसे परवतिर्त होते हैं और फैलते हैं। सबेटी ने तुरंत ही अपने अनुसंधान को वेब में जारी किया, ताकि दुनिया भर के वाइरस ट्रैकर्स और वैज्ञानिक इस तत्काल लड़ाई में शामिल हो सकें। इस बातचीत में, वह दिखाती हैं कि सबका सहयोग ही कुंजी है वाइरस को रोकने के लिए--और लड़ने के लिए आगे आने वाले हमलों से। सबेटी ने कहा,"हमने खुले तौर पर काम किया, साझा किया और साथ काम किया"। "हमे दुनिया को एक वाइरस के विनाश से नहीं, पर अरबों दिलों और दिमागों की एकता से परिभाषित करना है"।',
                 'nl': 'Toen Ebola in maart 2014 uitbrak, zijn Pardis Sabeti en haar team aan het werk gegaan om het genoom in kaart te brengen. Zo ontdekten ze hoe het virus zich verspreidde en muteerde. Sabeti zette direct haar onderzoek op het internet, zodat wereldwijd virus-jagers en wetenschappers mee konden werken aan de strijd. In deze talk laat ze zien hoe die openheid geholpen heeft bij het stoppen van het virus en hoe het kan helpen bij de strijd tegen het volgende virus. "We moesten transparant werken, delen en samenwerken". Sabeti zegt:"Laat de wereld niet ten onder gaan aan een virus, maar verlicht worden door miljoenen harten en geesten die samenwerken."'}}
```

The original XML files are formatted like this example:

```
<file id="1">
  <head>
    <url>http://www.ted.com/talks/ryan_holladay_to_hear_this_music_you_have_to_be_there_literally.html</url>
    <pagesize>66634</pagesize>
    <dtime>Sun Jan 12 15:17:32 CET 2014</dtime>
    <content-type>text/html; charset=utf-8</content-type>
    <encoding>utf-8</encoding>
    <videourl>http://download.ted.com/talks/RyanHolladay_2013S.mp4</videourl>
    <videopath>talks/RyanHolladay_2013S.mp4</videopath>
    <transcription>
      <seekvideo id="2939">(Music)</seekvideo>
      <seekvideo id="7555">For any of you who have visited or lived in New York City,</seekvideo>
      <seekvideo id="11221">these shots might start to look familiar.</seekvideo>
      <seekvideo id="16116">This is Central Park,</seekvideo>
      . . .
      <seekvideo id="361992">for people to interact with</seekvideo>
      <seekvideo id="363709">and experience music.</seekvideo>
      <seekvideo id="365451">Thank you.</seekvideo>
      <seekvideo id="367495">(Applause)</seekvideo>
    </transcription>
    <talkid>1903</talkid>
    <title>Ryan Holladay: To hear this music you have to be there. Literally</title>
    <description>The music industry ......segments of sounds that only play when a listener is physically nearby. (Filmed at TED@BCG.)</description>
    <keywords>entertainment,music,technology</keywords>
    <image>http://images.ted.com/images/ted/d98c17773da6f84e9f915895c270c7ffd2de3778_389x292.jpg</image>
    <date>2014/01/12</date>
    <wordnum>885</wordnum>
    <charnum>5051</charnum>
  </head>
  <content>(Music) For any of you who have visited or lived in New York City, these shots might start to look familiar. This is Central Park, ............new ways for people to interact with and experience music. Thank you. (Applause)</content>
</file>
```

### Data Fields

The fields of the dataset are:

- translation:
  - `<lang1>`: text in `<lang1>`
  - `<lang2>`: translated text in `<lang2>`

Information about the original data files:

For each language, a single XML file is generated which includes all talks subtitled in that language. Each talk is enclosed in the tags `<file id="int">` and `</file>` and includes, among other tags:

| Tag | Description |
|---|:---|
| `<url>` | the address of the original HTML document of the talk |
| `<speaker>` | the name of the talk speaker |
| `<talkid>` | the numeric talk identifier |
| `<transcription>` | talk subtitles split into captions |
| `<date>` | the issue date of the talk |
| `<content>` | talk subtitles |

### Data Splits

The paper does not provide specific train/dev/test splits. However, the data can be split by the available years (2014, 2015, 2016).

## Dataset Creation

### Curation Rationale

TED Conference, based in California, has been posting all video recordings of its talks together with subtitles in English and their translations in more than 80 languages. Aside from its cultural and social relevance, this content, which is published under the Creative Commons BY-NC-ND license, also represents a precious language resource for the machine translation research community, thanks to its size, variety of topics, and covered languages.

### Source Data

#### Initial Data Collection and Normalization

The talks were collected from the [Ted Conference website](http://www.ted.com/).

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

Translations have been contributed by volunteers.

### Personal and Sensitive Information

No personal or sensitive information is provided in the dataset.
All talks are publicly available.

## Considerations for Using the Data

### Social Impact of Dataset

In statistical machine translation, large amounts of in-domain parallel data are usually required to properly train translation and reordering models. With more than 900 TED talks (as of 2011) translated into more than 90 languages, this dataset provides a useful resource for the MT research community. In turn, this enables easy access to a vast treasure trove of human knowledge.

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

The original dataset was curated by: [Mauro Cettolo](mailto:cettolo@fbk.eu), [Roldano Cattoni](mailto:cattoni@fbk.eu)

Author: Christian Girardi

For issues with the HuggingFace Dataset implementation, reach out to [Aakash Gupta](mailto:aakashg80@gmail.com).

### Licensing Information

cc-by-nc-nd-4.0

### Citation Information

```
@inproceedings{cettolo-etal-2012-wit3,
    title = "{WIT}3: Web Inventory of Transcribed and Translated Talks",
    author = "Cettolo, Mauro and Girardi, Christian and Federico, Marcello",
    booktitle = "Proceedings of the 16th Annual conference of the European Association for Machine Translation",
    month = may # " 28{--}30",
    year = "2012",
    address = "Trento, Italy",
    publisher = "European Association for Machine Translation",
    url = "https://www.aclweb.org/anthology/2012.eamt-1.60",
    pages = "261--268",
}
```

### Contributions

Thanks to [@skyprince999](https://github.com/skyprince999) for adding this dataset.
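
The raw XML layout shown under Data Instances can be read with Python's standard `xml.etree.ElementTree`. What follows is a minimal sketch under stated assumptions: `SAMPLE` is a shortened copy of the example record from this card, `parse_talk` is a hypothetical helper name, and the `<seekvideo>` `id` is kept as a plain integer because the card does not say what it measures.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example record from this card (most tags and captions omitted).
SAMPLE = """<file id="1">
  <head>
    <transcription>
      <seekvideo id="2939">(Music)</seekvideo>
      <seekvideo id="365451">Thank you.</seekvideo>
      <seekvideo id="367495">(Applause)</seekvideo>
    </transcription>
    <talkid>1903</talkid>
    <title>Ryan Holladay: To hear this music you have to be there. Literally</title>
    <keywords>entertainment,music,technology</keywords>
    <date>2014/01/12</date>
  </head>
  <content>(Music) Thank you. (Applause)</content>
</file>"""


def parse_talk(file_elem):
    """Collect a few fields from one <file> element into a plain dict."""
    head = file_elem.find("head")
    # Each caption becomes an (id, text) tuple, in document order.
    captions = [(int(c.get("id")), c.text or "") for c in head.iter("seekvideo")]
    return {
        "talkid": int(head.findtext("talkid")),
        "title": head.findtext("title"),
        "keywords": head.findtext("keywords", default="").split(","),
        "year": head.findtext("date", default="")[:4],  # "2014/01/12" -> "2014"
        "captions": captions,
        "content": file_elem.findtext("content", default=""),
    }


talk = parse_talk(ET.fromstring(SAMPLE))
print(talk["talkid"], talk["year"], len(talk["captions"]))  # prints: 1903 2014 3
```

The extracted `year` value is one way to group talks by year, in line with the suggestion under Data Splits. The card does not show how the many `<file>` records of a per-language file are wrapped, so the sketch parses a single record only.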

Provider: maas
Created: 2025-10-29