YouTube-Commons
Source: ModelScope community; updated 2025-12-05; indexed 2025-06-21
Download link:
https://modelscope.cn/datasets/PleIAs/YouTube-Commons
Description:
# 📺 YouTube-Commons 📺
**YouTube-Commons** is a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-BY license.
## Content
The collection comprises 22,709,724 original and automatically translated transcripts from 3,156,703 videos (721,136 individual channels).
In total, this represents nearly 45 billion words (44,811,518,375).
All the videos were shared on YouTube with a CC-BY license: the dataset provides all the necessary provenance information, including the title, link, channel name and upload date.
The corpus is multilingual with a majority of English-speaking content (71%) for original languages. Automated translations are provided for nearly all the videos in English, French, Spanish, German, Russian, Italian and Dutch.
## Uses
The collection aims to expand the availability of conversational data for research in AI, computational social science and digital humanities.
Most of the available resources under free licenses are written texts such as public domain works or open science articles.
The text can be used for training models and republished for reproducibility purposes.
## License and ethics
All the transcripts are part of a video shared under a CC-BY license. In accordance with the provisions of the license, every YouTube channel is fully credited.
While content under a free license can be lawfully reproduced in any setting, there is currently a debate over the legitimacy and proper ethical use of free content for pre-training large language models.
In accordance with the philosophy of Creative Commons, we recommend that this set be preferably used for open research. Furthermore, the license requires that the contribution of each individual author be properly credited. In a research context, the best way to achieve this aim would be to fully release the data sources used for training or, at the very least, provide extensive open documentation.
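
As a rough illustration of the crediting recommendation above, the following sketch builds per-video attribution lines from the provenance information the dataset provides (title, link, channel name, upload date). The field names `title`, `link`, `channel` and `upload_date`, the helper `build_credits`, and the sample records are all assumptions made for this example; the actual column names in the released files may differ.

```python
from collections import defaultdict


def build_credits(records):
    """Group video records by channel and return sorted attribution lines.

    The keys used here (title, link, channel, upload_date) are hypothetical:
    the dataset card names these provenance items but not the column names.
    """
    by_channel = defaultdict(list)
    for rec in records:
        by_channel[rec["channel"]].append(rec)

    lines = []
    for channel in sorted(by_channel):
        # Sort each channel's videos by link for a stable, reproducible order.
        for rec in sorted(by_channel[channel], key=lambda r: r["link"]):
            lines.append(
                f'"{rec["title"]}" by {channel} ({rec["upload_date"]}), '
                f'{rec["link"]}, licensed under CC BY'
            )
    return lines


# Made-up sample records, for illustration only.
sample = [
    {"title": "Intro to Tides", "channel": "Ocean Lab",
     "upload_date": "2021-03-04", "link": "https://example.org/v/2"},
    {"title": "Bread Basics", "channel": "Home Baker",
     "upload_date": "2020-11-30", "link": "https://example.org/v/1"},
]

for line in build_credits(sample):
    print(line)
```

A list like this could be published alongside a trained model as part of its open documentation; releasing the full list of data sources, as the text recommends, would remain the more complete option.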
## Future developments
The collection is far from covering the total amount of available YouTube videos under a Creative Commons license. We will continue to expand it significantly.
Additional releases will also focus on transcripts from video sources not available on YouTube (especially public service and university websites).
## Acknowledgements
The corpus was stored and processed with the generous support of Scaleway. It was built up with the support and concerted efforts of the state start-up LANGU:IA (start-up d’Etat), supported by the French Ministry of Culture and DINUM, as part of the prefiguration of the service offering of the Alliance for Language technologies EDIC (ALT-EDIC).
Pleias corpus collection projects have also been facilitated by the support, insights and cooperation of the open science LLM community (Occiglot, Eleuther AI, Allen AI).
<div style="text-align: center;">
<img src="https://github.com/mch-dd/datasetlogo/blob/main/scaleway.jpeg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/ministere.png?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
<img src="https://github.com/mch-dd/datasetlogo/blob/main/occiglot.jpg?raw=true" style="width: 33%; margin: 0 auto; display: inline-block;"/>
</div>
Provider:
maas
Created:
2025-06-19
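Collected summary
Dataset introduction

Background and challenges
Background overview
YouTube-Commons is a dataset of audio transcripts from more than 2.06 million CC-BY-licensed YouTube videos, totalling nearly 45 billion words in multiple languages (71% English). It aims to provide rich conversational data for AI research and open science, and requires users to follow the attribution terms of the CC-BY license.
The summary above was collected and generated by 遇见数据集.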



