MixtureVitae-VALID

Name: MixtureVitae-VALID
Creator: maas
Published: 2025-11-27 16:52:26
License: No description provided

ModelScope community. Updated 2025-11-27. Indexed 2025-11-03.

Download link:

https://modelscope.cn/datasets/ontocord/MixtureVitae-VALID

Description:

## Licenses

All videos in VALID are CC BY, as declared by their original uploaders on YouTube. We publish the audio snippets of these videos and select image frames here under these rights and under the principles of fair use. However, we cannot guarantee that the original uploaders had the rights to share the content. This dataset has only been lightly filtered for safety, by removing data records with high proportions of child-related words AND high proportions of sexual or violence-related words. Moreover, we disclaim all warranties, whether express or implied, and all liabilities with respect to infringement, fitness for a particular purpose, or otherwise.

## Intended Uses

- **Primary Use Case**: Training models for multimodal understanding, such as contrastive multimodal learning (e.g., CLIP, CLAP).
- **Not Recommended For**: Generation tasks, as the dataset's quality may not meet generative model requirements.

## Dataset Limitations

- **Quality**: Images and audio are sourced from YouTube and may vary in resolution and clarity.
- **Rights Uncertainty**: While videos are marked as CC-BY by the third-party authors of the videos, original rights may not be verifiable.
- **Biases**: The dataset's multilingual audio paired with English-only text may introduce linguistic biases. The large variety of videos may also introduce bias.

## Ethical Considerations

The dataset was built under the principles of fair use and CC-BY licensing. Its creation strives to align with the spirit of the EU AI Act, emphasizing transparency and safety in AI model development. Users must exercise caution and adhere to copyright and licensing rules when using VALID.

------

## Policy for Managing Video Deletion Requests

Our goal is to establish a clear process for removing videos from our dataset when requested by users or required by external factors, while balancing the rights of content owners, compliance with CC-BY licenses, and the community's ability to utilize the dataset for training and research purposes.

- **1. Respecting Content Owners' Rights:** All videos in the dataset are under the CC-BY license. As such, proper attribution will always be maintained as required by the license. If a content owner requests the removal of a video from the dataset, we will balance this request with the community's ability to train on the data, considering the original intent of the CC-BY license.
- **2. Deletion Request Process:**
  - Content owners or users can request the removal of a video by FIRST requesting that it be removed from YouTube: [Here](https://support.google.com/youtube/answer/2807622?) and [Here](https://support.google.com/youtube/answer/2801895?hl=en).
  - The owners or users should then verify that it has been removed from YouTube and report this to us through the feedback form [Here](https://forms.gle/f4zYzZpJU78SBPho9).
  - Requests must demonstrate that the video is no longer publicly available on YouTube.
  - We will remove the videos confirmed to be deleted in the next release of this dataset.
- **3. Verification and Balancing Interests:** All deletion requests will be verified by checking YouTube to ensure the video is no longer available. We may also remove a video in our sole discretion. Decisions on video removal will take into account:
  - The rights and wishes of content owners, including their ability to remove their videos from public availability.
  - The community's need for robust datasets for training and research.
  - The spirit of the CC-BY license, which permits redistribution and use with proper attribution.
- **4. Responsibilities for Derivative Datasets:** Users creating derivative datasets must ensure compliance by deleting videos listed in `delete_these_videos.json`.
- **5. Proactive Deletion:** Videos may be removed proactively under the following circumstances:
  - Requests from the hosting provider (e.g., Hugging Face).
  - Legal requirements or enforcement actions.
  - Internal decisions.
- **6. Community Considerations:**
  - The community is encouraged to respect the balance between individual content owners' wishes and the public benefit derived from open access datasets.
  - Efforts will be made to keep the dataset robust while honoring legitimate requests for content removal.
- **7. Updates:** Users are encouraged to check `delete_these_videos.json` from time to time to ensure their copy of the dataset is up to date.

------

## Related Materials:

- If you are looking for CC-BY YouTube transcripts of videos, check out PleIAs' [YouTube-Commons](https://huggingface.co/datasets/PleIAs/YouTube-Commons).
- Hugging Face has also created an excellent CC-BY YouTube video dataset: [Finevideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo).
- LAION is also building a dataset [Here](https://huggingface.co/datasets/laion/laion-audio-preview) which includes YouTube audio snippets paired with Gemini-generated captions.

## Acknowledgement and Thanks

This dataset was built by Ontocord.AI in cooperation with Grass and LAION.AI. It was created as part of our SafeLLM/Aurora-M2 project in order to build safe multimodal models that comply with the EU AI Act. This dataset was built on a subset of the Grass Video Repository, a massive video dataset of Creative Commons videos. We deeply thank Hugging Face and the open source community for their support.

## About the Contributors:

- [**Grass**](https://www.getgrass.io/) is committed to making the public web accessible again. Through its network of millions of globally distributed nodes, it is capable of collecting petabyte-scale datasets for a variety of use cases, including training AI models. The network is run exclusively by users who have downloaded an application to their devices, allowing them to contribute their unused internet bandwidth to the network. On X: @getgrass_io
- [**LAION**](https://www.laion.ai) is a non-profit organization that provides datasets, tools and models to liberate machine learning research. By doing so, we encourage open public education and a more environment-friendly use of resources by reusing existing datasets and models.
- [**Ontocord**](https://www.ontocord.ai/) is dedicated to making legally compliant AI. Our mission is to make our AGI future lawful and accessible to everyone.
- [**Alignment Lab AI**](https://x.com/alignment_lab): Our mission is to build a future leveraging AI as a force for good and as a tool that enhances human lives. We believe everyone deserves to harness the power of personal intelligence.
- And many others ...

## Citation

```
@misc{Huu2024VALID,
  title = {VALID (Video-Audio Large Interleaved Dataset)},
  author = {Huu Nguyen and Ken Tsui and Andrej Radonjic and Christoph Schuhmann},
  year = {2024},
  url = {https://huggingface.co/datasets/ontocord/VALID},
}
```
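
Items 4 and 7 of the deletion policy ask maintainers of derivative datasets to drop every video listed in `delete_these_videos.json`. The sketch below shows one way that filtering step might look in Python. The dataset card does not document the file's schema, so the sketch ASSUMES it is a JSON array of YouTube video-ID strings; the record layout and the `video_id` key are likewise hypothetical and should be adapted to the actual data.

```python
import json


def parse_deleted_ids(json_text):
    """Parse the contents of delete_these_videos.json.

    ASSUMPTION: the file holds a JSON array of video-ID strings.
    The real schema is not documented in the dataset card.
    """
    data = json.loads(json_text)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of video IDs")
    return {str(item) for item in data}


def drop_deleted(records, deleted_ids, id_key="video_id"):
    """Return only the records whose ID is not on the deletion list.

    `records` is an iterable of dicts; `id_key` is a hypothetical
    field name for the YouTube video ID.
    """
    return [r for r in records if r.get(id_key) not in deleted_ids]
```

A derivative-dataset build might read the file with `pathlib.Path("delete_these_videos.json").read_text(encoding="utf-8")`, pass the text to `parse_deleted_ids`, and run `drop_deleted` over its records before each release.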


Provider:

maas

Created:

2025-10-11
