Dalision/Omni2Sound_Result
Source: Hugging Face. Updated 2026-04-24; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/Dalision/Omni2Sound_Result
Resource description:
---
license: cc-by-nc-4.0
language:
- en
tags:
- audio-generation
- evaluation
- video-to-audio
- text-to-audio
- benchmark-results
task_categories:
- text-to-audio
---
<h1 align="center">Omni2Sound Evaluation Results</h1>
<p align="center">
<a href="https://arxiv.org/pdf/2601.02731"><img src="https://img.shields.io/badge/arXiv-2601.02731-red"></a>
<a href="https://omni2sound.github.io/"><img src="https://img.shields.io/badge/Project-Page-blue"></a>
<a href="https://github.com/omni2sound/Omni2Sound"><img src="https://img.shields.io/badge/GitHub-Code-black"></a>
<a href="https://huggingface.co/Dalision/Omni2Sound"><img src="https://img.shields.io/badge/HF-Model-yellow"></a>
</p>
<p align="center">
<b>CVPR 2026 (Highlight)</b>
</p>
## Overview
This repository contains the evaluation results of [Omni2Sound](https://huggingface.co/Dalision/Omni2Sound) on three sub-tasks:
- **VT2A** (Video + Text → Audio)
- **V2A** (Video → Audio)
- **T2A** (Text → Audio)
All results are evaluated on the [VGGSound-Omni benchmark](https://huggingface.co/datasets/Dalision/Omni2Sound_Benchmark) and stored as JSON files for reproducibility.
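
Because the results ship as plain JSON, they can be inspected with a few lines of Python. The sketch below is a minimal, unofficial example: it assumes the `huggingface_hub` package is installed for the download step, and it makes no assumption about file names or the layout inside each JSON file (neither is documented here), so it simply loads every `*.json` file it finds. The helper names `download_results` and `load_json_results` and the `omni2sound_results` directory are illustrative, not part of the repository.

```python
import json
from pathlib import Path


def download_results(local_dir="omni2sound_results"):
    """Fetch a snapshot of the dataset repo into local_dir and return its path."""
    # Imported lazily so the loader below also works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Dalision/Omni2Sound_Result",
        repo_type="dataset",
        local_dir=local_dir,
    )


def load_json_results(root):
    """Map each JSON file's path (relative to root) to its parsed content."""
    root = Path(root)
    results = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            results[path.relative_to(root).as_posix()] = json.load(f)
    return results
```

A typical session would call `load_json_results(download_results())` and then print the returned keys to see which files exist before relying on any particular structure.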
## Evaluation Setup
**Benchmark**: [Dalision/Omni2Sound_Benchmark](https://huggingface.co/datasets/Dalision/Omni2Sound_Benchmark) (VGGSound-Omni)
**Evaluation Toolkit**: [AV-Benchmark](https://github.com/hkchengrex/av-benchmark), the standardized evaluation toolkit from MMAudio, applied to 8-second clips following prior work.
**Metrics** cover four dimensions:
| Dimension | Metrics |
|---|---|
| Distribution Matching | FAD, FD, FD_PaSST, KL, KL_PaSST |
| Audio Quality | IS, IS_PaSST, PQ (Production Quality) |
| Semantic Alignment | CLAP, MS-CLAP (text-audio), IB / ImageBind (video-audio) |
| Temporal Alignment | DS / Desynchronization Score (Synchformer) |
All baseline models are re-evaluated using their official checkpoints with the same standardized toolkit and identical video/text conditions for fair comparison.
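
When comparing models across these metrics, keep in mind that they point in different directions. The sketch below encodes the usual reading of each metric (distances, divergences, and the desynchronization score are lower-is-better; IS, PQ, and the similarity scores are higher-is-better) and picks the best model for one metric. It is an illustrative helper, not part of AV-Benchmark: the dictionary keys reuse the abbreviations from the table above and may not match the field names in the JSON files, and `better_model` is a hypothetical function name.

```python
# Conventional direction of each metric: True means higher is better.
HIGHER_IS_BETTER = {
    # Distribution matching: distances and divergences, lower is better.
    "FAD": False, "FD": False, "FD_PaSST": False, "KL": False, "KL_PaSST": False,
    # Audio quality: higher is better.
    "IS": True, "IS_PaSST": True, "PQ": True,
    # Semantic alignment: similarity scores, higher is better.
    "CLAP": True, "MS-CLAP": True, "IB": True,
    # Temporal alignment: desynchronization, lower is better.
    "DS": False,
}


def better_model(metric, scores):
    """Return the model name with the best score for the given metric.

    scores maps model name -> numeric score for that metric.
    """
    pick = max if HIGHER_IS_BETTER[metric] else min
    return pick(scores, key=scores.get)
```

For example, with placeholder numbers that are not taken from the results, `better_model("FAD", {"model_a": 1.0, "model_b": 2.0})` selects `model_a`, while the same scores under `"CLAP"` select `model_b`.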
## Links
- **Model**: [Dalision/Omni2Sound](https://huggingface.co/Dalision/Omni2Sound)
- **Benchmark & Dataset**: [Dalision/Omni2Sound_Benchmark](https://huggingface.co/datasets/Dalision/Omni2Sound_Benchmark)
- **Evaluation Toolkit**: [hkchengrex/av-benchmark](https://github.com/hkchengrex/av-benchmark)
- **Paper**: [arXiv:2601.02731](https://arxiv.org/pdf/2601.02731)
- **Project Page**: [omni2sound.github.io](https://omni2sound.github.io/)
- **Code**: [github.com/omni2sound/Omni2Sound](https://github.com/omni2sound/Omni2Sound)
## Citation
```bibtex
@article{dai2026omni2sound,
title = {Omni2Sound: Towards Unified Video-Text-to-Audio Generation},
author = {Dai, Yusheng and Chen, Zehua and Jiang, Yuxuan and Gao, Baolong and
Ke, Qiuhong and Cai, Jianfei and Zhu, Jun},
journal = {arXiv preprint arXiv:2601.02731},
year = {2026}
}
```
## License
Released under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) (non-commercial use only).
Provider:
Dalision



