PARADE_audio

Source: ModelScope Community. Updated: 2026-01-06. Indexed: 2025-09-06.
Download link:
https://modelscope.cn/datasets/UCSC-VLAA/PARADE_audio
Resource description:
# AHELM: A Holistic Evaluation of Audio-Language Models

This repository contains datasets used in **AHELM: A Holistic Evaluation of Audio-Language Models**.

**Paper**: [AHELM: A Holistic Evaluation of Audio-Language Models](https://huggingface.co/papers/2508.21376)
**Project Page**: [https://crfm.stanford.edu/helm/audio/v1.0.0/](https://crfm.stanford.edu/helm/audio/v1.0.0/)
**Code (HELM framework)**: [https://github.com/stanford-crfm/helm](https://github.com/stanford-crfm/helm)

AHELM is a benchmark designed to holistically measure the performance of Audio-Language Models (ALMs) across 10 key aspects: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. It aggregates various datasets, including two new synthetic audio-text datasets:

* **PARADE**: Evaluates ALMs on avoiding stereotypes.
* **CoRe-Bench**: Measures reasoning over conversational audio through inferential multi-turn question answering.

The benchmark standardizes prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. All raw prompts, model generations, and outputs are available on the project website.

### Sample Usage

The datasets in this repository are used by the HELM (Holistic Evaluation of Language Models) framework. You can use the `crfm-helm` package to run evaluations.

First, install the package:

```sh
pip install crfm-helm
```

Then, you can run and summarize benchmarks:

```sh
# Run benchmark (example for MMLU, adapt run-entries for AHELM specific evaluations)
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10

# Summarize benchmark results
helm-summarize --suite my-suite

# Start a web server to display benchmark results
helm-server --suite my-suite
```

Then go to `http://localhost:8000/` in your browser.
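If you want to script the run and summarize steps, the following is a minimal sketch that only assembles the command lines shown in the sample usage. The helper names `build_helm_commands` and `run_pipeline` are hypothetical and not part of `crfm-helm`; the flags are exactly those from the sample, and the MMLU run entry is kept as a placeholder because this page does not list the AHELM-specific run entries.

```python
import shlex
import shutil
import subprocess


def build_helm_commands(run_entry, suite, max_eval_instances):
    """Return the run and summarize commands as argument lists.

    Hypothetical helper: it only mirrors the flags from the sample usage.
    """
    return [
        ["helm-run", "--run-entries", run_entry, "--suite", suite,
         "--max-eval-instances", str(max_eval_instances)],
        ["helm-summarize", "--suite", suite],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if dry_run:
            continue
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} not found; is crfm-helm installed?")
        subprocess.run(cmd, check=True)


# Placeholder run entry taken from the sample; adapt it for AHELM evaluations.
commands = build_helm_commands(
    "mmlu:subject=philosophy,model=openai/gpt2", "my-suite", 10
)
run_pipeline(commands)  # dry run: prints the commands without executing them
```

Calling `run_pipeline(commands, dry_run=False)` would execute the two steps in order. `helm-server` is left out of the pipeline because it keeps running to serve results, so it is better started separately.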

Provider:
maas
Created:
2025-04-21