big_bench_audio

Name: big_bench_audio
Creator: maas
Published: 2025-12-05 16:57:17
License: No description available

ModelScope community. Updated 2025-12-05; indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/ArtificialAnalysis/big_bench_audio

Resource description:

# Artificial Analysis Big Bench Audio

## Dataset Description

- **Leaderboard:** [https://artificialanalysis.ai/speech-to-speech](https://artificialanalysis.ai/speech-to-speech)
- **Point of Contact:** [info@artificialanalysis.ai](mailto:info@artificialanalysis.ai)

### Dataset Summary

Big Bench Audio is an audio version of a subset of Big Bench Hard questions. The dataset can be used for evaluating the reasoning capabilities of models that support audio input.

The dataset includes 1000 audio recordings covering all questions from the following Big Bench Hard categories. Descriptions are taken from [Suzgun et al. (2022)](https://arxiv.org/pdf/2210.09261):

- Formal Fallacies Syllogisms Negation (Formal Fallacies) - 250 questions
  - Given a context involving a set of statements (generated by one of the argument schemes), determine whether an argument, presented informally, can be logically deduced from the provided context.
- Navigate - 250 questions
  - Given a series of navigation steps to an agent, determine whether the agent would end up back at its initial starting point.
- Object Counting - 250 questions
  - Given a collection of possessions that a person has along with their quantities (e.g., three pianos, two strawberries, one table, and two watermelons), determine the number of a certain object/item class (e.g., fruits).
- Web of Lies - 250 questions
  - Evaluate the truth value of a random Boolean function expressed as a natural-language word problem.

### Supported Tasks and Leaderboards

- `Audio-to-Audio`: The dataset can be used to evaluate instruction-tuned audio-to-audio models. It is also suitable for testing Audio-to-Text pipelines. A leaderboard can be found at [https://artificialanalysis.ai/speech-to-speech](https://artificialanalysis.ai/speech-to-speech)

### Languages

All audio recordings are in English.
The audio is generated synthetically using 23 voices from top providers on the [Artificial Analysis Speech Arena](https://artificialanalysis.ai/text-to-speech/arena?tab=Leaderboard).

## Dataset Structure

### Data Instances

Each instance in the dataset includes four fields: category, official_answer, file_name, id

```
{
  "category":"formal_fallacies",
  "official_answer":"invalid",
  "file_name":"data\/question_0.mp3",
  "id":0
}
```

### Data Fields

- `category`: The associated Big Bench Hard category
- `official_answer`: The associated Big Bench Hard answer
- `file_name`: A path to an mp3 file containing the audio question
- `id`: An integer identifier for each question

## Dataset Creation

### Curation Rationale

The introduction of native audio-to-audio models provides exciting opportunities for simplifying voice agent workflows. However, it is important to understand whether this increase in simplicity comes at the expense of model intelligence or involves other tradeoffs.

We have created this dataset to enable benchmarking of native audio models on reasoning tasks. We leverage Big Bench Hard given its wide usage in the text domain, and we select the categories that are least likely to result in unfair penalisation for audio models. We therefore exclude categories that rely heavily on symbols or that require disambiguating the spelling of words, which cannot be done in an audio setting. Further, we require that all categories included in this dataset have an average human-rater score above 80% and a maximum achieved score of 100% in a text setting.

### Source Data

The text questions from [Big Bench Hard](https://arxiv.org/pdf/2210.09261) were taken verbatim, and the string ". Answer the question" was appended to each base question prior to generating audio versions of the question. This was done to keep the comparison as close as possible to Big Bench Hard whilst addressing an edge case where audio generations would sometimes not fully pronounce the final word.
In the original version, this could leave an answer option not fully pronounced, which we considered a critical failure. Our modified version successfully avoids these critical failures.

#### Generating the audio

Audio was generated from 23 possible voice configurations using models provided by OpenAI, Microsoft Azure and Amazon. These models have all been validated as having high human preference via the [Artificial Analysis Speech Arena](https://artificialanalysis.ai/text-to-speech/arena?tab=Leaderboard). Models were selected randomly during generation.

The full list of voices used is as follows:

OpenAI

- HD: alloy, echo, fable, onyx, nova and shimmer
- SD: alloy, echo, fable, onyx, nova and shimmer

Azure

- en-US-AndrewMultilingualNeural, en-US-BrianMultilingualNeural, en-US-AvaMultilingualNeural, en-US-EmmaMultilingualNeural, en-GB-RyanNeural, en-GB-AlfieNeural, en-GB-LibbyNeural and en-GB-SoniaNeural

AWS Polly

- Long Form: Gregory, Danielle and Ruth

#### Verifying the audio

We compute the Levenshtein distance between a transcribed version of the generated audio and the source text. We then normalise this value based on the length of the text to get a value between 0 and 1, and orient the score so that a value of 1 represents an exact match. We then manually review all audio files that score below a threshold of 0.85.

This process flags 35 audio files. After manual review of all of these audio files, we do not identify any deviation from the question in the audio.

We further compare the performance of GPT-4o on the original text and the transcribed text, and observe a drop of less than 1 percentage point for the transcribed variant when evaluated with a Sonnet 3.5 judge.

## Considerations for Using the Data

### Discussion of Biases

All audio is generated in English and primarily focuses on US and UK accents. Overfitting to this benchmark may lead to neglecting other lower-resource languages and accents.
The dataset also inherits any biases present in the categories we have selected from the original Big Bench Hard dataset.

## Additional Information

### Dataset Curators

- Micah Hill-Smith
- George Cameron
- Will Bosler

### Contact

You can reach us through:

- Email: [info@artificialanalysis.ai](mailto:info@artificialanalysis.ai)
- Contact form: [artificialanalysis.ai/contact](https://artificialanalysis.ai/contact)

### Citation Information

If your research leverages this dataset, consider citing Artificial Analysis, the original Big Bench paper and the Big Bench Hard paper.

```
@article{srivastava2022beyond,
  title={Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models},
  author={Srivastava, Aarohi and Rastogi, Abhinav and Rao, Abhishek and Shoeb, Abu Awal Md and Abid, Abubakar and Fisch, Adam and Brown, Adam R and Santoro, Adam and Gupta, Aditya and Garriga-Alonso, Adri{\`a} and others},
  journal={arXiv preprint arXiv:2206.04615},
  year={2022}
}

@article{suzgun2022challenging,
  title={Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them},
  author={Suzgun, Mirac and Scales, Nathan and Sch{\"a}rli, Nathanael and Gehrmann, Sebastian and Tay, Yi and Chung, Hyung Won and Chowdhery, Aakanksha and Le, Quoc V and Chi, Ed H and Zhou, Denny and Wei, Jason},
  journal={arXiv preprint arXiv:2210.09261},
  year={2022}
}
```
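
The audio verification step described in the card (a Levenshtein distance normalised to a 0 to 1 score, with manual review below 0.85) can be sketched as follows. This is a minimal illustration, not the curators' implementation: the card does not say which length is used for normalisation, so dividing by the longer of the two strings is an assumption, and the function names `levenshtein`, `similarity` and `needs_review` are hypothetical.

```python
REVIEW_THRESHOLD = 0.85  # review threshold stated in the dataset card


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


def similarity(source: str, transcript: str) -> float:
    """Score in [0, 1], where 1.0 is an exact match.

    Assumption: the distance is divided by the longer string's length,
    which guarantees the result stays within [0, 1].
    """
    longest = max(len(source), len(transcript))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(source, transcript) / longest


def needs_review(source: str, transcript: str,
                 threshold: float = REVIEW_THRESHOLD) -> bool:
    """True when the transcript is dissimilar enough to warrant manual review."""
    return similarity(source, transcript) < threshold


# Hypothetical question text and transcript, for illustration only.
source_text = "Is the argument valid or invalid? Answer the question"
transcribed = "Is the argument valid or invalid? Answer the question."
print(similarity(source_text, transcribed), needs_review(source_text, transcribed))
```

In practice the source text and transcript would likely be lowercased and stripped of punctuation before comparison, since speech-to-text output rarely matches the original formatting; the card does not state whether such normalisation was applied.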

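
The record layout from the dataset card (`category`, `official_answer`, `file_name`, `id`) lends itself to per-category scoring. The sketch below parses a few records and computes accuracy for each category. Only the first record comes from the card; the other records, the `navigate` category label, the `predictions` mapping and the case-insensitive exact-match comparison are assumptions made for illustration. The card itself mentions a judge model, so exact matching is a simplification.

```python
import json
from collections import defaultdict

# Raw string so that the JSON escape "\/" reaches the parser unchanged.
RAW_RECORDS = r'''[
  {"category": "formal_fallacies", "official_answer": "invalid", "file_name": "data\/question_0.mp3", "id": 0},
  {"category": "formal_fallacies", "official_answer": "valid", "file_name": "data\/question_1.mp3", "id": 1},
  {"category": "navigate", "official_answer": "Yes", "file_name": "data\/question_2.mp3", "id": 2}
]'''


def accuracy_by_category(records, predictions):
    """Return {category: fraction of correctly answered questions}.

    `predictions` maps a question id to the model's final answer string.
    A missing prediction counts as incorrect.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        category = record["category"]
        totals[category] += 1
        predicted = predictions.get(record["id"])
        expected = record["official_answer"].strip().lower()
        if predicted is not None and predicted.strip().lower() == expected:
            correct[category] += 1
    return {category: correct[category] / totals[category] for category in totals}


records = json.loads(RAW_RECORDS)
# json.loads decodes the escaped "\/" into a plain "/".
print(records[0]["file_name"])

predictions = {0: "invalid", 1: "invalid", 2: "yes"}  # hypothetical model outputs
print(accuracy_by_category(records, predictions))
```

Reporting accuracy per category rather than a single aggregate keeps a weak result on one task (for example, Object Counting) from being hidden by strong results on the others, which is feasible here because each category has the same number of questions.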

Provider: maas

Created: 2025-11-24
