sbintuitions/voicebench-ja

Name: sbintuitions/voicebench-ja
Creator: sbintuitions
Published: 2026-03-30 07:50:18
License: no description provided

Hugging Face · Updated 2026-03-30 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/sbintuitions/voicebench-ja

Resource overview:

---
language:
- ja
dataset_info:
- config_name: elyza
  features:
  - name: key
    dtype: string
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: reference
    dtype: string
  - name: eval_aspect
    dtype: string
  splits:
  - name: test
    num_bytes: 18600241
    num_examples: 36
  download_size: 15982235
  dataset_size: 18600241
- config_name: jamc-qa
  features:
  - name: key
    dtype: string
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: answer
    dtype: string
  - name: answer_choice
    dtype: string
  - name: category
    dtype: string
  splits:
  - name: test
    num_bytes: 1438325742.604
    num_examples: 1452
  download_size: 1138249746
  dataset_size: 1438325742.604
- config_name: m-ifeval
  features:
  - name: key
    dtype: string
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: constraints
    dtype: string
  splits:
  - name: test
    num_bytes: 94919092
    num_examples: 172
  download_size: 86315951
  dataset_size: 94919092
- config_name: spoken-elyza
  features:
  - name: audio
    dtype: audio
  - name: key
    dtype: string
  - name: text
    dtype: string
  - name: reference
    dtype: string
  - name: eval_aspect
    dtype: string
  splits:
  - name: test
    num_bytes: 18152498
    num_examples: 34
  download_size: 15565870
  dataset_size: 18152498
configs:
- config_name: elyza
  data_files:
  - split: test
    path: elyza/test-*
- config_name: jamc-qa
  data_files:
  - split: test
    path: jamc-qa/test-*
- config_name: m-ifeval
  data_files:
  - split: test
    path: m-ifeval/test-*
- config_name: spoken-elyza
  data_files:
  - split: test
    path: spoken-elyza/test-*
license: cc-by-sa-4.0
---

## Dataset Summary

This dataset was built to quantitatively evaluate the gap in intelligence and reasoning ability that appears in speech language models when the same input is given as audio rather than as text. It consists of four subsets created by applying speech synthesis to samples from the following three text benchmarks:

- [Elyza-tasks-100](https://huggingface.co/datasets/elyza/ELYZA-tasks-100)
- [M-IFEval](https://github.com/lightblue-tech/M-IFEval)
- [JamC-QA](https://huggingface.co/datasets/sbintuitions/JamC-QA)

The audio was synthesized with an in-house TTS model from SB Intuitions. Recordings from the [JVS](https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus) corpus were used as voice prompts during synthesis.

## Benchmarks

### Elyza

This subset consists of 36 samples extracted from [elyza-tasks-100](https://huggingface.co/datasets/elyza/ELYZA-tasks-100). Some of the texts were revised before speech synthesis was applied.

### Spoken-Elyza

This subset evaluates whether a model's responses are suitable for spoken dialogue. To build Spoken-Elyza, [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) was used to adapt the `reference` field of the Elyza subset for speech by removing elements that cannot be conveyed by voice, such as Markdown and symbols. As a final step, human listening verification was carried out to confirm that each response can be fully understood from audio alone. After processing and filtering, 34 samples remain.

### M-IFEval

This subset was created by revising the input prompts of [M-IFEval](https://github.com/lightblue-tech/M-IFEval/blob/main/data/ja_input_data.jsonl) into a form that can be read aloud and then applying speech synthesis. The evaluation constraints (`constraints`) are the same as in the original.

### JamC-QA

This subset was created by attaching the labels `A, B, C, D` to the multiple-choice QA samples of [JamC-QA](https://huggingface.co/datasets/sbintuitions/JamC-QA) and synthesizing them as speech. Of the 2309 samples, only the 1452 judged appropriate for evaluating speech language models are used.

## Licensing Information

### Text Data

- CC BY-SA 4.0

### Audio Data

The following restrictions apply to the audio data:

- Commercial use is prohibited
- Redistribution is prohibited

## Evaluation Example

### Response Generation

```python
import base64
import json

from datasets import Audio, load_dataset
from huggingface_hub import hf_hub_download
from qwen_omni_utils import process_mm_info
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model_name = "Qwen/Qwen2.5-Omni-7B"
device = "cuda"
dataset_name = "sbintuitions/voicebench-ja"
subset_name = "jamc-qa"  # ["elyza", "spoken-elyza", "m-ifeval", "jamc-qa"]
output_path = f"{subset_name}_results.jsonl"

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(model_name, device_map="auto", dtype="auto")
processor = Qwen2_5OmniProcessor.from_pretrained(model_name)

# Keep the audio as raw bytes (decode=False) so it can be base64-encoded below.
ds = load_dataset(dataset_name, subset_name, split="test").cast_column("audio", Audio(decode=False))


# Result formatters: one per subset family, each producing a flexeval-style record.
def format_elyza(item, response):
    return {
        "lm_output": response,
        "references": [item["reference"]],
        "task_inputs": {
            "messages": [{"role": "user", "content": item["text"]}],
            "eval_aspect": item["eval_aspect"],
        },
    }


def format_m_ifeval(item, response):
    return {
        "lm_output": response,
        "references": [],
        "task_inputs": {
            "constraints": json.loads(item["constraints"]),
        },
    }


def format_jamc_qa(item, response):
    return {
        "lm_output": response,
        "references": [item["answer_choice"]],
        "task_inputs": {"category": item["category"]},
    }


result_formatter = {
    "elyza": format_elyza,
    "spoken-elyza": format_elyza,
    "m-ifeval": format_m_ifeval,
    "jamc-qa": format_jamc_qa,
}[subset_name]

results = []
for item in ds:
    # Pass the audio to the model as a base64 data URI.
    audio_base64 = base64.b64encode(item["audio"]["bytes"]).decode("utf-8")
    conversation = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
                }
            ],
        },
        {"role": "user", "content": [{"type": "audio", "audio": "data:audio/wav;base64," + audio_base64}]},
    ]
    if subset_name == "jamc-qa":
        # For multiple-choice QA, ask the model to state its choice in a parseable form.
        conversation[0]["content"].append(
            {
                "type": "text",
                "text": 'Please show your choice in the `answer` field with only the choice letter, e.g., "answer": "C".',
            }
        )

    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
    inputs = processor(
        text=text,
        audio=audios,
        images=images,
        videos=videos,
        return_tensors="pt",
        padding=True,
        use_audio_in_video=False,
    )
    inputs = inputs.to(model.device).to(model.dtype)

    text_ids = model.generate(
        **inputs,
        use_audio_in_video=False,
        return_audio=False,
    )
    # Decode only the newly generated tokens, skipping the prompt.
    response = processor.batch_decode(text_ids[:, inputs["input_ids"].shape[-1] :], skip_special_tokens=True)[0]

    # Format results for flexeval_file compatible input
    results.append(result_formatter(item, response))

with open(output_path, "w", encoding="utf-8") as f:
    for result in results:
        print(json.dumps(result, ensure_ascii=False), file=f)

# Save required jsonnet file for evaluation
hf_hub_download(
    repo_id=dataset_name, repo_type="dataset", filename=f"{subset_name}-flexeval-metrics.jsonnet", local_dir="./"
)
```

### Scoring

Scoring is performed with [flexeval](https://github.com/sbintuitions/flexeval/tree/main).

```bash
export OPENAI_API_KEY="sk-..."  # for elyza and spoken-elyza
flexeval_file --eval_file <result-path.jsonl> --metrics <subset-metrics.jsonnet> --save_dir <path-to-save>
```

## Reference

```text
@misc{takamichi2019jvscorpusfreejapanese,
  title={JVS corpus: free Japanese multi-speaker voice corpus},
  author={Shinnosuke Takamichi and Kentaro Mitsui and Yuki Saito and Tomoki Koriyama and Naoko Tanji and Hiroshi Saruwatari},
  year={2019},
  eprint={1908.06248},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/1908.06248},
}

@misc{elyzatasks100,
  title={ELYZA-tasks-100: 日本語instructionモデル評価データセット},
  url={https://huggingface.co/elyza/ELYZA-tasks-100},
  author={Akira Sasaki and Masato Hirakawa and Shintaro Horie and Tomoaki Nakamura},
  year={2023},
}

@misc{zhao2026speechworthyalignmentjapanesespeechllms,
  title={Speech-Worthy Alignment for Japanese SpeechLLMs via Direct Preference Optimization},
  author={Mengjie Zhao and Lianbo Liu and Yusuke Fujita and Hao Shi and Yuan Gao and Roman Koshkin and Yui Sudo},
  year={2026},
  eprint={2603.12565},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2603.12565},
}

@article{Dussolle2025MIFEval,
  title={M-IFEval: Multilingual Instruction-Following Evaluation},
  author={Antoine Dussolle and Andrea Cardena Díaz and Shota Sato and Peter Devine},
  year={2025},
  journal={arXiv preprint},
  volume={arXiv:2502.04688},
  eprint={2502.04688},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.04688}
}

@inproceedings{Oka2025,
  author={岡照晃, 柴田知秀, 吉田奈央},
  title={JamC-QA: 日本固有の知識を問う多肢選択式質問応答ベンチマークの構築},
  year={2025},
  month={March},
  booktitle={言語処理学会第31回年次大会(NLP2025)},
  pages={839--844},
}
```
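For the `jamc-qa` subset, the generation example asks the model to report its choice in the form `"answer": "C"` and stores the gold label under `references`. Before running the official flexeval metrics, it can be useful to sanity-check the results file locally. The sketch below is not part of flexeval or of this dataset: the helper names `extract_choice`, `accuracy`, and `load_results` are hypothetical, and it assumes that each `references` entry holds a single letter from `A` to `D` and that the model follows the requested answer format.

```python
import json
import re
from typing import Iterable, Optional

# Matches forms such as `"answer": "C"` or `Answer: B` (assumed response format).
# Only the word "answer" is case-insensitive; the choice letter must be A-D.
ANSWER_PATTERN = re.compile(r'"?(?i:answer)"?\s*[:：]\s*"?([A-D])\b')


def extract_choice(response: str) -> Optional[str]:
    """Return the first choice letter found in the response, or None."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1) if match else None


def accuracy(records: Iterable[dict]) -> float:
    """Fraction of records whose extracted choice appears in `references`."""
    records = list(records)
    if not records:
        return 0.0
    # A response with no parseable choice yields None, which never matches.
    correct = sum(extract_choice(r["lm_output"]) in r["references"] for r in records)
    return correct / len(records)


def load_results(path: str) -> list:
    """Read a JSONL results file with one record per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Tiny in-memory example; replace with load_results("jamc-qa_results.jsonl").
    demo = [
        {"lm_output": '{"answer": "A"}', "references": ["A"]},
        {"lm_output": "I am not sure.", "references": ["B"]},
    ]
    print(f"accuracy = {accuracy(demo):.3f}")
```

Responses that do not follow the requested format count as incorrect here, so a low score may reflect formatting failures rather than wrong answers. The figures reported for the benchmark should come from the downloaded `jamc-qa-flexeval-metrics.jsonnet` configuration, not from this check.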

Provider:

sbintuitions