
MCIF

ModelScope Community | Updated: 2025-12-05 | Indexed: 2025-11-03
Download link:
https://modelscope.cn/datasets/FBK-MT/MCIF
Resource description:
<p align="center"> <img src="./mcif_logo.png" width="600"> </p>

### Dataset Description, Collection, and Source

MCIF (Multimodal Crosslingual Instruction Following) is a multilingual, human-annotated benchmark based on scientific talks. It is designed to evaluate instruction following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities (speech, vision, and text) and four languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of the ability of multimodal large language models (MLLMs) to interpret instructions across languages and combine them with multimodal contextual information.

### License

- CC-BY-4.0

### Dataset Sources

- **Repository:** [MCIF](https://github.com/hlt-mt/mcif)
- **Paper:** [MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks](https://arxiv.org/abs/2507.19634)

## Dataset Structure

### Data Config

This dataset contains **4 splits** organized along two dimensions, following the split naming convention `{track}_{prompt_type}`.

Track (input duration):
* `long`: Full-length, unsegmented inputs
* `short`: Pre-segmented inputs

Prompt Type (prompt variation):
* `fixed`: Standardized prompts across all examples
* `mixed`: Includes prompt variations

Please note that all splits share the same set of original input audio and video files. The splits are meant to facilitate testing various use cases.

### Dataset Fields

| **Field** | **Type** | **Description** |
|-----------------|------------|-----------------------------------------------|
| `id` | `string` | Unique identifier for the sample. It starts with `QA` (question answering), `SUM` (summarization), `ASR` (transcription), or `TRANS` (translation). |
| `audio` | `str` | In the `long` track: path to full talk-level audio. In the `short` track: path to pre-segmented audio. |
| `video` | `str` | In the `long` track: path to full talk-level video. In the `short` track: path to pre-segmented video. |
| `text` | `string` | Transcript of the input. Only present in the `long` track. |
| `prompt_{en, de, it, zh}` | `string` | Instruction in English, German, Italian, or Chinese. |
| `metadata` | `string` | Metadata for question answering samples, with two entries: `qa_type`, one of `A` (audio), `V` (visual), `AV` (audio-visual), or `NA` (not answerable), and `qa_origin`, one of `Transcript`, `Abstract`, or `General`. |

The audio/video paths are relative within this repo. You can download the data by cloning this repo:

```
git clone https://huggingface.co/datasets/FBK-MT/MCIF
```

### References

The references are available in `MCIF.{short,long}.{en,de,it,zh}.ref.xml.gz` (navigate to the "Files and versions" tab or clone this repo).

### IWSLT 2025 Version

Part of MCIF was used in the [IWSLT 2025 instruction-following track](https://iwslt.org/2025/instruction-following). This test data is available under the branch `IWSLT2025`. You can access it as follows, where the second argument stands for one language and track combination from the pattern (for example, `en_long`):

```
from datasets import load_dataset
dataset = load_dataset("FBK-MT/MCIF", "{en,de,it,zh}_{long,short}", revision="IWSLT2025")
```

## Evaluation

Please use the official evaluation scripts from the [MCIF GitHub Repo](https://github.com/hlt-mt/mcif). The references are also available there.

## Changelog

### Version 1.1

- Fixed the German summarization prompt
- Renamed files so that the filename no longer includes the version name

## Citation

```
@misc{papi2025mcifmultimodalcrosslingualinstructionfollowing,
  title={MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks},
  author={Sara Papi and Maike Züfle and Marco Gaido and Beatrice Savoldi and Danni Liu and Ioannis Douros and Luisa Bentivogli and Jan Niehues},
  year={2025},
  eprint={2507.19634},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.19634},
}
```

## Dataset Card Contact

[@spapi](https://huggingface.co/spapi) and [@danniliu](https://huggingface.co/danniliu)
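The naming conventions described under Dataset Structure and References can be expanded mechanically. The following is a minimal sketch, not part of the official MCIF tooling: the helper names (`split_names`, `task_from_id`, `reference_filename`) and the sample IDs in the usage lines are hypothetical, and the only assumption taken from the card is that IDs begin with one of the four listed prefixes.

```python
from itertools import product

# Values taken from the dataset card.
TRACKS = ("long", "short")
PROMPT_TYPES = ("fixed", "mixed")
LANGS = ("en", "de", "it", "zh")

# ID prefixes and the tasks the card associates with them.
TASK_BY_PREFIX = {
    "QA": "question answering",
    "SUM": "summarization",
    "ASR": "transcription",
    "TRANS": "translation",
}


def split_names():
    """Expand the `{track}_{prompt_type}` convention into the 4 split names."""
    return [f"{track}_{ptype}" for track, ptype in product(TRACKS, PROMPT_TYPES)]


def task_from_id(sample_id):
    """Map a sample ID to its task, based only on the documented prefix."""
    for prefix, task in TASK_BY_PREFIX.items():
        if sample_id.startswith(prefix):
            return task
    raise ValueError(f"Unrecognized ID prefix: {sample_id!r}")


def reference_filename(track, lang):
    """Build a name following `MCIF.{short,long}.{en,de,it,zh}.ref.xml.gz`."""
    if track not in TRACKS or lang not in LANGS:
        raise ValueError(f"Unknown track/language: {track!r}, {lang!r}")
    return f"MCIF.{track}.{lang}.ref.xml.gz"


if __name__ == "__main__":
    print(split_names())
    print(task_from_id("QA_0"))  # hypothetical ID; only the prefix matters
    print(reference_filename("short", "de"))
```

Keeping the tracks, prompt types, and languages in plain tuples makes it easy to loop over every split or reference file when scripting an evaluation run.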

Provider: maas
Created: 2025-10-22