ggfox00000/dia-ICSIMeetingCorpus-all

Name: ggfox00000/dia-ICSIMeetingCorpus-all
Creator: ggfox00000
Published: 2026-04-21 15:03:30
License: not stated in the listing (the dataset card specifies CC-BY-4.0)

Hugging Face · updated 2026-04-21 · indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/ggfox00000/dia-ICSIMeetingCorpus-all

Resource description:

---
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
- voice-activity-detection
language:
- en
size_categories:
- n<1K
pretty_name: ICSI Meeting Corpus - full mirror (signals + annotations)
tags:
- icsi
- meeting
- speaker-diarization
- diarization
- asr
- rttm
- dialogue-acts
- audio
annotations_creators:
- expert-generated
source_datasets:
- extended|icsi-meeting-corpus
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: file_id
    dtype: string
  splits:
  - name: all
    num_examples: 75
---

# ICSI Meeting Corpus: Full Mirror (signals + annotations)

**Complete** mirror of the ICSI Meeting Corpus as distributed by the AMI Consortium (Edinburgh). All files are copied *as is* from the upstream distribution, including the folder structure.

## Contents

- **75 meetings** of real scientific/technical discussions (~72 h of audio)
- **Signals**: `Signals/<meeting>/<meeting>.interaction.wav`, the mixed "interaction" audio stream (the standard mixdown used for diarization and distant-mic ASR benchmarks)
- **Core annotations (NXT)**: `ICSI/`, containing segments, words, dialogue acts, speech quality, and speaker metadata in NXT XML format
- **Core + contributions**: `ICSIplus/`, containing everything in `ICSI/` plus third-party annotations (automatic prosody, automatic ASR, auto/Galley/TNO topics, Bales codings, extractive summaries, summarization, Wrede hotspots, topic segmentation, subjectivity, Anno22L, etc.)
- **Original MRT transcripts**: `ICSI_original_transcripts/`, containing 77 transcripts in the historical MRT format plus HTML documentation
- Language: **English (US/international, ICSI meetings)**
- License: **CC-BY-4.0** (see `CCBY4.0.txt` at the repository root)

## Structure

```
dia-ICSIMeetingCorpus-all/
├── README.md
├── CCBY4.0.txt
├── icsiBuild-*.manifest.txt
├── Signals/
│   └── <meeting>/<meeting>.interaction.wav             # 75 WAV
├── ICSI/                                               # NXT core v1.0
│   ├── 00README.txt
│   ├── LICENCE.txt
│   ├── ICSI-metadata.xml
│   ├── speakers.xml
│   ├── Segments/<meeting>.<chan>.segs.xml              # 494 files
│   ├── Words/<meeting>.<chan>.words.xml                # 494 files
│   ├── DialogueActs/<meeting>.<chan>.dialogue-acts.xml # 465 files
│   └── SpeechQuality/
├── ICSIplus/                                           # NXT core + contrib v1.0
│   ├── (same folders as ICSI/)
│   └── Contributions/
│       ├── Anno22L/
│       ├── AutomaticProsody/
│       ├── AutomaticSpeechRecognition/
│       ├── AutomaticTopics/
│       ├── BalesCodings/
│       ├── ExtractiveSummaries/
│       ├── GalleyTopics/
│       ├── Summarization/
│       ├── TNOTopics/
│       ├── TopicSegmentation/
│       ├── WredeHotspots/
│       ├── subjectivity/
│       └── README
└── ICSI_original_transcripts/
    ├── doc/
    ├── index.html
    └── transcripts/<meeting>.mrt                       # 77 .mrt
```

## Available meetings (75)

```
Bdb001, Bed002..Bed006, Bed008..Bed017, Bmr001..Bmr003, Bmr005..Bmr016,
Bmr018..Bmr031, Bns001..Bns003, Bro003..Bro005, Bro007..Bro008,
Bro010..Bro019, Bro021..Bro028, Bsr001, Btr001..Btr002, Buw001
```

## NXT format (Segments example)

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<nite:root nite:id="Bdb001.A.segs" xmlns:nite="http://nite.sourceforge.net/">
  <segment nite:id="Bdb001.segment.11" starttime="9.798" endtime="11.268" participant="mn017" timing-provenance="segment">
    <nite:child href="Bdb001.A.words.xml#id(Bdb001.vocalsound.3)"/>
  </segment>
  ...
</nite:root>
```

To build a standard RTTM from the NXT segments:

```python
from glob import glob
from xml.etree import ElementTree as ET

# Clark-notation name of the namespaced nite:id attribute
NITE_ID = "{http://nite.sourceforge.net/}id"

for seg_xml in sorted(glob("ICSI/Segments/*.segs.xml")):
    root = ET.parse(seg_xml).getroot()
    # nite:id looks like "Bdb001.A.segs"; the meeting ID is the first field
    meeting = root.attrib[NITE_ID].split(".")[0]
    # <segment> elements are not namespaced, so a plain tag name matches
    for seg in root.findall("segment"):
        spk = seg.attrib["participant"]
        t0 = float(seg.attrib["starttime"])
        t1 = float(seg.attrib["endtime"])
        # RTTM fields: type file channel onset duration <NA> <NA> speaker <NA> <NA>
        print(f"SPEAKER {meeting} 1 {t0:.3f} {t1-t0:.3f} <NA> <NA> {spk} <NA> <NA>")
```

## Usage

```python
from huggingface_hub import snapshot_download

root = snapshot_download("ggfox00000/dia-ICSIMeetingCorpus-all", repo_type="dataset")
# root/Signals/<meeting>/<meeting>.interaction.wav
# root/ICSI/Segments/<meeting>.<chan>.segs.xml
# root/ICSIplus/Contributions/...
```

## Source

- ICSI Meeting Corpus: https://groups.inf.ed.ac.uk/ami/icsi/
- NXT annotations: AMI/ICSI Consortium
- Janin et al. 2003, *The ICSI Meeting Corpus*, ICASSP

## License

**CC-BY-4.0**: see `CCBY4.0.txt`. This is a mirror of the official AMI Consortium distribution, provided to make the corpus easier to access for research.

## Citation

```bibtex
@inproceedings{janin2003icsi,
  author = {Janin, Adam and Baron, Don and Edwards, Jane and Ellis, Dan and Gelbart, David and Morgan, Nelson and Peskin, Barbara and Pfau, Thilo and Shriberg, Elizabeth and Stolcke, Andreas and Wooters, Chuck},
  title = {The {ICSI} Meeting Corpus},
  booktitle = {ICASSP},
  year = {2003},
}
```
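
## Additional example: speaking time per participant

The NXT segment attributes shown earlier (`participant`, `starttime`, `endtime`) are enough to compute simple per-speaker statistics without any audio. The following is a minimal sketch, not part of the upstream distribution: the helper name `speaking_time` is hypothetical, and segments `Bdb001.segment.12` and `Bdb001.segment.13` in the sample are invented for illustration (only `Bdb001.segment.11` comes from the example above). It assumes that segments missing any of the three attributes can be skipped.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET


def speaking_time(segs_xml: bytes) -> dict[str, float]:
    """Sum segment durations (seconds) per participant in one NXT segments document."""
    totals = defaultdict(float)
    root = ET.fromstring(segs_xml)
    # <segment> elements carry no namespace, so the plain tag name matches.
    for seg in root.iter("segment"):
        spk = seg.get("participant")
        t0, t1 = seg.get("starttime"), seg.get("endtime")
        if spk is None or t0 is None or t1 is None:
            continue  # assumption: incomplete segments are ignored
        totals[spk] += float(t1) - float(t0)
    return dict(totals)


# Illustrative input: the first segment is from the example above,
# the other two are made up for this sketch.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<nite:root nite:id="Bdb001.A.segs" xmlns:nite="http://nite.sourceforge.net/">
  <segment nite:id="Bdb001.segment.11" starttime="9.798" endtime="11.268" participant="mn017"/>
  <segment nite:id="Bdb001.segment.12" starttime="12.000" endtime="13.500" participant="mn017"/>
  <segment nite:id="Bdb001.segment.13" starttime="13.000" endtime="13.250" participant="fe016"/>
</nite:root>"""

for spk, secs in sorted(speaking_time(SAMPLE).items()):
    print(f"{spk}\t{secs:.3f}")
```

To apply this to the real corpus, read each file under `ICSI/Segments/` in binary mode and pass its bytes to the function. Overlapping speech is counted once per speaker, so the totals across speakers can exceed the meeting duration.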

