stcoats/MD_NLP

Name: stcoats/MD_NLP
Creator: stcoats
Published: 2026-03-23 08:52:54
License: cc-by-nc-4.0 (from the dataset card metadata)

Hugging Face · Updated 2026-03-23 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/stcoats/MD_NLP


Description:

---
dataset_info:
  features:
  - name: interview_id
    dtype: int64
  - name: segment_id
    dtype: string
  - name: role
    dtype: string
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  - name: transcript
    dtype: string
  - name: word_tokens
    list:
    - name: word
      dtype: string
    - name: start
      dtype: float32
    - name: end
      dtype: float32
  - name: audio
    dtype:
      audio:
        decode: false
  - name: student_sex
    dtype: string
  - name: state
    dtype: string
  - name: town_city
    dtype: string
  - name: recording_year
    dtype: string
  - name: institution
    dtype: string
  splits:
  - name: train
    num_bytes: 18259212701
    num_examples: 257357
  download_size: 15639326197
  dataset_size: 18259212701
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-nc-4.0
language:
- en
---

# MD_NLP

## Dataset Description

**MD_NLP** is a discourse-annotated, word-aligned, and georeferenced corpus derived from the narrative portion of the Mitchell–Delbridge recordings, a large mid-20th-century archive of Australian English. The corpus was built from archival WAV recordings with an automated pipeline that combines WhisperX-based ASR, neural speaker diarization, LLM-assisted discourse-role correction, and Montreal Forced Aligner boundary refinement. The released dataset consists of short, role-consistent narrative segments with transcripts, word-level timestamps, linked audio, and selected sociodemographic metadata.

- **Curated by:** Steven Coats
- **Institution:** University of Oulu
- **Language(s):** English (Australian English)
- **License:** CC BY-NC 4.0 (`cc-by-nc-4.0`)
- **Related paper:** *MD_NLP: Reconstructing an Australian English Heritage Dialect Corpus from the Mitchell–Delbridge Recordings through LLM-Assisted Speaker Attribution*

## Dataset Summary

The source archive comprises recordings of 7,735 Australian secondary school pupils from 327 locations across Australia, recorded in 1959–1960.
MD_NLP includes the spontaneous narrative component of these recordings rather than the read word-list and sentence materials more commonly used in previous research.

The dataset is intended for research on:

- Australian English variation
- dialectology and sociolinguistics
- discourse structure and turn-taking
- corpus phonetics
- ASR, diarization, and alignment on legacy speech recordings

## Dataset Structure

Each row corresponds to a short, role-consistent segment.

### Fields

- **interview_id**: numeric interview identifier
- **segment_id**: unique segment identifier
- **role**: discourse role label (`Student` or `Teacher`)
- **start**: segment start time in seconds
- **end**: segment end time in seconds
- **transcript**: transcript text for the segment
- **word_tokens**: list of word-level tokens, each with `word`, `start`, and `end` times
- **audio**: path/reference to the corresponding audio segment (stored with `decode: false`)
- **student_sex**: recorded sex metadata for the student
- **state**: Australian state or territory
- **town_city**: locality
- **recording_year**: recording year
- **institution**: school/institution name

### Split

The current release contains one split:

- **train**: 257,357 segments

## Corpus Size

- **Recording length:** 214.14 hours
- **Speech duration:** 137.95 hours
- **Turns:** 71,929
- **Word count:** 1,791,856

Role-based totals:

| Metric | Student | Teacher | Total |
|---|---:|---:|---:|
| Speech duration (h) | 92.71 | 45.24 | 137.95 |
| Turns | 46,026 | 25,903 | 71,929 |
| Word count | 1,155,994 | 635,862 | 1,791,856 |

## Source Data

Mitchell, Alexander George and Arthur Delbridge. (1998). The speech of Australian adolescents: Research data and recordings collected by AG Mitchell and Arthur Delbridge in 1959 and 1960. The University of Sydney. https://doi.org/10.25910/jkwy-wk76

The dataset is derived from the Mitchell–Delbridge recordings, made by schoolteacher volunteers in 1959 and 1960 in 327 locations across all Australian states and territories.
The original archive contains read materials and a short narrative component. MD_NLP includes only the narrative recordings. The narratives typically involve brief teacher–student interaction, though some recordings are more monologic. Recording conditions vary substantially across sites.

## Processing

The corpus was created using the following pipeline:

1. **WhisperX** for automatic speech recognition and initial word alignment
2. **Pyannote** for speaker diarization
3. **LLM-assisted discourse-role correction** (Gemini 2.5-flash) to distinguish `Teacher` and `Student`
4. **Montreal Forced Aligner (MFA)** for boundary refinement
5. Reconstruction into short, role-consistent segments with word-level timing

The released transcripts preserve the original WhisperX tokenization while using refined timestamps where alignment succeeded.

## Evaluation

Speaker-role attribution was evaluated on 10 manually checked narratives (approximately 30 minutes of speech; 185 turns).

| System | Accuracy |
|---|---:|
| Baseline (WhisperX + Pyannote) | 62.70% |
| Full pipeline (LLM-assisted) | 95.68% |

These results indicate that the LLM-assisted role-correction step substantially improves turn-level speaker attribution in interview-style archival recordings.

## Intended Use

MD_NLP is intended for research use, especially for:

- regional and social variation in Australian English
- discourse and interactional structure
- corpus phonetics and time-aligned speech analysis
- geographically explicit dialect research
- evaluation of ASR, diarization, and alignment methods on legacy speech

## Limitations

- The corpus is derived from archival recordings with variable audio quality.
- Some interviewer speech is faint, partially absent, or missing.
- Transcripts are automatically generated and corrected, not manually transcribed throughout.
- Some alignment boundaries may remain imperfect despite MFA refinement.
- Metadata reflect archival source records and may contain inconsistencies or omissions.

## Sensitive Information

The dataset contains speech-derived transcripts and linked metadata fields such as sex, institution, state, town/city, and recording year. These are historical archival data. Users should handle the dataset in accordance with the license and any archive-specific restrictions.

## Citation

If you use this dataset, please cite the associated paper.

**BibTeX**

```bibtex
@inproceedings{coats2026mdnlp,
  title={MD\_NLP: Reconstructing an Australian English Heritage Dialect Corpus from the Mitchell--Delbridge Recordings through LLM-Assisted Speaker Attribution},
  author={Coats, Steven},
  booktitle={Proceedings of LREC 2026},
  year={2026}
}
```
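## Example: Role-Based Totals

The role-based totals reported under Corpus Size could be approximated from the segment rows along the following lines. This is a minimal sketch, not the author's procedure: it assumes each row is a plain dict with the documented `role`, `start`, `end`, and `word_tokens` fields, that segment duration is `end - start`, and that the word count equals the number of entries in `word_tokens`. The function name `role_totals` and the sample rows are hypothetical, and the card reports turns rather than segments, so segment counts will not match the turn counts.

```python
from collections import defaultdict


def role_totals(rows):
    """Sum duration (hours), word count, and segment count per discourse role.

    Assumes duration = end - start and word count = len(word_tokens);
    both are illustrative assumptions, not the card's stated definitions.
    """
    totals = defaultdict(lambda: {"hours": 0.0, "words": 0, "segments": 0})
    for row in rows:
        entry = totals[row["role"]]
        entry["hours"] += (row["end"] - row["start"]) / 3600.0
        entry["words"] += len(row["word_tokens"])
        entry["segments"] += 1
    return dict(totals)


# Hypothetical rows shaped like the documented schema (not real corpus data).
sample_rows = [
    {
        "role": "Teacher",
        "start": 0.0,
        "end": 1.5,
        "word_tokens": [
            {"word": "What", "start": 0.1, "end": 0.4},
            {"word": "happened?", "start": 0.5, "end": 1.2},
        ],
    },
    {
        "role": "Student",
        "start": 1.5,
        "end": 5.1,
        "word_tokens": [
            {"word": "We", "start": 1.6, "end": 1.8},
            {"word": "went", "start": 1.9, "end": 2.3},
            {"word": "swimming.", "start": 2.4, "end": 3.2},
        ],
    },
    {
        "role": "Student",
        "start": 5.1,
        "end": 8.7,
        "word_tokens": [
            {"word": "Then", "start": 5.3, "end": 5.6},
            {"word": "home.", "start": 5.8, "end": 6.4},
        ],
    },
]

totals = role_totals(sample_rows)
for role, entry in sorted(totals.items()):
    print(role, entry["segments"], entry["words"], round(entry["hours"], 6))
```

With the Hugging Face `datasets` library installed, the same function should accept rows streamed via `load_dataset("stcoats/MD_NLP", split="train", streaming=True)`, which avoids downloading the full split of roughly 15.6 GB before iterating.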
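## Example: Working with Word-Level Timing

The word-level timestamps in `word_tokens` lend themselves to simple timing measures such as pauses between words and speech rate. The sketch below assumes `word_tokens` arrives as a list of dicts with `word`, `start`, and `end` keys in seconds, as the field description suggests; the helper names `inter_word_gaps` and `speech_rate` and the sample tokens are hypothetical. Both measures depend only on differences between timestamps, so they hold whether the times are relative to the segment or to the full recording, which the card does not specify.

```python
def inter_word_gaps(word_tokens):
    """Return (previous_word, next_word, gap_seconds) for consecutive tokens."""
    gaps = []
    for prev, nxt in zip(word_tokens, word_tokens[1:]):
        gaps.append((prev["word"], nxt["word"], nxt["start"] - prev["end"]))
    return gaps


def speech_rate(word_tokens):
    """Words per second from the first word onset to the last word offset."""
    if not word_tokens:
        return 0.0
    span = word_tokens[-1]["end"] - word_tokens[0]["start"]
    return len(word_tokens) / span if span > 0 else 0.0


# Hypothetical tokens shaped like the documented schema (not real corpus data).
sample_tokens = [
    {"word": "we", "start": 0.0, "end": 0.25},
    {"word": "went", "start": 0.25, "end": 0.5},
    {"word": "home", "start": 1.0, "end": 1.5},
]

gaps = inter_word_gaps(sample_tokens)
rate = speech_rate(sample_tokens)
print(gaps)
print(rate)
```

Because the card notes that some alignment boundaries remain imperfect, negative or implausibly long gaps are worth filtering before any phonetic analysis.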

