Jzuluaga/atcosim_corpus
Source: Hugging Face. Updated 2022-12-05; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Jzuluaga/atcosim_corpus
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: segment_start_time
    dtype: float32
  - name: segment_end_time
    dtype: float32
  - name: duration
    dtype: float32
  splits:
  - name: test
    num_bytes: 471628915.76
    num_examples: 1901
  - name: train
    num_bytes: 1934757106.88
    num_examples: 7638
  download_size: 0
  dataset_size: 2406386022.6400003
tags:
- audio
- automatic-speech-recognition
- en-atc
- en
- robust-speech-recognition
- noisy-speech-recognition
- speech-recognition
task_categories:
- automatic-speech-recognition
language:
- en
multilinguality:
- monolingual
---
# Dataset Card for ATCOSIM corpus
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages and Other Details](#languages-and-other-details)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** [ATCOSIM homepage](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html)
- **Repository:** [GitHub repository (used in research)](https://github.com/idiap/w2v2-air-traffic)
- **Paper:** [The ATCOSIM Corpus of Non-Prompted Clean Air Traffic Control Speech](https://aclanthology.org/L08-1507/)
- **Paper of this research:** [How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications](https://arxiv.org/abs/2203.16822)
### Dataset Summary
The ATCOSIM Air Traffic Control Simulation Speech corpus is a speech database of air traffic control (ATC) operator speech, provided by Graz University of Technology (TUG) and the Eurocontrol Experimental Centre (EEC). It consists of ten hours of speech data recorded during real-time ATC simulations with a close-talk headset microphone. The utterances are in English and were spoken by ten non-native speakers. The database includes orthographic transcriptions and additional information on speakers and recording sessions. It was recorded and annotated by Konrad Hofbauer ([description here](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html)).
### Supported Tasks and Leaderboards
- `automatic-speech-recognition`. Fine-tuned models are available, for example [wav2vec2-large-960h-lv60-self-en-atc-atcosim](https://huggingface.co/Jzuluaga/wav2vec2-large-960h-lv60-self-en-atc-atcosim).
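As a rough illustration of how the fine-tuned checkpoint linked above might be used, the sketch below wraps the Hugging Face `transformers` speech-recognition pipeline. It assumes that `transformers` and an audio backend are installed and that the model ID matches the repository name in the link. The helper names `build_asr` and `transcribe` are hypothetical and not part of the dataset or the research repository.

```python
# Model ID taken from the link above. Whether it loads in a given environment
# depends on the installed `transformers` version (an assumption here).
MODEL_ID = "Jzuluaga/wav2vec2-large-960h-lv60-self-en-atc-atcosim"


def build_asr(model_id=MODEL_ID):
    """Return a speech-recognition pipeline for `model_id` (downloads weights)."""
    # Imported lazily so this module can be imported without `transformers`.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=model_id)


def transcribe(asr, wav_path):
    """Transcribe one audio file and return only the recognized text."""
    # The ASR pipeline returns a dict whose "text" key holds the transcript.
    return asr(wav_path)["text"]
```

Calling `transcribe(build_asr(), "some_recording.wav")` would download the model on first use; the file name here is a placeholder.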
### Languages and Other Details
The text and the recordings are in English. The participating controllers were all actively employed air traffic controllers with professional experience in the simulated sectors. The six male and four female controllers were of either German or Swiss nationality and were native speakers of German, Swiss German, or Swiss French. The controllers had agreed to the recording of their voices for language analysis and for research and development in speech technologies, and were asked to show their normal working behaviour.
## Dataset Structure
### Data Fields
- `id (string)`: the recording identifier of each example.
- `audio (audio)`: audio data for the given ID.
- `text (string)`: normalized transcript of the recording. See the [w2v2-air-traffic](https://github.com/idiap/w2v2-air-traffic) and [bert-text-diarization-atc](https://github.com/idiap/bert-text-diarization-atc) repositories for more details.
- `segment_start_time (float32)`: segment start time (normally 0).
- `segment_end_time (float32)`: segment end time.
- `duration (float32)`: duration of the recording, computed as `segment_end_time - segment_start_time`.
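The fields above can be inspected with the Hugging Face `datasets` library. The following is a minimal sketch, assuming `datasets` is installed and the repository ID is `Jzuluaga/atcosim_corpus` as in the download link; the helper names `load_atcosim` and `duration_matches`, and the tolerance value, are illustrative assumptions.

```python
def duration_matches(example, tol=1e-3):
    """Return True if `duration` equals end - start within `tol` seconds."""
    expected = example["segment_end_time"] - example["segment_start_time"]
    return abs(example["duration"] - expected) <= tol


def load_atcosim(split="test"):
    """Load one split of the corpus from the Hugging Face Hub (needs network)."""
    # Imported lazily so `duration_matches` works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("Jzuluaga/atcosim_corpus", split=split)
```

For example, `sample = load_atcosim("test")[0]` returns a dictionary with the six fields listed above, and `duration_matches(sample)` checks the documented relationship between the three timing fields.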
## Additional Information
### Licensing Information
The licensing status of the dataset hinges on the legal status of the [ATCOSIM corpus](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html) creators.
### Citation Information
Contributors who prepared, processed, normalized and uploaded the dataset to Hugging Face:
```
@article{zuluaga2022how,
title={How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications},
author={Zuluaga-Gomez, Juan and Prasad, Amrutha and Nigmatulina, Iuliia and Sarfjoo, Saeed and others},
journal={IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar},
year={2022}
}
@article{zuluaga2022bertraffic,
title={BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications},
author={Zuluaga-Gomez, Juan and Sarfjoo, Seyyed Saeed and Prasad, Amrutha and others},
journal={IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar},
year={2022}
}
@article{zuluaga2022atco2,
title={ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications},
author={Zuluaga-Gomez, Juan and Vesel{\`y}, Karel and Sz{\"o}ke, Igor and Motlicek, Petr and others},
journal={arXiv preprint arXiv:2211.04054},
year={2022}
}
```
Authors of the dataset:
```
@inproceedings{hofbauer-etal-2008-atcosim,
title = "The {ATCOSIM} Corpus of Non-Prompted Clean Air Traffic Control Speech",
author = "Hofbauer, Konrad and
Petrik, Stefan and
Hering, Horst",
booktitle = "Proceedings of the Sixth International Conference on Language Resources and Evaluation ({LREC}'08)",
month = may,
year = "2008",
address = "Marrakech, Morocco",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2008/pdf/545_paper.pdf",
}
```
Provider:
Jzuluaga
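Summary of original information
Dataset overview
Dataset name
- Name: ATCOSIM corpus
Dataset description
- Description: The ATCOSIM Air Traffic Control Simulation Speech corpus is a database of air traffic control (ATC) operator speech provided by Graz University of Technology (TUG) and the Eurocontrol Experimental Centre (EEC). It contains ten hours of speech recorded during real-time ATC simulations with a close-talk headset microphone. The speech is in English, spoken by ten non-native speakers. The database includes orthographic transcriptions and additional information on speakers and recording sessions.
Supported tasks
- Task: automatic speech recognition
Languages and other details
- Language: English
- Multilinguality: monolingual
Dataset structure
Data fields
- id (string): recording identifier of each example.
- audio (audio): audio data for the given ID.
- text (string): normalized transcript of the file.
- segment_start_time (float32): segment start time (normally 0).
- segment_end_time (float32): segment end time.
- duration (float32): duration of the recording, computed as segment_end_time - segment_start_time.
Dataset splits
- Test: 1901 examples, 471628915.76 bytes.
- Train: 7638 examples, 1934757106.88 bytes.
Dataset size
- Total size: 2406386022.6400003 bytes.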
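The split figures above can be cross-checked with a few lines of arithmetic. This sketch only uses the numbers from the metadata; the `SPLITS` dictionary and the `summarize` helper are hypothetical names introduced for illustration.

```python
# Split statistics copied from the dataset metadata.
SPLITS = {
    "test": {"num_examples": 1901, "num_bytes": 471628915.76},
    "train": {"num_examples": 7638, "num_bytes": 1934757106.88},
}


def summarize(splits):
    """Return total examples, total bytes, and each split's share of examples."""
    total_examples = sum(s["num_examples"] for s in splits.values())
    total_bytes = sum(s["num_bytes"] for s in splits.values())
    fractions = {
        name: s["num_examples"] / total_examples for name, s in splits.items()
    }
    return total_examples, total_bytes, fractions


total_examples, total_bytes, fractions = summarize(SPLITS)
print(total_examples)                      # total number of utterances
print(f"{total_bytes / 1e9:.2f} GB")       # total size in decimal gigabytes
print({name: round(frac, 3) for name, frac in fractions.items()})
```

The two splits add up to 9539 examples, with roughly one fifth held out for testing, and the byte counts sum to the reported dataset size.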
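Collected summary
Dataset introduction

Construction
In air traffic control, collecting and annotating speech data is essential for developing automatic speech recognition systems. The ATCOSIM corpus was built through a collaboration between Graz University of Technology and the Eurocontrol Experimental Centre: ten hours of speech were recorded with headset microphones during real-time air traffic control simulations. The dataset contains the speech of ten controllers who are non-native English speakers, recorded while they worked as they normally would. All speech was transcribed, and each segment carries a start time, an end time, and a duration.
Usage
The dataset is mainly used for automatic speech recognition and is particularly suited to evaluating models under domain shift. It can be loaded directly from Hugging Face, and its audio and text fields support end-to-end training of speech recognition models. It is already divided into train and test splits for validation and benchmarking. Combined with pre-trained models such as Wav2Vec 2.0, it also supports research on domain adaptation for air traffic control speech.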
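Benchmarking a speech recognition model on the test split usually means comparing its output with the reference `text` field, most often as a word error rate (WER). The sketch below is a plain-Python WER based on word-level edit distance. It is an illustrative implementation, not the scoring code used in the cited research, and the example sentences are invented rather than taken from the corpus.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substituted word out of four reference words.
print(word_error_rate("climb flight level three", "climb flight level tree"))
```

A corpus-level score would normally sum the edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.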
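Background and challenges
Background overview
The ATCOSIM corpus is an important resource for air traffic control speech research. It was jointly created in 2008 by Graz University of Technology and the Eurocontrol Experimental Centre, with Konrad Hofbauer among the core researchers. It targets automatic speech recognition for air traffic control communications and contains ten hours of speech from controllers who are non-native English speakers, recorded in simulated control sessions. It provides clean speech samples for domain-specific speech technology research and has supported the development of speech recognition systems for aviation communications.
Current challenges
Speech recognition for air traffic control faces distinctive challenges, including variation in non-native accents, dense specialized terminology, and reduced intelligibility in noisy environments. Building the ATCOSIM corpus required keeping the simulated recording sessions close to real operations, and the annotation involved time segmentation and text normalization so that speech and text align precisely. These challenges underline the need for domain-adapted speech recognition.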
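Common scenarios
Classic use cases
In air traffic control speech recognition, the ATCOSIM corpus is a classic corpus of non-prompted clean speech, commonly used to train and evaluate automatic speech recognition models. It contains controller speech from simulated sessions; its clean recordings and accurate transcripts make it a useful benchmark, particularly for studying domain adaptation and noise robustness.
Academic problems addressed
The ATCOSIM corpus supports research on domain shift in air traffic control speech recognition. With it, researchers can analyse how non-native accents, specialized terminology, and specific dialogue structures affect recognition performance, and it helps fill the gap in speech data resources for this specialized domain.
Practical applications
Models trained on the dataset can be integrated into real-time speech recognition systems that help controllers record and cross-check instructions, with the aim of improving the safety and efficiency of air traffic management. The dataset also provides a data basis for tools that assist communication between pilots and controllers.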
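Recent research on the dataset
Latest research directions
Recent work with the ATCOSIM corpus focuses on domain adaptation and robustness. Current studies evaluate pre-trained models such as Wav2Vec 2.0 under domain shift and use fine-tuning to improve recognition of non-native accents and specialized terminology. Researchers also combine natural language understanding models such as BERT with the data to develop joint speaker role and speaker change detection, which strengthens semantic analysis in automated control systems. This work also provides a benchmark for transferring speech technology across domains.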
The content above was collected and summarized by 遇见数据集.



