
EuroSpeech

ModelScope community | Updated 2026-03-27 | Indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/AI-ModelScope/EuroSpeech
Resource description:
# EuroSpeech Dataset

## Dataset Description

EuroSpeech is a large-scale multilingual speech corpus containing high-quality aligned parliamentary speech across 22 European languages. The dataset was constructed by processing parliamentary proceedings using a robust alignment pipeline that handles diverse audio formats and non-verbatim transcripts. More information can be found in the [paper](https://arxiv.org/abs/2510.00514).

### Dataset Summary

- **Languages**: 22 European languages (see detailed breakdown below)
- **Total aligned hours**: ~78,100 hours of initially aligned speech-text data
- **Quality-filtered subsets**:
  - CER < 30%: approximately 61,000 hours
  - CER < 20%: approximately 50,500 hours (this is the primary subset provided directly through the Hugging Face Datasets interface for all languages)
  - CER < 10%: approximately 32,200 hours
- **Domain**: Parliamentary proceedings (formal speaking style)
- **Audio segment length**: Typically 3-20 seconds
- **Format**: Audio segments with paired transcriptions

### Languages

EuroSpeech provides substantial data for previously under-resourced languages:

- 19 languages exceed 1,000 hours of data (CER < 20%)
- All 22 languages exceed 500 hours of data (CER < 20%)

Each row lists the country whose parliament supplied the data, together with the language code. All durations are in hours.

| Country | Language code | Total aligned (h) | CER < 30% (h) | CER < 20% (h) | CER < 10% (h) |
|---------|---------------|-------------------|---------------|---------------|---------------|
| Croatia | hr | 7484.9 | 5899.7 | 5615.8 | 4592.0 |
| Denmark | da | 7014.2 | 6435.0 | 5559.8 | 3443.7 |
| Norway | no | 5326.2 | 4578.8 | 3866.7 | 2252.2 |
| Portugal | pt | 5096.3 | 4036.7 | 3293.5 | 2105.9 |
| Italy | it | 4812.8 | 3539.6 | 2813.7 | 1767.3 |
| Lithuania | lt | 5537.9 | 3971.0 | 2681.2 | 956.6 |
| United Kingdom | en | 5212.2 | 3790.7 | 2609.3 | 1175.0 |
| Slovakia | sk | 2863.4 | 2722.4 | 2553.6 | 2070.8 |
| Greece | el | 3096.7 | 2717.6 | 2395.4 | 1620.9 |
| Sweden | sv | 3819.4 | 2862.6 | 2312.8 | 1360.1 |
| France | fr | 5476.8 | 2972.1 | 2249.8 | 1347.6 |
| Bulgaria | bg | 3419.6 | 2570.4 | 2200.1 | 1472.8 |
| Germany | de | 2472.2 | 2354.2 | 2184.4 | 1698.4 |
| Serbia | sr | 2263.1 | 1985.1 | 1855.7 | 1374.1 |
| Finland | fi | 2130.6 | 1991.4 | 1848.2 | 1442.2 |
| Latvia | lv | 2047.4 | 1627.9 | 1218.8 | 499.9 |
| Ukraine | uk | 1287.8 | 1238.3 | 1191.1 | 1029.8 |
| Slovenia | sl | 1338.2 | 1241.7 | 1156.4 | 900.5 |
| Estonia | et | 1701.1 | 1430.9 | 1014.9 | 382.5 |
| Bosnia & Herzegovina | bs | 860.2 | 781.9 | 691.3 | 447.8 |
| Iceland | is | 1586.1 | 974.1 | 647.4 | 171.4 |
| Malta | mt | 3281.6 | 1284.3 | 613.0 | 143.9 |
| **Total** | | **78128.6** | **61006.4** | **50572.9** | **32255.5** |

## Dataset Structure

### Data Instances

Each instance in the dataset consists of:

- Audio segment (3-20 seconds)
- Corresponding transcript text
- Metadata including language, source session, alignment quality metrics

### Data Splits

The dataset provides predefined train, development, and test splits for each language. To ensure data integrity and prevent leakage between sets, these splits are constructed by assigning entire parliamentary sessions (i.e., all segments derived from a single original long audio recording) exclusively to one of the train, development, or test sets. The exact proportions follow common practices (e.g., 80/10/10).

## Dataset Creation

### Source Data

The data was collected from parliamentary proceedings across 22 European nations. Parliamentary sessions offer high-quality speech in a formal register, typically featuring clear speech with good audio quality and professional transcripts.

### Data Collection and Processing

The dataset was constructed using a multi-stage pipeline:

1. **Data Sourcing and Metadata Collection**: Manual and scripted gathering of media/transcript links from parliamentary websites.
2. **Download Pipeline**: Automated retrieval of audio, video, and transcript files using specialized handlers for diverse source formats.
3. **Alignment Pipeline**:
   - Segmentation of long recordings into 3-20 second utterances using voice activity detection (VAD)
   - Transcription of segments using an ASR model to produce pseudo-labels
   - Alignment of segments to transcripts using a novel two-stage dynamic algorithm
   - Selection of best-aligned transcript formats and quality filtering
4. **Filtering**: CER-based filtering to create quality tiers (CER < 30%, < 20%, < 10%)

### Alignment Algorithm

The core of the alignment process is a novel two-stage dynamic algorithm specifically engineered for extreme robustness when matching ASR pseudo-labels to noisy, non-verbatim parliamentary transcripts:

1. **Coarse stage**: Uses a sliding window to rapidly scan the transcript, efficiently bypassing large irrelevant sections to identify a set of top-k candidate text spans via Character Error Rate (CER).
2. **Fine-tuning stage**: Performs a local search around promising candidates, optimizing start position and window size for the best CER.

A fallback mechanism restarts the search if no initial match meets a predefined quality threshold.
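
The two stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cer`, `coarse_candidates`, `refine`, `align`), the parameters `step`, `top_k`, `radius`, and `max_cer`, and the choice to treat the transcript span as the CER reference are all hypothetical. The fallback shown here (rescanning with a step of one character) is only a stand-in for the restart mechanism mentioned in the text.

```python
from typing import List, Optional, Tuple

Match = Tuple[float, int, int]  # (cer, start, end) offsets into the transcript


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def coarse_candidates(transcript: str, asr_text: str, step: int, top_k: int) -> List[Tuple[float, int]]:
    """Coarse stage: slide a fixed-size window and keep the top-k starts by CER."""
    window = len(asr_text)
    starts = range(0, max(1, len(transcript) - window + 1), step)
    scored = [(cer(transcript[s:s + window], asr_text), s) for s in starts]
    return sorted(scored)[:top_k]


def refine(transcript: str, asr_text: str, start: int, radius: int) -> Match:
    """Fine stage: local search over start offset and window size."""
    base = len(asr_text)
    best: Match = (float("inf"), start, start)
    first = max(0, start - radius)
    last = min(len(transcript) - 1, start + radius)
    for s in range(first, last + 1):
        for size in range(max(1, base - radius), base + radius + 1):
            e = min(len(transcript), s + size)
            score = cer(transcript[s:e], asr_text)
            if score < best[0]:
                best = (score, s, e)
    return best


def align(transcript: str, asr_text: str, step: int = 5, top_k: int = 3,
          radius: int = 5, max_cer: float = 0.3) -> Optional[Match]:
    """Return the best (cer, start, end) match, or None if nothing is good enough."""
    if not transcript or not asr_text:
        return None
    best: Match = (float("inf"), 0, 0)
    for _, start in coarse_candidates(transcript, asr_text, step, top_k):
        best = min(best, refine(transcript, asr_text, start, radius))
    if best[0] <= max_cer:
        return best
    if step > 1:
        # Assumed fallback: retry once with a dense, character-by-character scan.
        return align(transcript, asr_text, 1, top_k, radius, max_cer)
    return None
```

Calling `align(transcript, asr_text)` yields the CER and the character offsets of the matched span, so `transcript[start:end]` is the text paired with the audio segment. The quadratic edit distance keeps the sketch short; a corpus-scale pipeline would need something faster.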
## Dataset Use

### Intended Uses

The EuroSpeech dataset is intended for:

- Training and evaluating automatic speech recognition (ASR) systems
- Training and evaluating text-to-speech (TTS) systems
- Multilingual speech research
- Low-resource language speech technology development
- Cross-lingual transfer learning in speech models

### Citation Information

If you use this dataset, please cite:

```
@inproceedings{pfisterereurospeech,
  title={EuroSpeech: A Multilingual Speech Corpus},
  author={Pfisterer, Samuel and Gr{\"o}tschla, Florian and Lanzend{\"o}rfer, Luca A and Yan, Florian and Wattenhofer, Roger},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track}
}
```

## Considerations

### Data Quality

The dataset provides multiple quality tiers based on Character Error Rate (CER):

- CER < 30%: More data, but potentially lower quality alignments
- CER < 20%: Balanced quality-quantity trade-off (recommended for most applications)
- CER < 10%: Highest quality alignments, but reduced quantity

### Licensing Information

The licensing terms vary by country as each parliament has its own policies. The table below provides relevant sources for each parliament in our dataset. Please note that we do not guarantee the accuracy of this information and take no responsibility for any use that conflicts with applicable licenses or laws. Users are responsible for ensuring compliance with relevant terms.
#### Copyright and Licensing Information for each Parliament

| Country | Source |
|---------|--------|
| Croatia | [Legal Notice](https://www.sabor.hr/index.php/en/legal-notice) |
| Denmark | [Legal Notice](https://www.ft.dk/da/aktuelt/tv-fra-folketinget/deling-og-rettigheder#A5FB53FDE08B4CFBA457A63E7B364584) |
| Norway | [NLOD License](https://data.norge.no/nlod/en/2.0) |
| Portugal | [Portuguese Copyright Code](https://www.pgdlisboa.pt/leis/lei_mostra_articulado.php?artigo_id=484A0075&nid=484&tabela=leis&pagina=1&ficha=1&so_miolo=&nversao=#artigo) Article 75 |
| Italy | [Italian Parliament Website](https://www.senato.it/) references [CC BY 4.0 License](https://creativecommons.org/licenses/by/4.0/legalcode.it) |
| Lithuania | [Republic of Lithuania Law on Copyright and Related Rights](https://www.wipo.int/edocs/lexdocs/laws/en/lt/lt081en.pdf) Article 22 |
| United Kingdom | [Terms and Conditions](https://www.parliament.uk/site-information/copyright-parliament/pru-licence-agreements/downloading--sharing-terms--conditions/) for audio, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/) for transcripts |
| Slovakia | [Slovak Copyright Act](https://wipolex-res.wipo.int/edocs/lexdocs/laws/en/sk/sk096en.pdf) Chapter One Section 5e) |
| Greece | [Greek Copyright Law](https://eratospe.org/2121_1993_en..pdf) Article 2(5) and Article 25(1)(b) |
| Sweden | [Law (2022:818)](https://www.riksdagen.se/sv/dokument-och-lagar/dokument/svensk-forfattningssamling/lag-2022818-om-den-offentliga-sektorns_sfs-2022-818/#K2) |
| France | [Licence Ouverte](https://www.etalab.gouv.fr/licence-ouverte-open-licence/) |
| Bulgaria | [Copyright Policy](https://www.president.bg/static104/Copyright-and-Legal-Policy/?lang=en&skipMobile=1) references [CC BY 2.5 BG](https://creativecommons.org/licenses/by/2.5/bg/) |
| Germany | [Terms of Use](https://www.bundestag.de/resource/blob/296018/45ce89d3a71fea6b068511a93da129bb/nutzungsbedingungen_en.pdf) |
| Serbia | [Serbian Law on Copyright and Related Rights](https://wipolex-res.wipo.int/edocs/lexdocs/laws/en/rs/rs061en.html) Article 6(2) |
| Finland | [Copyright Act](https://wipolex-res.wipo.int/edocs/lexdocs/laws/en/fi/fi001en.pdf) Articles 9, 22, and 25 |
| Latvia | [Latvian Copyright Law](https://likumi.lv/ta/en/en/id/5138) Section 21 |
| Ukraine | [Law of Ukraine on Copyright and Related Rights](https://wipolex-resources-eu-central-1-358922420655.s3.amazonaws.com/edocs/lexdocs/laws/en/ua/ua210en_1.pdf) Article 8(1)(3) |
| Slovenia | [Copyright and Related Rights Act](https://wipolex-res.wipo.int/edocs/lexdocs/laws/en/si/si082en.html) Articles 46-51 |
| Estonia | [Copyright Act](https://www.riigiteataja.ee/en/eli/525112013002/consolide), [Estonian YouTube channel](https://www.youtube.com/riigikogu) references [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) |
| Bosnia & Herzegovina | [Copyright Law](https://original.co.ba/file/bih-copyright-law/36) Articles 44 and 47 |
| Iceland | [Copyright Act](https://www.wipo.int/wipolex/en/text/128153) Article 22 |
| Malta | [Re-Use of Public Sector Information Act](https://legislation.mt/eli/cap/546/20250311/eng) Chapter 546 |

### Limitations

- The dataset primarily represents formal parliamentary speech and may not generalize well to casual, spontaneous, or noisy speech environments.
- The dataset reflects the demographics and speaking styles of European parliamentarians, which may not be representative of the general population.
- Some languages have significantly more data than others, which could lead to performance disparities in multilingual models.
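
The CER-based quality tiers described under Data Quality are nested threshold filters: every segment in the CER < 10% tier also belongs to the CER < 20% and CER < 30% tiers. A minimal sketch of how per-tier hours could be tallied from segment metadata is shown below. The `Segment` class, its field names, and the tier labels are assumptions made for illustration; the actual metadata schema of the dataset may differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Segment:
    """Hypothetical per-segment metadata; field names are assumptions."""
    language: str
    duration_s: float  # segment length in seconds
    cer: float         # alignment CER as a fraction (0.20 means 20%)


# Tier label -> exclusive upper bound on CER, mirroring the three published tiers.
TIERS: Dict[str, float] = {"cer_lt_30": 0.30, "cer_lt_20": 0.20, "cer_lt_10": 0.10}


def hours_per_tier(segments: Iterable[Segment]) -> Dict[str, float]:
    """Sum segment durations (in hours) for each nested CER tier."""
    totals = {name: 0.0 for name in TIERS}
    for seg in segments:
        for name, threshold in TIERS.items():
            if seg.cer < threshold:
                totals[name] += seg.duration_s / 3600
    return totals
```

Applied to the full corpus, a tally of this kind would correspond to the per-tier columns of the language table, for example 50,572.9 of 78,128.6 aligned hours (about 65%) at CER < 20%.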
## Additional Information

### Dataset Curators

- Samuel Pfisterer ([@SamuelPfisterer1](https://huggingface.co/SamuelPfisterer1))
- Florian Grötschla ([@FloGr](https://huggingface.co/FloGr))
- Luca Lanzendörfer ([@lucala](https://huggingface.co/lucala))
- Florian Yan ([@floyan](https://huggingface.co/floyan))
- Roger Wattenhofer

### Maintenance Status

[Information about maintenance and update plans]

### Links

- [EuroSpeech on Hugging Face Datasets](https://huggingface.co/datasets/disco-eth/EuroSpeech)
- [EuroSpeech GitHub Repository](https://github.com/SamuelPfisterer/EuroSpeech)

Provider: maas
Created: 2025-05-20