
Latin-Audio

ModelScope community. Updated 2026-01-06; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/Latin-Audio
Description:
## Dataset Summary

Vox Classica is a Latin speech corpus of ~73 hours of audio, segmented by sentence into short clips. It is a large-scale, ML-ready dataset of human-read Classical Latin, designed to address the absence of a publicly available human-read Latin corpus large enough for model training.

- **Alignment and curation:** Kaiyuan Zhao
- **Language:** Latin (Classical)

## Uses

This dataset is built for training and evaluating speech processing models for Classical Latin. Its primary intended use is training and fine-tuning Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models.

## Dataset Structure

Each example in the dataset represents roughly 1-2 sentences and contains the following fields:

- **`transcription`**: A `string` containing the gold-standard Latin sentence. The text has been automatically macronized using the [CLTK macronizer](https://docs.cltk.org/en/latest/latin.html#macronizer) to provide correct vowel length information.
- **`audio`**: An mp3 feature containing the spoken version of the text.

## Dataset Creation

To split the long-form audio recordings and the corresponding gold-standard texts into short clips, the gold text was first segmented into individual sentences. To approximate the location of each sentence within the long audio file, an off-the-shelf (not fine-tuned) whisper-large-v3 model was used to generate a rough transcript of the entire audio. A fuzzy string matching algorithm then located each gold-standard sentence within the noisy Whisper transcript, providing estimated start and end timestamps. Finally, the curator personally verified and adjusted the segment endpoints so that they aligned precisely with the sentence text.
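The fuzzy-matching step described above can be illustrated with a minimal sketch. The dataset card does not name the matching algorithm or library that was actually used, so this example substitutes Python's standard-library `difflib.SequenceMatcher`. The function names (`normalize`, `locate_sentence`), the `slack` parameter, and the `(token, start_sec, end_sec)` tuple layout for word-level timestamps are all assumptions made for illustration, as is the sample data.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Return a list of lowercase tokens with punctuation and digits removed."""
    kept = "".join(ch for ch in text.lower() if ch.isalpha() or ch.isspace())
    return kept.split()


def locate_sentence(gold_sentence, words, slack=2):
    """Find the transcript window that best matches one gold sentence.

    words: hypothetical list of (token, start_sec, end_sec) tuples taken
    from a noisy ASR transcript with word-level timestamps.
    Returns (similarity, start_sec, end_sec) for the best window.
    """
    gold_tokens = normalize(gold_sentence)
    gold = " ".join(gold_tokens)
    n = len(gold_tokens)
    tokens = [" ".join(normalize(w)) for w, _, _ in words]

    best = (0.0, None, None)
    for i in range(len(words)):
        # Try windows slightly shorter and longer than the gold sentence,
        # since the ASR output may merge or split words.
        for length in range(max(1, n - slack), n + slack + 1):
            j = i + length
            if j > len(words):
                break
            candidate = " ".join(t for t in tokens[i:j] if t)
            score = SequenceMatcher(None, gold, candidate).ratio()
            if score > best[0]:
                best = (score, words[i][1], words[j - 1][2])
    return best


# Made-up noisy transcript: "galia" stands in for a misrecognized "Gallia".
words = [
    ("galia", 0.0, 0.5), ("est", 0.5, 0.7), ("omnis", 0.7, 1.2),
    ("divisa", 1.2, 1.8), ("in", 1.8, 1.9), ("partes", 1.9, 2.4),
    ("tres", 2.4, 2.9), ("quarum", 3.2, 3.7), ("unam", 3.7, 4.1),
    ("incolunt", 4.1, 4.8), ("belgae", 4.8, 5.4),
]

for sentence in ["Gallia est omnis divisa in partes tres.",
                 "Quarum unam incolunt Belgae."]:
    print(sentence, locate_sentence(sentence, words))
```

The timestamps produced this way are only estimates, which is consistent with the manual verification of segment endpoints described above. An exhaustive window search like this one is quadratic in transcript length; for hour-long recordings, restricting the search to a region after the previous match would likely be needed.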
## Credits and Acknowledgements

Audio sourced (with proper licensing or explicit informed permission) from:

- [Dickinson College Commentaries](https://dcc.dickinson.edu/)
- Western Washington University's [Nuntii Latini](https://nuntiilatini.com/)
- [Luke Ranieri](https://www.youtube.com/@ScorpioMartianus)
- Stefano Vittori
- [Latinum](https://www.youtube.com/channel/UCEekt9eu1g-yEGq6XUlRSIg)
- [Onagrus](https://www.youtube.com/@Onagrus-qf2ud)
- [Satura Lanx](https://www.youtube.com/c/SaturaLanx)
- [ThePrinceSterling](https://www.youtube.com/user/ThePrinceSterling)
- [Librivox](https://librivox.org/)

Additional thanks to Professor Christopher Francese, William Mattingly, and George Backhouse for their advice and mentorship throughout this process.

Provider:
maas
Created:
2025-10-13