
itzune/antton-dataset

Hugging Face | Updated 2026-03-16 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/itzune/antton-dataset
Resource overview:
---
license: cc-by-4.0
language:
- eu
pretty_name: Antton Dataset
size_categories:
- 10k<n<100k
task_categories:
- text-to-speech
- automatic-speech-recognition
tags:
- audio
- TTS
- Basque
- Aholab
- Ilenia
- synthetic
- common-voice
base_model:
- itzune/antton-tts
---

# Antton Dataset (Synthetic)

This is a large-scale **synthetic speech corpus** designed for training and fine-tuning Basque Text-to-Speech (TTS) models. It consists of **99,996 audio files** synthesized from the "Antton" voice model.

This dataset was generated by **Itzune** and serves as the primary source for training the [itzune/antton-tts (Piper version)](https://huggingface.co/itzune/antton-tts) model.

## Dataset Structure

Due to the large volume of data (approx. 100,000 files), the dataset is organized in the **WebDataset** format. The audio files are bundled into `.tar` shards to optimize storage, I/O performance, and streaming.

### Files

- **data/**: Directory containing the `.tar` shards.
- **metadata.csv**: The main metadata file using `|` as a delimiter:
  - `file_name`: The name of the audio file (e.g., `audio_1.wav`).
  - `transcription`: The corresponding Basque text.

## Technical Specifications

- **Audio Format:** WAV (PCM)
- **Sample Rate:** 22050 Hz
- **Language:** Basque (eu)
- **Voice Profile:** Antton (Male)
- **Text Source:** [Mozilla Common Voice - Basque Sentence Collection](https://datacollective.mozillafoundation.org/datasets/cmj8u3p2v007tnxxbk5ng5qvh)
- **Generation Method:** Synthesized using VITS-based architecture.

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("itzune/antton-dataset", streaming=True)
sample = next(iter(dataset["train"]))
print(f"Text: {sample['transcription']}")
```

## Credits and Licensing

### Source and Methodology

This is a synthetic dataset generated by Itzune. The synthesis process involved:

- **Text Acquisition**: Sentences were sourced from the Mozilla Common Voice project (Basque sentence collection).
- **Audio Synthesis**: The audio was produced using the aHoTTS synthesis tools and the pre-trained Antton (VITS) model developed by HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory.

### Acknowledgments

- Mozilla Common Voice: For providing the community-driven sentence collection.
- HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory: For the underlying synthesis technology and the Antton voice model.
- Project ILENIA: The original Maider voice resource was developed with funding from Project ILENIA.

### License

- Dataset Content (Audio & Text): Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
- Original Tools/Code: The aHoTTS tools used to generate this data are licensed under the Apache License 2.0.

## Citation

If you use this dataset, please cite the original work from HiTZ/Aholab:

> García, V., Hernáez, I., & Navas, E. (2022). Evaluation of Tacotron Based Synthesizers for Spanish and Basque. Applied Sciences, 12(3), 1686. https://doi.org/10.3390/app12031686
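
As a companion to the `metadata.csv` layout described under Files, the following is a minimal sketch of reading the pipe-delimited metadata with Python's standard `csv` module. It assumes the file begins with a header row naming the `file_name` and `transcription` columns and that quote characters inside transcriptions are literal text; neither is confirmed by the dataset card. The helper `read_metadata` and the two sample rows are hypothetical and only illustrate the format.

```python
import csv
import io


def read_metadata(lines):
    """Parse pipe-delimited rows into a {file_name: transcription} dict.

    Assumes a header row with `file_name` and `transcription` columns.
    """
    # QUOTE_NONE treats quote characters as ordinary text (an assumption
    # about how the file was written).
    reader = csv.DictReader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    return {row["file_name"]: row["transcription"] for row in reader}


# Hypothetical sample rows, not taken from the actual dataset.
sample = io.StringIO(
    "file_name|transcription\n"
    "audio_1.wav|Kaixo, zer moduz?\n"
    "audio_2.wav|Egun on guztioi.\n"
)

metadata = read_metadata(sample)
print(metadata["audio_1.wav"])
```

To run this against a downloaded copy, pass a file object instead of the in-memory sample, for example `open("metadata.csv", encoding="utf-8", newline="")`.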
Provider:
itzune