
USPDATRO: Underrepresented Speech Dataset from Romanian language Open Data

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/7898232
Description:
USPDATRO
==========

Underrepresented Speech Dataset from Open Data: Case Study on the Romanian Language (USPDATRO) is a manually created Romanian language speech corpus. It was created specifically using speech types that are underrepresented in other speech datasets. Sources for this dataset are represented by open data available on multimedia platforms under a Creative Commons license. The data was manually transcribed and aligned at segment level. In addition to the text and audio files, we offer text annotations (lemmatization, part of speech tags, dependency parsing) in CoNLL-U Plus format. Each datasource is mentioned by URL in the metadata.csv file with associated license (a Creative Commons variant).

Dataset structure:
- audio: Folder with audio segments in WAV format
- text: Folder with corresponding transcriptions
- conllup: Folder with corresponding token-based annotations
- metadata.csv: Contains information about each segment

LICENSING

This work (transcriptions, alignment, metadata, annotations) is provided under the license CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike 4.0 International). The license can be viewed online here: https://creativecommons.org/licenses/by-nc-sa/4.0/ and the full text here: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode . The original works considered for audio sources are available under their respective licenses (Creative Commons variants) as described in the metadata.csv file.

CONTACT

Research Institute for Artificial Intelligence "Mihai Drăgănescu", Romanian Academy
Web: http://www.racai.ro
Contact emails: vasile@racai.ro
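The dataset structure described above (audio, text and conllup folders plus metadata.csv) could be walked roughly as follows. This is a minimal sketch, not the authors' tooling, and it rests on several assumptions that the description does not confirm: that the three files belonging to one segment share a base file name, that metadata.csv is comma-separated UTF-8 with a header row, and that the annotation files begin with the `# global.columns = ...` line that the CoNLL-U Plus format defines. The function names `index_segments`, `read_metadata` and `read_conllup` are hypothetical.

```python
import csv
from pathlib import Path


def index_segments(root):
    """Group files from audio/, text/ and conllup/ by file stem.

    Assumption: the files of one segment share the same base name.
    """
    root = Path(root)
    segments = {}
    for folder in ("audio", "text", "conllup"):
        directory = root / folder
        if not directory.is_dir():
            continue  # tolerate a partially downloaded dataset
        for path in sorted(directory.iterdir()):
            if path.is_file():
                segments.setdefault(path.stem, {})[folder] = path
    return segments


def read_metadata(root):
    """Return metadata.csv rows as dicts keyed by the file's own header.

    Assumption: comma-separated, UTF-8, first row is the header.
    No column names are hard-coded because the description lists none.
    """
    with open(Path(root) / "metadata.csv", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def read_conllup(path):
    """Parse a CoNLL-U Plus file into sentences of {column: value} dicts."""
    columns, sentences, current = None, [], []
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if line.startswith("# global.columns ="):
                # Column names are declared once at the top of the file.
                columns = line.split("=", 1)[1].split()
            elif not line.strip():
                # A blank line closes the current sentence.
                if current:
                    sentences.append(current)
                    current = []
            elif not line.startswith("#"):
                fields = line.split("\t")
                # Fall back to positional keys if no header was seen.
                names = columns or [str(i) for i in range(len(fields))]
                current.append(dict(zip(names, fields)))
    if current:
        sentences.append(current)
    return sentences
```

Calling `index_segments` on the unpacked dataset folder returns a dictionary from segment name to the paths found for it, which makes it easy to spot segments that lack a transcription or an annotation file. If the real files turn out to follow a different naming scheme, the stem-based grouping is the part to adjust.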
Created:
2023-05-06