
Elyadata/Ara-Best-RQ_dataset

Hugging Face, updated 2026-01-22, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Elyadata/Ara-Best-RQ_dataset
Description:
---
license: cc-by-4.0
language:
- ar
multilinguality:
- monolingual
pretty_name: Ara-Best-RQ Dataset
size_categories:
- 1M<n<10M
---

# Ara-Best-RQ Dataset

## Dataset Summary

This dataset provides **metadata only** for a dialectal Arabic speech corpus constructed from publicly available YouTube videos.

It consists exclusively of **YouTube video identifiers** and **audio segment boundaries (start/end timestamps)** designed for **self-supervised speech representation learning**.

**No audio or video content is distributed as part of this dataset.**

---

## Dataset Statistics

- **Total spoken duration:** 5,639 h 04 min 27 s
- **Number of videos:** 41,826
- **Number of speech segments:** 3,860,419

---

## Supported Tasks

- Self-supervised speech representation learning
- Speech segmentation
- Arabic speech and dialect research

---

## Languages

- Arabic (`ar`)

---

## Dataset Structure

Each entry includes:

- A YouTube video identifier
- Segment start time (seconds)
- Segment end time (seconds)

Users are expected to retrieve the referenced media themselves and extract the corresponding audio segments.

---

## Licensing Information

### Metadata License

All metadata (video identifiers and segment boundaries) is released under:

**Creative Commons Attribution 4.0 International (CC BY 4.0)**

This license applies **only to the metadata** provided in this dataset.

### Referenced Media

The referenced audio/video content:

- Is **not distributed**
- Remains hosted on YouTube
- Is subject to the **original license selected by the content uploader**

Users are responsible for ensuring compliance with the original content licenses and YouTube’s Terms of Service.

---

## Citation

If you use this dataset, please cite the following paper:

> **Ara-Best-RQ: Multi Dialectal Arabic SSL**
> Haroun Elleuch, Ryan Whetten, Salima Mdhaffar, Yannick Estève, Fethi Bougares
> *Accepted at ICASSP 2026*

A BibTeX entry will be provided once the proceedings are published.

---

## Ethical Considerations

- This dataset references only publicly available content
- No personally identifying information is added or inferred
- No attempt is made to identify speakers

---

## Limitations

- Some referenced videos may become unavailable over time
- Audio quality and recording conditions vary
- Licensing information for individual videos is not exhaustively archived

---

## Contact

For questions, corrections, or takedown requests, please contact the dataset maintainer via the Hugging Face repository.
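
As a rough illustration of the entry layout described under Dataset Structure (video identifier, start time, end time), the sketch below parses a few entries and prepares one cut command per segment. The card does not state the file format or field names, so the CSV layout, the column names `video_id`, `start`, and `end`, the placeholder identifiers, and the 16 kHz mono output are all assumptions made for illustration. The sketch only builds an `ffmpeg` argument list and does not download anything; retrieving the media remains the user's responsibility.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    video_id: str
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start

    @property
    def url(self) -> str:
        # Standard YouTube watch URL built from the video identifier.
        return f"https://www.youtube.com/watch?v={self.video_id}"


def parse_segments(text: str) -> List[Segment]:
    """Parse CSV text with the hypothetical columns video_id,start,end."""
    segments = []
    for row in csv.DictReader(io.StringIO(text)):
        seg = Segment(row["video_id"], float(row["start"]), float(row["end"]))
        if seg.end <= seg.start:
            raise ValueError(f"empty or negative segment: {seg}")
        segments.append(seg)
    return segments


def ffmpeg_args(seg: Segment, source_path: str, out_path: str) -> List[str]:
    """Build (but do not run) an ffmpeg command that cuts one segment.

    Assumes the audio was already retrieved to source_path.
    16 kHz mono output is an assumption, not a requirement of the dataset.
    """
    return [
        "ffmpeg",
        "-ss", f"{seg.start:.3f}",   # seek to the segment start
        "-t", f"{seg.duration:.3f}",  # read only the segment duration
        "-i", source_path,
        "-vn",                        # drop any video stream
        "-ac", "1",                   # mono
        "-ar", "16000",               # 16 kHz sample rate
        out_path,
    ]


# Placeholder rows; the identifiers are not real video IDs.
SAMPLE = """video_id,start,end
VIDEO_ID_A,12.50,17.75
VIDEO_ID_A,30.00,36.20
VIDEO_ID_B,0.00,4.10
"""

if __name__ == "__main__":
    for seg in parse_segments(SAMPLE):
        print(seg.url, ffmpeg_args(seg, "source_audio.m4a", "segment.wav"))
```

For scale, the published totals (5,639 h 04 min 27 s over 3,860,419 segments) work out to a mean segment length of roughly 5.26 s, so short clips like the placeholder rows above are the typical case.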
Provider:
Elyadata