iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/10515990
Resource description:
ABSTRACT
Rumble has emerged as a prominent platform hosting controversial figures facing restrictions on YouTube. Despite this, the academic community’s engagement with Rumble has been minimal. To help researchers address this gap, we introduce a comprehensive dataset of about 6.7K podcast videos from August 2020 to December 2022, amounting to over 5.6K hours of content. Besides covering metadata of these podcast videos, we provide speech-to-text transcriptions for future analysis. We also provide speaker diarization information, a collection of ~250K unique representative images from podcast videos, and face embeddings of ~400K extracted faces. With the rise of the influence of podcasts and populist figures, this dataset provides a rich resource for identifying challenges in cyber social threats in a relatively underexplored space.
Rumble platform: http://rumble.com/
Link to paper: https://workshop-proceedings.icwsm.org/abstract.php?id=2024_07
License: CC BY-NC-SA 4.0
Dataset Summary
iDRAMA-rumble-2024 is a large-scale dataset of 6,735 podcast videos from Rumble, an alternative, YouTube-like platform. Using state-of-the-art models, we extract information across three modalities: 1) text, 2) audio, and 3) video. The paper details the methodology for extracting information from podcast videos. We release a first-of-its-kind dataset that includes data from these modalities:
Metadata: Details about podcast videos, e.g., channel name, video name, video description, and more.
Text: Transcription (i.e., speech-to-text) of podcast videos.
Audio: Speaker diarization information providing speaker detection over time for each video.
Video: Sampled representative video frames from each video, totaling ~250K images. We also detect ~400K non-unique faces in these images and release their face embeddings.
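Because the detected faces are non-unique, a common first step is to compare face embeddings pairwise. The following is a minimal sketch of cosine similarity, assuming each embedding can be read as a plain list of floats. The toy 4-dimensional vectors are hypothetical stand-ins, not values from the dataset, and the paper should be consulted for the actual embedding model and dimensionality.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real face embeddings.
face_a = [0.1, 0.3, -0.2, 0.9]
face_b = [0.2, 0.6, -0.4, 1.8]   # same direction as face_a, twice the length
face_c = [-0.9, 0.2, 0.3, 0.1]   # orthogonal to face_a

print(cosine_similarity(face_a, face_b))
print(cosine_similarity(face_a, face_c))
```

Values close to 1 suggest two embeddings point in nearly the same direction. Any threshold for deciding that two faces belong to the same person would need to be tuned on the actual embeddings.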
Repository links
Zenodo: On Zenodo, we provide the dataset for all modalities as JSON-formatted files, along with the representative images in compressed archives.
GitHub: The main repository of this dataset, where we provide code snippets to get started.
Link here: https://github.com/idramalab/iDRAMA-rumble-2024
Hugging Face: On Hugging Face, we provide a version of the dataset in `parquet` format that can be accessed through the Hugging Face APIs.
Link here: https://hf.co/datasets/iDRAMALab/iDRAMA-rumble-2024
Dataset Info
The dataset is organized by modality: transcripts, representative images, speaker diarization, and face embeddings.
| Config | Data-points |
|---|---|
| Podcast videos | 6,735 |
| Representative images | 252,387 |
| Face embeddings | 399,333 |
| Transcripts & Speaker diarization | 6,735 |
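The counts above give a rough sense of how dense each modality is per video. The short calculation below is only a back-of-the-envelope sketch: it computes simple averages from the published totals and says nothing about how images or faces are distributed across individual videos.

```python
# Counts copied from the dataset info table.
podcast_videos = 6_735
representative_images = 252_387
face_embeddings = 399_333

# Simple averages; real per-video values will vary widely.
images_per_video = representative_images / podcast_videos
faces_per_image = face_embeddings / representative_images
faces_per_video = face_embeddings / podcast_videos

print(f"images per video: {images_per_video:.2f}")
print(f"faces per image:  {faces_per_image:.2f}")
print(f"faces per video:  {faces_per_video:.2f}")
```

On average this works out to about 37 representative images and about 59 detected faces per video, or roughly 1.6 faces per representative image.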
Zenodo Dataset Files Info
| Data type | #Files | File names |
|---|---|---|
| Metadata | 1 | iDRAMA-rumble-2024-metadata.ndjson |
| Speaker diarization | 1 | iDRAMA-rumble-2024-speaker-dirization.zip |
| Face embeddings | 1 | iDRAMA-rumble-2024-face-embeddings.ndjson |
| Representative images | 5 | iDRAMA-rumble-2024-repr-images-set1.tar.gz, iDRAMA-rumble-2024-repr-images-set2.tar.gz, iDRAMA-rumble-2024-repr-images-set3.tar.gz, iDRAMA-rumble-2024-repr-images-set4.tar.gz, iDRAMA-rumble-2024-repr-images-set5.tar.gz |
| Transcription Lite (Minimal information) | 3 | iDRAMA-rumble-2024-transcription-lite_part_1.ndjson, iDRAMA-rumble-2024-transcription-lite_part_2.ndjson, iDRAMA-rumble-2024-transcription-lite_part_3.ndjson |
| Transcription | 3 | iDRAMA-rumble-2024-transcription_part_1.ndjson, iDRAMA-rumble-2024-transcription_part_2.ndjson, iDRAMA-rumble-2024-transcription_part_3.ndjson |
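Since most of the Zenodo files are `.ndjson` and the transcripts are split into parts, it can help to stream records across all parts instead of loading everything into memory. The sketch below assumes only that each non-empty line holds one JSON object, as the NDJSON extension implies. The helper names and the local directory layout are hypothetical, and no record fields are assumed.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, List


def iter_ndjson(paths: Iterable[Path]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line across the given files."""
    for path in paths:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines, e.g. a trailing newline
                    yield json.loads(line)


def transcription_lite_parts(data_dir: Path) -> List[Path]:
    """Build the three transcription-lite part paths listed in the file table."""
    return [
        data_dir / f"iDRAMA-rumble-2024-transcription-lite_part_{i}.ndjson"
        for i in (1, 2, 3)
    ]
```

For example, `for record in iter_ndjson(transcription_lite_parts(Path("data"))): ...` would walk all three parts in order, assuming they were downloaded into a local `data` directory. Inspecting `record.keys()` on the first record is a quick way to discover the actual schema.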
Authorship
This dataset is published in the "Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media" hosted in Buffalo, NY, USA.
Academic Organization: iDRAMA Lab
Authors: Utkucan Balci, Jay Patel, Berkan Balci, Jeremy Blackburn
Affiliation: Binghamton University, Middle East Technical University
Licensing
This dataset is free to use under the terms of the non-commercial CC BY-NC-SA 4.0 license.
Citation
@article{balci2024idrama,
  title   = {iDRAMA-rumble-2024: A Dataset of Podcasts from Rumble Spanning 2020 to 2022},
  author  = {Balci, Utkucan and Patel, Jay and Balci, Berkan and Blackburn, Jeremy},
  year    = {2024},
  journal = {Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media}
}
Created:
2024-06-27



