
AI Blob Dataset: Transcribed and Semantically Embedded Italian Television Archive (Video Metadata, Sentence Annotations, Vector Store)

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/15071950
Resource description:
Dataset for the paper "AI Blob! LLM-Driven Recontextualization of Italian Television Archives", presented at the Media Mutations 16 International Conference, "Unlocking Television Archives in the Digital Era" (Bologna, Dipartimento delle Arti, Palazzo Marescotti, May 26th-27th, 2025). The conference was organised by Luca Barra, Matteo Marinello, Emiliano Rossi (Università di Bologna), Susanne Eichner (Filmuniversität Babelsberg Konrad Wolf, Potsdam) and Anne-Katrin Weber (Université de Lausanne).

Description for Zenodo Dataset

This dataset accompanies the AI Blob! project and is intended to support research in AI-assisted archival studies, television historiography, and computational media analysis. It consists of three primary components derived from a curated collection of Italian television footage and processed through a semantic retrieval pipeline employing automatic speech recognition (ASR) and vector-based embedding.

1. Video Metadata and Source List

The first component is a structured list of 1,547 unique videos sourced from two publicly accessible repositories:
- The ITTV dataset (Mezza et al. 2023), originally collected for automatic television genre classification.
- The Indimenticabile TV YouTube channel, which archives classic Italian television clips.

To ensure the consistency, accessibility, and linguistic relevance of the dataset, a two-step filtering process was applied to the raw video list:
- Link Validation: All videos that were no longer available on YouTube at the time of collection (January 2025) were removed from the dataset, so that each entry pointed to an active and publicly accessible video URL as of that date.
- Language Filtering: To retain only Italian-language content, we applied an automatic language classification process using fastText, a library for text classification. The model analyzed a representative text sample (e.g., title and transcript, if available) from each video. Videos in which Italian was not the dominant language were excluded.
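The two-step filtering described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `filter_videos`, the `title` and `transcript` keys, and the two injected callables are assumptions made for the example, and the toy stand-ins below replace the real availability check and the real fastText classifier.

```python
def filter_videos(videos, is_available, detect_language, target_lang="it"):
    """Keep only videos that pass link validation and language filtering.

    `is_available` and `detect_language` are hypothetical callables supplied
    by the caller; they are not part of the published dataset.
    """
    kept = []
    for video in videos:
        # Step 1: link validation (drop entries whose URL is no longer reachable).
        if not is_available(video["url"]):
            continue
        # Step 2: language filtering on a representative text sample
        # (title plus transcript, when present).
        parts = (video.get("title"), video.get("transcript"))
        sample = " ".join(p for p in parts if p)
        if detect_language(sample) != target_lang:
            continue
        kept.append(video)
    return kept


# Toy stand-ins so the sketch runs without network access or a model file.
raw = [
    {"video_id": "a1", "url": "https://example.org/a1", "title": "Telegiornale della sera"},
    {"video_id": "b2", "url": "https://example.org/b2", "title": "Evening news"},
    {"video_id": "c3", "url": "https://example.org/gone", "title": "Varietà del sabato"},
]

def toy_is_available(url):
    return not url.endswith("/gone")

def toy_detect_language(text):
    return "en" if "news" in text.lower() else "it"

kept = filter_videos(raw, toy_is_available, toy_detect_language)
print([v["video_id"] for v in kept])
```

In a real run, `detect_language` could wrap a fastText language-identification model. The description does not say which model file was used, so that choice is left open here.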
Each video is described using standardized metadata in JSON format, including:
- video_id: the YouTube identifier
- url: direct link to the original source
- duration: video length in seconds
- genre: genre classification (e.g., "news", "music", "talk show")
- channel_name: name of the YouTube channel hosting the content

This metadata serves as the foundation for indexing and organizing the audiovisual corpus.

2. Sentence-Level Transcriptions with Time-Aligned Metadata

The second component contains sentence-level transcriptions extracted using the WhisperX ASR model, which provides high-accuracy transcription along with word-level timestamping. Each sentence is represented in a structured JSON format with the following fields:
- sentence_number: the sentence's index within the source video
- sentence: the transcribed text in Italian
- start_time and end_time: sentence-level timestamps (in seconds)
- duration: sentence duration
- words: a list of word-level entries with start/end timings
- video_id, genre, url, channel_name: inherited video metadata

In total, the dataset includes 212,696 individual sentences, enabling fine-grained semantic search and narrative recombination across a diverse range of television genres.

3. ChromaDB Vector Store for Semantic Retrieval

The third component is a compressed archive containing a ChromaDB vector store. This database was constructed using the Embed Multilingual V3 model (Cohere) to generate dense vector representations of each transcribed sentence.
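As a rough sketch of how a sentence record from component 2 might be prepared for a vector store, the example below splits a record into the text to embed (`page_content`) and a flat `metadata` dictionary. Chroma metadata values are generally restricted to scalars, so the nested `words` list is serialized to a JSON string. The function name `sentence_to_document`, the sample record, and the keys inside each word entry (`word`, `start`, `end`) are assumptions for illustration only.

```python
import json


def sentence_to_document(record):
    """Split one sentence record into embedding text and flat metadata."""
    metadata = {
        "video_id": record["video_id"],
        "genre": record["genre"],
        "url": record["url"],
        "sentence_number": record["sentence_number"],
        "start_time": record["start_time"],
        "end_time": record["end_time"],
        # Nested word-level alignments are stored as a JSON string so that
        # every metadata value stays a scalar.
        "words": json.dumps(record["words"], ensure_ascii=False),
    }
    return {"page_content": record["sentence"], "metadata": metadata}


# Hypothetical record following the field list above (values are made up).
record = {
    "sentence_number": 3,
    "sentence": "Buonasera a tutti.",
    "start_time": 12.4,
    "end_time": 13.9,
    "duration": 1.5,
    "words": [
        {"word": "Buonasera", "start": 12.4, "end": 13.0},
        {"word": "a", "start": 13.1, "end": 13.2},
        {"word": "tutti.", "start": 13.3, "end": 13.9},
    ],
    "video_id": "abc123",
    "genre": "news",
    "url": "https://www.youtube.com/watch?v=abc123",
    "channel_name": "Example channel",
}

doc = sentence_to_document(record)
# The word-level alignments can be recovered by parsing the string again.
restored_words = json.loads(doc["metadata"]["words"])
```

Consumers of the store would then call `json.loads` on the serialized field whenever they need word-level timings, for example to trim a clip to a phrase.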
Each document in the vector store is built with the following structure:
- page_content: the sentence text
- metadata: a JSON object including the sentence's video ID, genre, timestamps, YouTube URL, sentence number, and a serialized list of word-level alignments

The vector store has been initialized with these values:

    collection_metadata={
        "hnsw:space": "cosine",
        "hnsw:search_ef": 200,
        "hnsw:M": 30
    }

This vector store allows for efficient semantic retrieval using similarity-based querying, enabling applications in:
- automatic thematic montage construction
- retrieval-augmented generation (RAG)
- AI-driven media historiography
- creative recombination and recontextualization of archival material

Use Cases and Applications

The dataset supports a wide range of interdisciplinary research applications at the intersection of:
- Media and television studies (e.g., the reconstruction of editorial strategies like those used in Blob)
- Natural language processing (e.g., sentence segmentation, irony detection)
- Digital humanities (e.g., AI-assisted archival practices)
- Multimodal AI research (e.g., future extensions involving image or audio embeddings)

All data is published under fair use principles for research and educational purposes. No video content is redistributed directly; only metadata and timestamped transcriptions are provided, along with links to publicly accessible YouTube videos.
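To make the similarity-based querying concrete, the sketch below ranks a few documents by cosine distance, the metric selected by "hnsw:space": "cosine". It is an exact brute-force stand-in for illustration: the actual store uses an approximate HNSW index, and the three-dimensional vectors and sentences here are invented (real sentence embeddings are much higher-dimensional). The helper names `cosine_distance` and `top_k` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, documents, k=2):
    """Exact nearest neighbours by cosine distance (smallest distance first)."""
    ranked = sorted(
        documents,
        key=lambda d: cosine_distance(query_vec, d["embedding"]),
    )
    return ranked[:k]


# Made-up documents with toy embeddings.
docs = [
    {"page_content": "Buonasera dal telegiornale.", "embedding": [0.9, 0.1, 0.0]},
    {"page_content": "E ora la musica.", "embedding": [0.0, 0.2, 0.9]},
    {"page_content": "Le notizie di oggi.", "embedding": [0.8, 0.3, 0.1]},
]

query = [1.0, 0.2, 0.0]  # stands in for the embedding of a query sentence
hits = top_k(query, docs, k=2)
for hit in hits:
    print(hit["page_content"])
```

The two news-like sentences rank ahead of the music sentence because their toy vectors point in nearly the same direction as the query. An HNSW index trades a small amount of this exactness for much faster lookups over the 212,696 sentences.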
Created:
2025-03-24