
tomg-group-umd/cinepile

Source: Hugging Face | Updated: 2024-10-23 | Indexed: 2024-05-18
Download link:
https://hf-mirror.com/datasets/tomg-group-umd/cinepile
Resource description:
---
language:
- en
license: cc-by-nc-sa-4.0
size_categories:
- 100K<n<1M
task_categories:
- visual-question-answering
- video-text-to-text
dataset_info:
- config_name: default
  features:
  - name: movie_name
    dtype: string
  - name: year
    dtype: int64
  - name: genre
    sequence: string
  - name: yt_clip_title
    dtype: string
  - name: yt_clip_link
    dtype: string
  - name: movie_scene
    dtype: string
  - name: subtitles
    dtype: string
  - name: question
    dtype: string
  - name: choices
    sequence: string
  - name: answer_key
    dtype: string
  - name: answer_key_position
    dtype: int64
  - name: question_category
    dtype: string
  - name: hard_split
    dtype: string
  - name: visual_reliance
    dtype: string
  splits:
  - name: train
    num_bytes: 1207285134
    num_examples: 298888
  - name: test
    num_bytes: 18238920
    num_examples: 4941
  download_size: 58053414
  dataset_size: 1225524054
- config_name: v2
  default: True
  features:
  - name: movie_name
    dtype: string
  - name: year
    dtype: int64
  - name: genre
    sequence: string
  - name: yt_clip_title
    dtype: string
  - name: yt_clip_link
    dtype: string
  - name: movie_scene
    dtype: string
  - name: subtitles
    dtype: string
  - name: question
    dtype: string
  - name: choices
    sequence: string
  - name: answer_key
    dtype: string
  - name: answer_key_position
    dtype: int64
  - name: question_category
    dtype: string
  - name: hard_split
    dtype: string
  - name: visual_reliance
    dtype: string
  - name: videoID
    dtype: string
  splits:
  - name: train
    num_bytes: 1226448710
    num_examples: 298888
  - name: test
    num_bytes: 18430889
    num_examples: 4941
  download_size: 69504867
  dataset_size: 1244879599
configs:
- config_name: v1
  data_files:
  - split: train
    path: v1/train-*
  - split: test
    path: v1/test-*
- config_name: v2
  default: True
  data_files:
  - split: train
    path: v2/train-*
  - split: test
    path: v2/test-*
extra_gated_prompt: 'The CinePile dataset provides links to YouTube videos as part of its data collection. CinePile does not own any of the content linked within this dataset.
  Ownership and copyright of the videos belong to the respective YouTube channel owners. It is the responsibility of these source channels to ensure that all content follows the terms and conditions set by YouTube. By accessing this dataset, you acknowledge and agree that:'
extra_gated_fields:
  I understand that CinePile does not own the YouTube videos linked in this dataset: checkbox
  I agree to use this dataset for non-commercial use ONLY: checkbox
  I agree with the data license for this dataset: checkbox
---

# CinePile: A Long Video Question Answering Dataset and Benchmark

CinePile is a question-answering-based, long-form video understanding dataset. It was created using advanced large language models (LLMs) in a human-in-the-loop pipeline that leverages existing human-generated raw data. It consists of approximately 300,000 training data points and 5,000 test data points.

If you have any comments or questions, reach out to [Ruchit Rawal](https://ruchitrawal.github.io/) or [Gowthami Somepalli](https://somepago.github.io/).

Other links: [Website](https://ruchitrawal.github.io/cinepile/) &ensp; [Paper](https://arxiv.org/abs/2405.08813)

## Version support and revisions

- October 2024: We refined both the training and test splits using the adversarial refinement process described in detail [here](https://huggingface.co/blog/cinepile2). This refined version is loaded by default when running `load_dataset("tomg-group-umd/cinepile")`. To load the previous version, use `load_dataset("tomg-group-umd/cinepile", "v1")`.

## Dataset Structure

Each row in the dataset consists of a `question` (dtype: string), five `choices` (dtype: list), and an `answer_key` (dtype: string). Auxiliary columns store the movie's name, the movie's genre, the video clip title, etc. The train split of the dataset is intended for the instruction tuning of video-LLMs.
The test split is designed for benchmarking video-LLMs and includes the `hard_split` column, which is "True" for particularly challenging questions and "False" otherwise. The `visual_reliance` column indicates whether a question likely requires integrating visual information to be answered correctly.

### Dataset Features

- **movie_name**: Name of the movie to which the video clip belongs.
- **year**: Release year of the movie.
- **genre**: Genre(s) of the movie.
- **yt_clip_title**: Title of the video clip as it appears on YouTube.
- **yt_clip_link**: URL of the video clip on YouTube.
- **movie_scene**: Description of the movie scene, containing subtitles and visual descriptions.
- **subtitles**: Subtitles extracted from the movie scene.
- **question**: Question derived from the movie scene.
- **choices**: Multiple-choice options associated with the question.
- **answer_key**: The correct answer from the choices provided.
- **answer_key_position**: The index position of the correct answer within the choices list.
- **question_category**: The category to which the question belongs.
- **hard_split**: Indicates whether the question is particularly challenging. Applicable only to the test set; "N/A" for the train set.
- **visual_reliance**: Indicates whether the question requires visual information for an accurate answer. "N/A" for the train set.
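The column layout above suggests a simple way to turn a row into a multiple-choice prompt. The sketch below is illustrative only: `example_row` is a hypothetical row with invented values, and `format_prompt` and `answer_letter` are helper names introduced here, not part of the dataset or the accompanying notebook. It finds the correct option by matching `answer_key` against `choices` rather than using `answer_key_position`, because the card does not state whether that index is zero- or one-based.

```python
from string import ascii_uppercase

# Hypothetical row shaped like the columns listed above; all values are invented.
example_row = {
    "question": "Why does the character leave the room?",
    "choices": [
        "To answer the phone",
        "To find help",
        "To hide",
        "To get water",
        "To lock the door",
    ],
    "answer_key": "To find help",
}


def format_prompt(row):
    """Render the question followed by lettered options (A, B, C, ...)."""
    lines = [row["question"]]
    for letter, choice in zip(ascii_uppercase, row["choices"]):
        lines.append(f"{letter}) {choice}")
    return "\n".join(lines)


def answer_letter(row):
    """Map the answer text back to its option letter by searching `choices`."""
    return ascii_uppercase[row["choices"].index(row["answer_key"])]


print(format_prompt(example_row))
print("Correct option:", answer_letter(example_row))
```

Matching on the answer text assumes `answer_key` appears verbatim in `choices`, as the feature description implies; `list.index` raises `ValueError` if it does not, which makes a mismatch visible instead of silently mis-scoring a row.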
## Dataset Use and Starter Snippets

### Loading the dataset

You can load the dataset using the Datasets library:

```
from datasets import load_dataset

# Loads the default (refined) configuration with both splits.
dataset = load_dataset("tomg-group-umd/cinepile")
```

### Retrieving questions from a specific clip

```
from datasets import load_dataset

# token=True reuses the locally stored Hugging Face access token.
cinepile_test = load_dataset('tomg-group-umd/cinepile', token=True, split='test')

# Keep only the rows whose clip title matches exactly.
yt_clip_title = "Extraction (2015) - You're Crazy Scene (5/10) | Movieclips"
clip_test_dataset = cinepile_test.filter(lambda x: x['yt_clip_title'] == yt_clip_title)
```

### Loading the hard-split:

```
from datasets import load_dataset

cinepile_test = load_dataset('tomg-group-umd/cinepile', token=True, split='test')

# hard_split is stored as a string, so compare against "True" rather than a boolean.
hard_split_test = cinepile_test.filter(lambda x: x['hard_split'] == "True")
```

Please refer to the accompanying [Colab notebook](https://colab.research.google.com/drive/1jDwvPoCsg9tck3dFhVCV-h3Ny6992wCr?usp=sharing) for more examples, e.g., evaluating VLMs, extracting responses, etc.

### Cite us:

```
@article{rawal2024cinepile,
  title={CinePile: A Long Video Question Answering Dataset and Benchmark},
  author={Rawal, Ruchit and Saifullah, Khalid and Basri, Ronen and Jacobs, David and Somepalli, Gowthami and Goldstein, Tom},
  journal={arXiv preprint arXiv:2405.08813},
  year={2024}
}
```
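### Scoring predictions by category (illustrative sketch)

Because each row carries a `question_category` and an `answer_key`, benchmark results can be broken down per category. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: `accuracy_by_category` is a hypothetical helper, the rows and category names (`category_a`, `category_b`) are invented placeholders, and predictions are assumed to be answer texts that can be compared directly with `answer_key`.

```python
from collections import defaultdict


def accuracy_by_category(rows, predictions):
    """Return {question_category: fraction of exact-match correct predictions}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row, predicted in zip(rows, predictions):
        category = row["question_category"]
        total[category] += 1
        # Exact string match against the stored answer text.
        correct[category] += int(predicted == row["answer_key"])
    return {category: correct[category] / total[category] for category in total}


# Hypothetical rows; only the two columns used for scoring are included.
rows = [
    {"question_category": "category_a", "answer_key": "To find help"},
    {"question_category": "category_a", "answer_key": "At night"},
    {"question_category": "category_b", "answer_key": "The detective"},
]
# Hypothetical model outputs, one per row.
predictions = ["To find help", "At noon", "The detective"]

print(accuracy_by_category(rows, predictions))
```

The same function accepts rows from a filtered split (for example, the hard-split subset shown earlier), since it only reads the `question_category` and `answer_key` fields of each row.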
Provider:
tomg-group-umd