cinepile

ModelScope community · Updated 2026-04-28 · Indexed 2024-06-08
Download link:
https://modelscope.cn/datasets/swift/cinepile
Resource description:
# CinePile: A Long Video Question Answering Dataset and Benchmark

CinePile is a question-answering-based, long-form video understanding dataset. It was created using advanced large language models (LLMs) with a human-in-the-loop pipeline that leverages existing human-generated raw data. It consists of approximately 300,000 training data points and 5,000 test data points.

If you have any comments or questions, reach out to [Ruchit Rawal](https://ruchitrawal.github.io/) or [Gowthami Somepalli](https://somepago.github.io/).

Other links: [Website](https://ruchitrawal.github.io/cinepile/) | [Paper](https://arxiv.org/abs/2405.08813)

## Version support and revisions

- October 2024: We refined both the training and test splits using the adversarial refinement process described in detail [here](https://huggingface.co/blog/cinepile2). This refined version is loaded by default when running `load_dataset("tomg-group-umd/cinepile")`. To load the previous version, use `load_dataset("tomg-group-umd/cinepile", "v1")`.

## Dataset Structure

Each row in the dataset consists of a `question` (dtype: string), five `choices` (dtype: list), and an `answer_key` (dtype: string). Auxiliary columns store the movie's name, the movie's genre, video clip titles, etc.

The train split of the dataset is intended for the instruction tuning of video-LLMs. The test split is designed for benchmarking video-LLMs and includes the `hard_split` column, which is "True" for particularly challenging questions and "False" otherwise. The `visual_reliance` column indicates whether a question likely requires integrating visual information to be answered correctly.

### Dataset Features

- **movie_name**: Name of the movie to which the video clip belongs.
- **year**: Release year of the movie.
- **genre**: Genre(s) of the movie.
- **yt_clip_title**: Title of the video clip as it appears on YouTube.
- **yt_clip_link**: URL link to the video clip on YouTube.
- **movie_scene**: Description of the movie scene; contains subtitles and visual descriptions.
- **subtitles**: Subtitles extracted from the movie scene.
- **question**: Question derived from the movie scene.
- **choices**: Multiple-choice options associated with the question.
- **answer_key**: The correct answer from the choices provided.
- **answer_key_position**: The index position of the correct answer within the choices list.
- **question_category**: The category to which the question belongs.
- **hard_split**: Indicates if the question is particularly challenging. "N/A" for the train set; applicable only in the test set.
- **visual_reliance**: Indicates if the question requires visual information for an accurate answer. "N/A" for the train set.

## Dataset Use and Starter Snippets

### Loading the dataset

You can load the dataset easily using the Datasets library:

```
from datasets import load_dataset

dataset = load_dataset("tomg-group-umd/cinepile")
```

### Retrieving questions from a specific clip

```
from datasets import load_dataset

# token=True reuses the Hugging Face token stored by a prior login
cinepile_test = load_dataset('tomg-group-umd/cinepile', token=True, split='test')
yt_clip_title = "Extraction (2015) - You're Crazy Scene (5/10) | Movieclips"
clip_test_dataset = cinepile_test.filter(lambda x: x['yt_clip_title'] == yt_clip_title)
```

### Loading the hard-split:

```
from datasets import load_dataset

cinepile_test = load_dataset('tomg-group-umd/cinepile', token=True, split='test')
# hard_split is stored as the string "True" or "False", not as a boolean
hard_split_test = cinepile_test.filter(lambda x: x['hard_split'] == "True")
```

Please refer to the accompanying [Colab notebook](https://colab.research.google.com/drive/1jDwvPoCsg9tck3dFhVCV-h3Ny6992wCr?usp=sharing) for more examples, such as evaluating VLMs and extracting responses.

### Cite us:

```
@article{rawal2024cinepile,
  title={CinePile: A Long Video Question Answering Dataset and Benchmark},
  author={Rawal, Ruchit and Saifullah, Khalid and Basri, Ronen and Jacobs, David and Somepalli, Gowthami and Goldstein, Tom},
  journal={arXiv preprint arXiv:2405.08813},
  year={2024}
}
```
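The row layout described under Dataset Structure (a `question`, five `choices`, and an `answer_key`) lends itself to a lettered multiple-choice prompt. The following is a minimal sketch of one way to render such a prompt, not the authors' evaluation code. The helper names `format_prompt` and `correct_letter` are hypothetical, the sample row is made up for illustration, and the code assumes that `answer_key` holds the text of the correct choice.

```python
from string import ascii_uppercase


def format_prompt(row):
    """Render a question and its choices as a lettered multiple-choice prompt."""
    lines = [f"Question: {row['question']}"]
    for letter, choice in zip(ascii_uppercase, row["choices"]):
        lines.append(f"{letter}) {choice}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def correct_letter(row):
    """Return the letter of the correct choice, found by matching answer_key text."""
    # Assumption: answer_key is the text of the correct choice, so it can be
    # located in the choices list without relying on answer_key_position.
    return ascii_uppercase[row["choices"].index(row["answer_key"])]


# Made-up row for illustration only; it is not taken from the dataset.
example_row = {
    "question": "What does the character pick up from the table?",
    "choices": ["A key", "A letter", "A phone", "A glass", "A book"],
    "answer_key": "A phone",
}

prompt = format_prompt(example_row)
print(prompt)
print("Correct letter:", correct_letter(example_row))
```

Matching on the answer text avoids assuming whether `answer_key_position` is zero-based or one-based, which the description above does not state.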

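Because the test split carries a `hard_split` column with the string values "True" and "False", benchmark results can be reported separately for each subset. The sketch below shows one way to do that on plain Python dictionaries. The helper name `accuracy_by`, the sample rows, and the predictions are all hypothetical, and predictions are assumed to be answer texts comparable to `answer_key`.

```python
from collections import defaultdict


def accuracy_by(rows, predictions, column):
    """Group rows by `column` and return {value: fraction answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row, pred in zip(rows, predictions):
        key = row[column]
        total[key] += 1
        # Assumption: a prediction counts as correct when it equals answer_key.
        correct[key] += int(pred == row["answer_key"])
    return {key: correct[key] / total[key] for key in total}


# Made-up rows and predictions for illustration only.
rows = [
    {"answer_key": "A phone", "hard_split": "True"},
    {"answer_key": "A letter", "hard_split": "True"},
    {"answer_key": "A key", "hard_split": "False"},
    {"answer_key": "A book", "hard_split": "False"},
]
predictions = ["A phone", "A glass", "A key", "A book"]

result = accuracy_by(rows, predictions, "hard_split")
print(result)
```

The same helper can group by any other column, such as `question_category`, by changing the `column` argument.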
Provider:
maas
Created:
2024-06-05