mt0rm0/movie_descriptors

Name: mt0rm0/movie_descriptors
Creator: mt0rm0
Published: 2023-12-04 16:46:54
License: cc0-1.0

Hugging Face · Updated 2023-12-04 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/mt0rm0/movie_descriptors


Resource description:

---
license: cc0-1.0
task_categories:
- sentence-similarity
language:
- en
pretty_name: '"Movie descriptors for Semantic Search"'
size_categories:
- 10K<n<100K
tags:
- movies
- embeddings
- semantic search
- films
- hpi
- workshop
---

# Dataset Card

This dataset is a subset of Kaggle's The Movies Dataset that contains only the name, release year and overview of every film in the original dataset for which that information is complete. It is intended as a toy dataset for learning about embeddings in a workshop from the AI Service Center Berlin-Brandenburg at the Hasso Plattner Institute.

A smaller version of this dataset is available [here](https://huggingface.co/datasets/mt0rm0/movie_descriptors_small).

## Dataset Details

### Dataset Description

The dataset has 44435 rows and 3 columns:

- 'name': the title of the movie
- 'release_year': the year of release
- 'overview': a brief description of the movie, used for advertisement

**Curated by:** [Mario Tormo Romero](https://huggingface.co/mt0rm0)

**Language(s) (NLP):** English

**License:** cc0-1.0

### Dataset Sources

This dataset is a subset of Kaggle's [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset). We used only the <kbd>movies_metadata.csv</kbd> file, extracted some features (see Dataset Description) and dropped the rows that were incomplete. The original dataset has a cc0-1.0 license, which we have kept.

## Uses

This is a toy dataset created for pedagogical purposes. It is used in the **Working with embeddings** workshop created and organized by the [AI Service Center Berlin-Brandenburg](https://hpi.de/kisz/) at the [Hasso Plattner Institute](https://hpi.de/).

## Dataset Creation

### Curation Rationale

With this dataset we want to provide a fast way of obtaining the data required for our workshops, without having to download huge datasets that contain far more information than needed.
### Source Data

Our source is Kaggle's The Movies Dataset, so the information comes from the MovieLens Dataset. The dataset consists of movies released on or before July 2017.

#### Data Collection and Processing

The data was downloaded from [Kaggle](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) as a zip file. The file <kbd>movies_metadata.csv</kbd> was then extracted.

The data was processed with the following code:

```python
import pandas as pd

# load the csv file
df = pd.read_csv("movies_metadata.csv", low_memory=False)

# select the required columns, drop rows with missing values and
# reset the index
df = df.loc[:, ['title', 'release_date', 'overview']]
df = df.dropna(axis=0).reset_index(drop=True)

# make a new column with the release year
df.loc[:, 'release_year'] = pd.to_datetime(df.release_date).dt.year

# select the columns in the desired order
df = df.loc[:, ['title', 'release_year', 'overview']]

# save the data to parquet
df.to_parquet('descriptors_data.parquet')
```

#### Who are the source data producers?

This dataset is an ensemble of data collected by [Rounak Banik](https://www.kaggle.com/rounakbanik) from TMDB and GroupLens. In particular, the movie metadata was collected from the TMDB Open API, but the source dataset is not endorsed or certified by TMDB.

Provided by:

mt0rm0

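Summary of Original Information

Dataset Card

Dataset Details

Dataset Description

This dataset is a subset extracted from Kaggle's The Movies Dataset, containing only each movie's name, release year and overview. It has 44435 rows and 3 columns:

name: the title of the movie
release_year: the year the movie was released
overview: a brief description of the movie, used for advertisement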
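
As a rough illustration of how these three columns might be used for the semantic search the dataset is tagged for, the sketch below ranks overviews against a query. Everything in it is an assumption made for illustration: the rows are hypothetical rather than taken from the dataset, and a bag-of-words count vector stands in for a real embedding model, which a workshop would normally use instead.

```python
import math
from collections import Counter

# Hypothetical rows shaped like the dataset (name, release_year, overview).
# These are invented examples, not actual records from the dataset.
rows = [
    {"name": "Space Quest", "release_year": 1999,
     "overview": "An astronaut crew travels to a distant planet"},
    {"name": "Kitchen Tales", "release_year": 2005,
     "overview": "A young chef opens a restaurant in Paris"},
    {"name": "Deep Orbit", "release_year": 2012,
     "overview": "A lone astronaut is stranded in orbit around a planet"},
]


def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, rows):
    """Return rows sorted from most to least similar overview."""
    q = bag_of_words(query)
    return sorted(rows,
                  key=lambda r: cosine(q, bag_of_words(r["overview"])),
                  reverse=True)


ranked = search("astronaut on a planet", rows)
print([r["name"] for r in ranked])
```

Swapping `bag_of_words` for a sentence-embedding model would keep the ranking logic unchanged, since only the vector representation differs.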

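Language (NLP): English

License: cc0-1.0

Dataset Sources

This dataset is a subset of Kaggle's The Movies Dataset. Only the <kbd>movies_metadata.csv</kbd> file was used; some features were extracted (see Dataset Description) and incomplete rows were dropped.

The original dataset has a cc0-1.0 license, which has been kept.

Dataset Creation

Data Collection and Processing

The data was downloaded from Kaggle as a zip file, and the <kbd>movies_metadata.csv</kbd> file was then extracted.

The data was processed with the following code: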

    import pandas as pd

    # load the csv file
    df = pd.read_csv("movies_metadata.csv", low_memory=False)

    # select the required columns, drop rows with missing values and
    # reset the index
    df = df.loc[:, ['title', 'release_date', 'overview']]
    df = df.dropna(axis=0).reset_index(drop=True)

    # make a new column with the release year
    df.loc[:, 'release_year'] = pd.to_datetime(df.release_date).dt.year

    # select the columns in the desired order
    df = df.loc[:, ['title', 'release_year', 'overview']]

    # save the data to parquet
    df.to_parquet('descriptors_data.parquet')
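
The filtering steps above can also be sketched with the standard library alone, which may help when checking the logic on a small sample without pandas. This is a minimal sketch under stated assumptions, not the curator's method: the sample CSV is hypothetical, `extract_descriptors` is an illustrative name, dates are assumed to be in `YYYY-MM-DD` form, and whitespace-only fields are treated as missing, which is slightly stricter than `dropna()`.

```python
import csv
import io

# Hypothetical sample mimicking the three relevant columns of
# movies_metadata.csv; the rows are invented for illustration.
SAMPLE_CSV = """title,release_date,overview
Toy Story,1995-10-30,A cowboy doll feels threatened by a new spaceman toy.
Untitled,,A film with no release date.
Heat,1995-12-15,
"""


def extract_descriptors(csv_text):
    """Keep rows where title, release_date and overview are all present."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        title = (record.get("title") or "").strip()
        date = (record.get("release_date") or "").strip()
        overview = (record.get("overview") or "").strip()
        if not (title and date and overview):
            continue  # skip incomplete rows, similar in spirit to dropna()
        rows.append({
            "title": title,
            # assumes an ISO-style date whose first four characters are the year
            "release_year": int(date[:4]),
            "overview": overview,
        })
    return rows


descriptors = extract_descriptors(SAMPLE_CSV)
print(descriptors)
```

Only the first sample row survives, because the other two each lack one of the three required fields.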

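Source Data Producers

This dataset is an ensemble of data collected by Rounak Banik from TMDB and GroupLens. In particular, the movie metadata was collected from the TMDB Open API, but the source dataset is not endorsed or certified by TMDB.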
