
mt0rm0/movie_descriptors_small

Source: Hugging Face. Updated 2023-12-04, indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/mt0rm0/movie_descriptors_small
Resource description:
---
license: cc0-1.0
task_categories:
- sentence-similarity
language:
- en
pretty_name: '"Movie descriptors for Semantic Search"'
size_categories:
- 10K<n<100K
tags:
- movies
- embeddings
- semantic search
- films
- hpi
- workshop
---

# Dataset Card

This dataset is a subset of Kaggle's The Movies Dataset that contains only the name, release year and overview of some movies from the original dataset. It is intended as a toy dataset for learning about embeddings in a workshop from the AI Service Center Berlin-Brandenburg at the Hasso Plattner Institute.

A larger version of this dataset is available [here](https://huggingface.co/datasets/mt0rm0/movie_descriptors).

## Dataset Details

### Dataset Description

The dataset has 28655 rows and 3 columns:

- 'name': includes the title of the movies
- 'release_year': indicates the year of release
- 'overview': provides a brief description of each movie, used for advertisement.

The source dataset was filtered to keep only movies with complete metadata in the required fields, a vote average of at least 6, more than 100 votes and a revenue over 2 million dollars.

**Curated by:** [Mario Tormo Romero](https://huggingface.co/mt0rm0)

**Language(s) (NLP):** English

**License:** cc0-1.0

### Dataset Sources

This dataset is a subset of Kaggle's [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset). We used only the <kbd>movies_metadata.csv</kbd> file, extracted some features (see Dataset Description) and dropped the incomplete rows.

The original dataset has a cc0-1.0 license and we have maintained it.

## Uses

This is a toy dataset created for pedagogical purposes. It is used in the **Working with embeddings** workshop created and organized by the [AI Service Center Berlin-Brandenburg](https://hpi.de/kisz/) at the [Hasso Plattner Institute](https://hpi.de/).
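Since the card targets semantic search over the overview texts, the idea can be sketched with a toy example. This is only an illustration, not the workshop's method: the word-count vectors below are a stand-in for a real embedding model, the three rows are hypothetical and not taken from the dataset, and the function names `embed`, `cosine` and `search` are made up for this sketch.

```python
import math
import re
from collections import Counter

# Hypothetical rows in the card's three-column layout (not real dataset entries).
MOVIES = [
    {"name": "Example Heist", "release_year": 2001,
     "overview": "A crew of thieves plans a daring casino robbery."},
    {"name": "Example Voyage", "release_year": 2014,
     "overview": "Astronauts travel through space to find a new home for humanity."},
    {"name": "Example Romance", "release_year": 1995,
     "overview": "Two strangers meet on a train and fall in love in one night."},
]


def embed(text):
    """Toy 'embedding': a sparse word-count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, movies, top_k=1):
    """Return the top_k movies whose overview is most similar to the query."""
    q = embed(query)
    ranked = sorted(movies, key=lambda m: cosine(q, embed(m["overview"])), reverse=True)
    return ranked[:top_k]


best = search("astronauts travel through space", MOVIES)[0]
print(best["name"], best["release_year"])
```

With a real embedding model, `embed` would return a dense vector and queries could match overviews that share meaning but no words; the ranking step stays the same.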

## Dataset Creation

### Curation Rationale

With this dataset we want to provide a fast way of obtaining the data required for our workshops, without having to download huge datasets that contain far more information than needed.

### Source Data

Our source is Kaggle's The Movies Dataset, so the information comes from the MovieLens Dataset. The dataset consists of movies released in or before July 2017.

#### Data Collection and Processing

The data was downloaded from [Kaggle](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) as a zip file. The file <kbd>movies_metadata.csv</kbd> was then extracted.

The data was processed with the following code:

```python
import pandas as pd

# load the csv file
df = pd.read_csv("movies_metadata.csv", low_memory=False)

# filter movies according to:
#   - vote average of at least 6
#   - more than 100 votes
#   - revenue over 2M$
df = df.loc[(df.vote_average >= 6) & (df.vote_count > 100) & (df.revenue > 2e6)]

# select the required columns, drop rows with missing values and
# reset the index
df = df.loc[:, ['title', 'release_date', 'overview']]
df = df.dropna(axis=0).reset_index(drop=True)

# make a new column with the release year
df.loc[:, 'release_year'] = pd.to_datetime(df.release_date).dt.year

# select the columns in the desired order
df = df.loc[:, ['title', 'release_year', 'overview']]

# save the data to parquet
df.to_parquet('descriptors_data.parquet')
```

#### Who are the source data producers?

The source dataset is an ensemble of data collected by [Rounak Banik](https://www.kaggle.com/rounakbanik) from TMDB and GroupLens. In particular, the movies metadata has been collected from the TMDB Open API, but the source dataset is not endorsed or certified by TMDb.
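To see how the filter boundaries in the processing script behave, the same criteria can be applied to a few hand-made rows. The sketch below is a dependency-free re-statement of the logic using only the standard library rather than pandas; the sample rows and the helper names `keep` and `to_descriptor` are hypothetical, and it assumes ISO-formatted release dates and numeric rating fields.

```python
import csv
import io
from datetime import date

# Hypothetical rows mimicking the relevant movies_metadata.csv columns.
SAMPLE_CSV = """title,release_date,overview,vote_average,vote_count,revenue
Kept Movie,1999-03-31,A hacker learns the truth.,6.0,101,2000001
Low Rating,2005-06-01,Some plot.,5.9,5000,90000000
Few Votes,2010-01-15,Some plot.,7.5,100,90000000
Low Revenue,2012-09-09,Some plot.,8.0,900,2000000
No Overview,2015-05-05,,8.0,900,90000000
"""


def keep(row):
    """Mirror the card's criteria: rating >= 6, votes > 100, revenue > 2e6, no gaps."""
    complete = all(row[c] for c in ("title", "release_date", "overview"))
    return (complete
            and float(row["vote_average"]) >= 6
            and float(row["vote_count"]) > 100
            and float(row["revenue"]) > 2e6)


def to_descriptor(row):
    """Reduce a source row to title, release year and overview."""
    return {"title": row["title"],
            "release_year": date.fromisoformat(row["release_date"]).year,
            "overview": row["overview"]}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
descriptors = [to_descriptor(r) for r in rows if keep(r)]
print(descriptors)
```

Only the first row survives: a vote average of exactly 6 is kept because that comparison is inclusive, while exactly 100 votes or exactly 2 million in revenue is dropped because those comparisons are strict.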

This dataset is a subset of Kaggle's The Movies Dataset, containing 28655 rows of movie names, release years, and overviews. It is filtered to include only movies with complete metadata, a vote average of at least 6, more than 100 votes, and a revenue over 2 million dollars. It is used for learning about embeddings in workshops and aims to provide a quick way to obtain the necessary data without downloading large datasets with excessive information.
Provided by:
mt0rm0
Summary of the original information

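Dataset Card

Dataset Details

Dataset Description

This dataset is a subset extracted from Kaggle's The Movies Dataset, containing only the name, release year and overview of some movies. It has 28655 rows and 3 columns:

  • name: the title of the movie
  • release_year: the year the movie was released
  • overview: a brief description of the movie, used for advertisement

The source dataset was filtered to keep only movies with complete metadata in the required fields, a vote average of at least 6, more than 100 votes and a revenue over 2 million dollars.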
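The card lists the first column as `name`, while the processing script selects a column called `title`; which label the published file carries is not confirmed here. A small record type that accepts either key is one way to guard downstream code against that ambiguity. The class `MovieDescriptor` and the helper `from_row` are hypothetical names for this sketch, and the sample row is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MovieDescriptor:
    """One row of the dataset: title, release year and overview."""
    name: str
    release_year: int
    overview: str


def from_row(row):
    """Build a MovieDescriptor from a dict, accepting 'name' or 'title' as the title key."""
    title = row.get("name") or row.get("title")
    if not title or not row.get("overview") or row.get("release_year") is None:
        raise ValueError(f"incomplete row: {row!r}")
    # The year may arrive as a float (e.g. 1999.0), so normalise it to int.
    return MovieDescriptor(str(title), int(row["release_year"]), str(row["overview"]))


movie = from_row({"title": "Example Movie", "release_year": 1999.0,
                  "overview": "A short plot summary."})
print(movie)
```

Rows with a missing title, year or overview raise `ValueError`, which matches the card's statement that incomplete rows were dropped.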

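Language (NLP): English

License: cc0-1.0

Dataset Sources

This dataset is a subset of Kaggle's The Movies Dataset. We used only the <kbd>movies_metadata.csv</kbd> file, extracted some features (see Dataset Description) and dropped the incomplete rows.

The original dataset has a cc0-1.0 license and we have maintained it.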

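Dataset Creation

Data Collection and Processing

The data was downloaded from Kaggle as a zip file, and the <kbd>movies_metadata.csv</kbd> file was then extracted. The data was processed with the following code:

    import pandas as pd

    # load the csv file
    df = pd.read_csv("movies_metadata.csv", low_memory=False)

    # filter movies according to:
    #   - vote average of at least 6
    #   - more than 100 votes
    #   - revenue over 2 million dollars
    df = df.loc[(df.vote_average >= 6) & (df.vote_count > 100) & (df.revenue > 2e6)]

    # select the required columns, drop rows with missing values and
    # reset the index
    df = df.loc[:, ['title', 'release_date', 'overview']]
    df = df.dropna(axis=0).reset_index(drop=True)

    # make a new column with the release year
    df.loc[:, 'release_year'] = pd.to_datetime(df.release_date).dt.year

    # select the columns in the desired order
    df = df.loc[:, ['title', 'release_year', 'overview']]

    # save the data to parquet
    df.to_parquet('descriptors_data.parquet')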

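Source Data Producers

The source dataset is an ensemble of data collected by Rounak Banik from TMDB and GroupLens. In particular, the movies metadata was collected from the TMDB Open API, but the source dataset is not endorsed or certified by TMDb.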
