
mt0rm0/movie_descriptors

Source: Hugging Face | Updated: 2023-12-04 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/mt0rm0/movie_descriptors
Resource description:
---
license: cc0-1.0
task_categories:
- sentence-similarity
language:
- en
pretty_name: '"Movie descriptors for Semantic Search"'
size_categories:
- 10K<n<100K
tags:
- movies
- embeddings
- semantic search
- films
- hpi
- workshop
---

# Dataset Card

This dataset is a subset of Kaggle's The Movies Dataset that contains only the name, release year and overview of every film in the original dataset for which that information is complete.

It is intended as a toy dataset for learning about embeddings in a workshop from the AI Service Center Berlin-Brandenburg at the Hasso Plattner Institute.

This dataset has a smaller version [here](https://huggingface.co/datasets/mt0rm0/movie_descriptors_small).

## Dataset Details

### Dataset Description

The dataset has 44435 rows and 3 columns:

- 'name': includes the title of the movies
- 'release_year': indicates the year of release
- 'overview': provides a brief description of each movie, used for advertisement.

**Curated by:** [Mario Tormo Romero](https://huggingface.co/mt0rm0)

**Language(s) (NLP):** English

**License:** cc0-1.0

### Dataset Sources

This dataset is a subset of Kaggle's [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset). We used only the <kbd>movies_metadata.csv</kbd> file, extracted some features (see Dataset Description) and dropped the rows that were not complete.

The original dataset has a cc0-1.0 license, and we have maintained it.

## Uses

This is a toy dataset created for pedagogical purposes. It is used in the **Working with embeddings** workshop created and organized by the [AI Service Center Berlin-Brandenburg](https://hpi.de/kisz/) at the [Hasso Plattner Institute](https://hpi.de/).

## Dataset Creation

### Curation Rationale

With this dataset we want to provide a fast way of obtaining the data required for our workshops, without having to download huge datasets that contain far more information than needed.

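
Because the card describes this as a toy dataset for learning about embeddings and semantic search, a minimal sketch of a similarity search over the overview column may help. It deliberately swaps in a plain bag-of-words cosine similarity for a real embedding model, so it only illustrates the ranking step. The function names and the three sample rows are invented for illustration and are not taken from the dataset.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, movies, top_k=2):
    """Rank movies by similarity between the query and each overview."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(m["overview"])), m["name"]) for m in movies]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Invented rows using the three columns the card describes.
movies = [
    {"name": "Toy Story", "release_year": 1995,
     "overview": "A cowboy doll is threatened by a new spaceman toy."},
    {"name": "Jumanji", "release_year": 1995,
     "overview": "Two children find a magical board game."},
    {"name": "Heat", "release_year": 1995,
     "overview": "A detective hunts a crew of professional bank robbers."},
]

results = search("bank robbers and a detective", movies, top_k=1)
print(results)
```

In a workshop setting, `bag_of_words` would typically be replaced by a sentence-embedding model, while the cosine ranking stays the same.
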
### Source Data

Our source is Kaggle's The Movies Dataset, so the information comes from the MovieLens Dataset. The dataset consists of movies released on or before July 2017.

#### Data Collection and Processing

The data was downloaded from [Kaggle](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) as a zip file. The file <kbd>movies_metadata.csv</kbd> was then extracted.

The data was processed with the following code:

```python
import pandas as pd

# load the csv file
df = pd.read_csv("movies_metadata.csv", low_memory=False)

# select the required columns, drop rows with missing values and
# reset the index
df = df.loc[:, ['title', 'release_date', 'overview']]
df = df.dropna(axis=0).reset_index(drop=True)

# make a new column with the release year
df.loc[:, 'release_year'] = pd.to_datetime(df.release_date).dt.year

# select the columns in the desired order
df = df.loc[:, ['title', 'release_year', 'overview']]

# save the data to parquet
df.to_parquet('descriptors_data.parquet')
```

#### Who are the source data producers?

This dataset is an ensemble of data collected by [Rounak Banik](https://www.kaggle.com/rounakbanik) from TMDB and GroupLens. In particular, the movies metadata has been collected from the TMDB Open API, but the source dataset is not endorsed or certified by TMDb.
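
As a rough check of the processing steps listed under Data Collection and Processing, the sketch below applies the same select, drop and year-extraction logic to a tiny in-memory CSV instead of the full file. The helper name `extract_descriptors` and the sample rows are made up for illustration; they are not part of the original workflow or data.

```python
import io

import pandas as pd

# Made-up sample shaped like movies_metadata.csv (only the columns used here).
SAMPLE_CSV = """title,release_date,overview
Toy Story,1995-10-30,A cowboy doll feels threatened by a new toy.
Missing Overview,2001-01-01,
,1999-03-31,A hacker learns the truth about reality.
The Matrix,1999-03-31,A hacker learns the truth about reality.
"""


def extract_descriptors(csv_source):
    """Hypothetical helper that mirrors the card's steps on any CSV source."""
    df = pd.read_csv(csv_source, low_memory=False)
    # keep the three source columns and drop rows with any missing value
    df = df.loc[:, ["title", "release_date", "overview"]]
    df = df.dropna(axis=0).reset_index(drop=True)
    # derive the release year from the release date
    df["release_year"] = pd.to_datetime(df["release_date"]).dt.year
    return df.loc[:, ["title", "release_year", "overview"]]


result = extract_descriptors(io.StringIO(SAMPLE_CSV))
print(result)
```

The two incomplete sample rows are dropped, which is the same mechanism that reduces the full source file to the rows kept in this dataset.
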
Provided by:
mt0rm0
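Summary of original information

Dataset Card

Dataset Details

Dataset Description

This dataset is a subset extracted from Kaggle's The Movies Dataset that contains only the name, release year and overview of each film. It has 44435 rows and 3 columns: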

  • name: includes the title of the movie
  • release_year: indicates the year of release
  • overview: provides a brief description of each movie, used for advertisement
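
The three-column layout above can be checked on individual records with a small validator. This is only a sketch: `validate_row` is a hypothetical helper, the type expectations (integer year, non-empty text fields) are assumptions drawn from the column descriptions, and the sample records are invented.

```python
EXPECTED_COLUMNS = ("name", "release_year", "overview")


def validate_row(row):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in row]
    if missing:
        problems.append(f"missing columns: {missing}")
    # assumption: the release year is stored as an integer
    if not isinstance(row.get("release_year"), int):
        problems.append("release_year is not an integer")
    # assumption: name and overview are non-empty text
    for col in ("name", "overview"):
        value = row.get(col)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{col} is empty or not text")
    return problems


# Invented records for illustration.
good = {"name": "Heat", "release_year": 1995,
        "overview": "A detective hunts a crew of bank robbers."}
bad = {"name": "Untitled", "release_year": None, "overview": ""}

print(validate_row(good))
print(validate_row(bad))
```
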

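Language(s) (NLP): English

License: cc0-1.0

Dataset Sources

This dataset is a subset of Kaggle's The Movies Dataset. We used only the <kbd>movies_metadata.csv</kbd> file, extracted some features (see Dataset Description) and dropped the incomplete rows.

The original dataset has a cc0-1.0 license, and we have maintained it.

Dataset Creation

Data Collection and Processing

The data was downloaded from Kaggle as a zip file. The file <kbd>movies_metadata.csv</kbd> was then extracted.

The data was processed with the following code: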

```python
import pandas as pd

# load the csv file
df = pd.read_csv("movies_metadata.csv", low_memory=False)

# select the required columns, drop rows with missing values and
# reset the index
df = df.loc[:, ['title', 'release_date', 'overview']]
df = df.dropna(axis=0).reset_index(drop=True)

# make a new column with the release year
df.loc[:, 'release_year'] = pd.to_datetime(df.release_date).dt.year

# select the columns in the desired order
df = df.loc[:, ['title', 'release_year', 'overview']]

# save the data to parquet
df.to_parquet('descriptors_data.parquet')
```

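Who are the source data producers?

This dataset is an ensemble of data collected by Rounak Banik from TMDB and GroupLens. In particular, the movies metadata has been collected from the TMDB Open API, but the source dataset is not endorsed or certified by TMDb.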
