mt0rm0/movie_descriptors
Hugging Face, updated 2023-12-04, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/mt0rm0/movie_descriptors
---
license: cc0-1.0
task_categories:
- sentence-similarity
language:
- en
pretty_name: '"Movie descriptors for Semantic Search"'
size_categories:
- 10K<n<100K
tags:
- movies
- embeddings
- semantic search
- films
- hpi
- workshop
---
# Dataset Card
This dataset is a subset of Kaggle's The Movies Dataset. It contains only the name, release year and overview of every film in the original dataset for which all three fields are present.
It is intended as a toy dataset for learning about embeddings in a workshop from the AI Service Center Berlin-Brandenburg at the Hasso Plattner Institute.
This dataset has a smaller version [here](https://huggingface.co/datasets/mt0rm0/movie_descriptors_small).
## Dataset Details
### Dataset Description
The dataset has 44435 rows and 3 columns:
- 'name': the title of the movie
- 'release_year': the year the movie was released
- 'overview': a brief description of the movie, of the kind used for advertisement.
**Curated by:** [Mario Tormo Romero](https://huggingface.co/mt0rm0)
**Language(s) (NLP):** English
**License:** cc0-1.0
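As a quick sanity check after loading the data, the description above can be turned into a small schema test. This is a minimal sketch, not part of the original card: `check_schema` and the sample rows are hypothetical. Note that the column list above calls the first column `name`, while the processing code further down writes it as `title`, so the expected names are passed as a parameter rather than hard-coded.

```python
import pandas as pd


def check_schema(df, columns=("name", "release_year", "overview")):
    """Return a list of problems; an empty list means the frame matches."""
    problems = []
    # The card describes exactly three columns in this order.
    if list(df.columns) != list(columns):
        problems.append(f"unexpected columns: {list(df.columns)}")
    # Incomplete rows were dropped, so no missing values are expected.
    if df.isna().any().any():
        problems.append("missing values present")
    return problems


# Invented rows shaped like the dataset (hypothetical values).
sample = pd.DataFrame(
    {
        "name": ["Film A", "Film B"],
        "release_year": [1995, 2010],
        "overview": ["A toy comes to life.", "A heist inside dreams."],
    }
)
print(check_schema(sample))
```

If the loaded file uses `title` instead, calling `check_schema(df, columns=("title", "release_year", "overview"))` applies the same checks.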
### Dataset Sources
This dataset is a subset of Kaggle's [The Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset).
We used only the <kbd>movies_metadata.csv</kbd> file, extracted some features (see Dataset Description) and dropped the rows that were not complete.
The original Dataset has a cc0-1.0 License and we have maintained it.
## Uses
This is a toy dataset created for pedagogical purposes. It is used in the **Working with embeddings** workshop created and organized by the [AI Service Center Berlin-Brandenburg](https://hpi.de/kisz/) at the [Hasso Plattner Institute](https://hpi.de/).
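To give a rough idea of the kind of exercise the dataset is meant for, the sketch below ranks overviews against a query by cosine similarity. It is only an illustration under loud assumptions: the "embedding" is a plain word-count vector standing in for a learned embedding model, the three records are invented, and `embed`, `cosine` and `search` are hypothetical helper names, not workshop code.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented records shaped like the dataset's rows (hypothetical values).
movies = [
    ("Film A", 1995, "A cowboy toy feels threatened by a new space ranger toy."),
    ("Film B", 2010, "A thief steals secrets by entering the dreams of others."),
    ("Film C", 1979, "The crew of a spaceship is hunted by an alien creature."),
]


def search(query, k=2):
    """Return the k (score, title) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(overview)), title) for title, _, overview in movies]
    return sorted(scored, reverse=True)[:k]


results = search("alien creature on a spaceship")
for score, title in results:
    print(f"{title}: {score:.3f}")
```

With a real embedding model, only `embed` would change; the ranking step stays the same.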
## Dataset Creation
### Curation Rationale
With this dataset we want to provide a fast way to obtain the data required for our workshops, without having to download huge datasets that contain far more information than needed.
### Source Data
Our source is Kaggle's The Movies Dataset, so the information comes from the MovieLens Dataset. The dataset consists of movies released on or before July 2017.
#### Data Collection and Processing
The data was downloaded from [Kaggle](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) as a zip file. The file <kbd>movies_metadata.csv</kbd> was then extracted.
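The extraction step can also be scripted with the standard library. The following is a minimal sketch, assuming the Kaggle download has been saved locally; the helper name `extract_metadata` and any archive path passed to it are hypothetical, since the card does not state the zip file's name.

```python
import zipfile
from pathlib import Path


def extract_metadata(zip_path, dest_dir=".", member="movies_metadata.csv"):
    """Extract a single member from the downloaded archive and return its path.

    Raises KeyError if the archive does not contain `member`.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.extract returns the path of the file it created.
        return Path(archive.extract(member, path=dest_dir))
```

Calling `extract_metadata("archive.zip")` (with whatever name the download actually has) would place `movies_metadata.csv` in the current directory.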
The data was processed with the following code:
```python
import pandas as pd
# load the csv file
df = pd.read_csv("movies_metadata.csv", low_memory=False)
# select the required columns, drop rows with missing values and
# reset the index
df = df.loc[:, ['title', 'release_date', 'overview']]
df = df.dropna(axis=0).reset_index(drop=True)
# make a new column with the release year
df.loc[:, 'release_year'] = pd.to_datetime(df.release_date).dt.year
# select the columns in the desired order
df = df.loc[:, ['title', 'release_year', 'overview']]
# save the data to parquet
df.to_parquet('descriptors_data.parquet')
```
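To see what each step does, the same pipeline can be run on a few invented rows. This is only a sketch: the sample rows and the `build_descriptors` wrapper are hypothetical and not part of the original processing script, but the column selection, `dropna` call and year extraction mirror the code above.

```python
import io

import pandas as pd

# Hypothetical stand-in for movies_metadata.csv (invented values).
sample_csv = io.StringIO(
    "title,release_date,overview,budget\n"
    "Film A,1995-10-30,A toy comes to life.,1\n"
    "Film B,,Missing release date.,2\n"
    "Film C,2001-05-18,,3\n"
    "Film D,2010-07-16,A heist inside dreams.,4\n"
)


def build_descriptors(raw):
    """Apply the card's processing steps to a raw metadata frame."""
    # Keep only the three source columns, then drop incomplete rows.
    df = raw.loc[:, ["title", "release_date", "overview"]]
    df = df.dropna(axis=0).reset_index(drop=True)
    # Derive the release year from the release date.
    df["release_year"] = pd.to_datetime(df["release_date"]).dt.year
    # Return the columns in the desired order.
    return df.loc[:, ["title", "release_year", "overview"]]


descriptors = build_descriptors(pd.read_csv(sample_csv))
print(descriptors)
```

Only "Film A" and "Film D" survive, because the other two rows each lack one of the three required fields.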
#### Who are the source data producers?
This dataset is a collection of data gathered by [Rounak Banik](https://www.kaggle.com/rounakbanik) from TMDB and GroupLens.
In particular, the movie metadata was collected from the TMDB Open API, but the source dataset is not endorsed or certified by TMDB.
Provider: mt0rm0



