TMDB 5000 Movie Dataset
Source: www.kaggle.com; updated 2017-09-28; indexed 2025-01-16
Download link:
https://www.kaggle.com/tmdb/tmdb-movie-metadata
Description:
### Background
What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?
This is a great place to start digging into those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.
### Data Source Transfer Summary
We (Kaggle) have removed the original version of this dataset per a [DMCA](https://en.wikipedia.org/wiki/Digital_Millennium_Copyright_Act) takedown request from IMDB. To minimize the impact, we're replacing it with a similar set of films and data fields from [The Movie Database (TMDb)](https://www.themoviedb.org) in accordance with [their terms of use](https://www.themoviedb.org/documentation/api/terms-of-use). The bad news is that kernels built on the old dataset will most likely no longer work.
The good news is that:
- You can port your existing kernels over with a bit of editing. [This kernel](https://www.kaggle.com/sohier/getting-imdb-kernels-working-with-tmdb-data/) offers functions and examples for doing so. You can also find [a general introduction to the new format here](https://www.kaggle.com/sohier/tmdb-format-introduction).
- The new dataset contains full credits for both the cast and the crew, rather than just the first three actors.
- Actors and actresses are now listed in the order they appear in the credits. It's unclear what ordering the original dataset used; for the movies I spot-checked, it didn't line up with either the credits order or IMDB's stars order.
- The revenues appear to be more current. For example, IMDB's figures for Avatar seem to be from 2010 and understate the film's global revenues by over $2 billion.
- Some of the movies that we weren't able to port over (a couple of hundred) were just bad entries. For example, [this IMDB entry](http://www.imdb.com/title/tt5289954/?ref_=fn_t...) has basically no accurate information at all. It lists Star Wars Episode VII as a documentary.
### Data Source Transfer Details
- Several of the new columns contain JSON. You can save a bit of time by porting the load data functions [from this kernel]().
- Even simple fields like runtime may not be consistent across versions. For example, the previous dataset shows the duration of Avatar's extended cut, while TMDB shows the running time of the original version.
- There's now a separate file containing the full credits for both the cast and crew.
- All fields are filled out by users, so don't expect them to agree on keywords, genres, ratings, or the like.
- Your existing kernels will continue to render normally until they are re-run.
- If you are curious about how this dataset was prepared, the code to access TMDb's API is posted [here](https://gist.github.com/SohierDane/4a84cb96d220fc4791f52562be37968b).
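As a rough illustration of the two points above (JSON-valued columns and a separate credits file), the sketch below decodes JSON columns while reading a CSV and then attaches cast and crew to each movie by id. It uses only the standard library. The column names `genres`, `cast`, `crew`, and `movie_id`, and the tiny in-memory rows, are assumptions made for the example; check them against the actual files before relying on this.

```python
import csv
import io
import json

# Columns assumed (not confirmed by the description) to hold JSON-encoded lists.
JSON_COLUMNS = ("genres", "production_companies", "cast", "crew")


def load_rows(handle, json_columns=JSON_COLUMNS):
    """Read CSV rows and decode any JSON columns that are present and non-empty."""
    rows = []
    for row in csv.DictReader(handle):
        for col in json_columns:
            if col in row and row[col]:
                row[col] = json.loads(row[col])
        rows.append(row)
    return rows


def attach_credits(movies, credits, movie_key="id", credit_key="movie_id"):
    """Attach cast and crew lists to each movie by matching ids."""
    by_id = {c[credit_key]: c for c in credits}
    merged = []
    for movie in movies:
        credit = by_id.get(movie[movie_key], {})
        merged.append({**movie,
                       "cast": credit.get("cast", []),
                       "crew": credit.get("crew", [])})
    return merged


# Made-up stand-ins for the movies file and the credits file.
movies_csv = io.StringIO(
    'id,original_title,genres\n'
    '1,Example Film,"[{""id"": 16, ""name"": ""Animation""}]"\n'
)
credits_csv = io.StringIO(
    'movie_id,cast,crew\n'
    '1,"[{""name"": ""A. Actor"", ""order"": 0}]",'
    '"[{""name"": ""D. Director"", ""job"": ""Director""}]"\n'
)

movies = load_rows(movies_csv)
credits = load_rows(credits_csv)
merged = attach_credits(movies, credits)
print(merged[0]["genres"][0]["name"], merged[0]["crew"][0]["job"])
```

To use this on the real data, pass file handles opened with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` objects.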
New columns:
- homepage
- id
- original_title
- overview
- popularity
- production_companies
- production_countries
- release_date
- spoken_languages
- status
- tagline
- vote_average
Lost columns:
- actor_1_facebook_likes
- actor_2_facebook_likes
- actor_3_facebook_likes
- aspect_ratio
- cast_total_facebook_likes
- color
- content_rating
- director_facebook_likes
- facenumber_in_poster
- movie_facebook_likes
- movie_imdb_link
- num_critic_for_reviews
- num_user_for_reviews
### Open Questions About the Data
There are some things we haven't had a chance to confirm about the new dataset. If you have any insights, please let us know in the forums!
- Are the budgets and revenues all in US dollars? Do they consistently show the global revenues?
- This dataset hasn't yet gone through a data quality analysis. Can you find any obvious corrections? For example, in the IMDb version it was necessary to treat values of zero in the budget field as missing. Similar findings would be very helpful to your fellow Kagglers! (It's probably a good idea to keep treating zeros as missing, with the caveat that missing budgets are much more likely to have come from small-budget films in the first place.)
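One way to apply the "treat zeros as missing" suggestion is sketched below. It assumes the budget arrives as a string or number in a field named `budget`, and the sample rows are invented for illustration. Because missing budgets probably skew toward small films, a statistic computed on the remaining values is likely biased upward.

```python
import statistics


def to_amount(value):
    """Convert a budget or revenue field to a float; zero, blank, or invalid becomes None."""
    try:
        amount = float(value)
    except (TypeError, ValueError):
        return None
    return amount if amount > 0 else None


def summarize_missing(rows, field="budget"):
    """Count missing values and report the median of the known ones."""
    amounts = [to_amount(row.get(field)) for row in rows]
    known = [a for a in amounts if a is not None]
    return {
        "total": len(amounts),
        "missing": len(amounts) - len(known),
        "median_known": statistics.median(known) if known else None,
    }


# Invented rows: two usable budgets, one zero, one blank.
sample_rows = [
    {"budget": "237000000"},
    {"budget": "0"},
    {"budget": ""},
    {"budget": "1000000"},
]
summary = summarize_missing(sample_rows)
print(summary)
```

The same helper can be pointed at a revenue field by passing `field="revenue"`, assuming that column exists and follows the same convention.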
### Inspiration
- Can you categorize the films by type, such as animated or not? We don't have explicit labels for this, but it should be possible to build them from the crew's job titles.
- How sharp is the divide between major film studios and the independents? Do those two groups fall naturally out of a clustering analysis or is something more complicated going on?
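The first idea above, building an animated/not-animated label from crew job titles, could start from a heuristic like the one sketched here. The keyword list, the 0.3 threshold, and the `job` and `department` keys are all assumptions chosen for illustration rather than properties confirmed by the dataset description. Live-action films with visual effects also credit animators, which is why the sketch looks at the share of animation-related credits instead of their mere presence.

```python
# Hypothetical keywords; tune them after inspecting real job titles.
ANIMATION_HINTS = ("animation", "animator")


def animation_share(crew):
    """Fraction of crew entries whose job or department mentions animation."""
    if not crew:
        return 0.0
    hits = 0
    for member in crew:
        text = f"{member.get('job', '')} {member.get('department', '')}".lower()
        if any(hint in text for hint in ANIMATION_HINTS):
            hits += 1
    return hits / len(crew)


def is_probably_animated(crew, threshold=0.3):
    """Label a film as animated when enough of its crew works in animation."""
    return animation_share(crew) >= threshold


# Invented crew lists for two imaginary films.
animated_crew = [
    {"job": "Director", "department": "Directing"},
    {"job": "Character Animator", "department": "Animation"},
    {"job": "Animation Supervisor", "department": "Visual Effects"},
]
live_action_crew = [
    {"job": "Director", "department": "Directing"},
    {"job": "Producer", "department": "Production"},
    {"job": "Editor", "department": "Editing"},
    {"job": "Director of Photography", "department": "Camera"},
    {"job": "Animator", "department": "Visual Effects"},
]

print(is_probably_animated(animated_crew), is_probably_animated(live_action_crew))
```

A labeled sample checked by hand would be needed to judge how well any such threshold works on the actual credits.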
### Acknowledgements
This dataset was generated from [The Movie Database](https://www.themoviedb.org) API. This product uses the TMDb API but is not endorsed or certified by TMDb.
Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can [try it for yourself here](https://www.themoviedb.org/documentation/api).

Provided by:
Kaggle
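Background overview
The TMDB 5000 Movie Dataset is a collection of detailed information on several thousand films, sourced from The Movie Database. It covers fields such as plot, cast, budget, and revenue, and is intended for analyzing the factors behind a film's success and for predicting ratings. The dataset was migrated from an earlier IMDB-based version and offers more complete cast and crew lists and more current revenue figures, but it contains inconsistencies. It is suited to film-industry research and machine-learning projects.
The summary above was collected and generated by 遇见数据集.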



