350 000+ movies from themoviedb.org

Source: www.kaggle.com · Updated 2017-10-12 · Indexed 2025-03-23
Download link:
https://www.kaggle.com/stephanerappeneau/350-000-movies-from-themoviedborg
Description:
### Context

I love movies. I tend to avoid marvel-transformers-standardized products, and prefer a mix of classic Hollywood golden-age and obscure Polish artsy movies. Throw in an occasional japanese-zombie-slasher-giallo as an alibi. Good movies don't exist without bad movies.

On average I watch 200+ movies each year, with peaks at more than 500 movies. Nine years ago I started to log my movies to avoid watching the same movie twice, and also to assign scores. Over the years, it gave me a couple of insights on my viewing habits, but nothing more than what a tenth-grader would learn at school.

I've recently subscribed to Netflix and it pains me to see the global inefficiency of recommendation systems for people like me, who mostly swear by "La politique des auteurs". It's a term from the French new-wave critics around the famous André Bazin (it is usually credited to François Truffaut), meaning that the quality of a movie is essentially linked to the director and his capacity to execute his vision with his crew. We could debate how much it depends on the movie production pipeline, but let's not for now. Practically, what it means is that I essentially watch movies from directors who made films I've liked.

I suspect Netflix calibrates its recommendation models by taking into account the way the "average joe" chooses a movie. A few months ago I read a study based on a survey, showing that people chose a movie mostly based on genre (55%), then on leading actors (45%). Director and release date were far behind, at around 10% each. It is not surprising, since most people I know don't care who the director is. Lots of US blockbusters don't even mention the director on the movie poster.

I am aware that collaborative filtering is based on user proximity, which I believe decreases (or even eliminates) the need to characterize a movie. So here I'm more interested in content-based filtering, which is based on product proximity, for several reasons:

- Users' tastes are not easily accessible.
It is, after all, Netflix's treasure chest.
- The movie offer on Netflix is so bad for someone who likes auteur movies that it wouldn't help.
- Modeling a movie's intrinsic qualities is a nice challenge.

Enough. "*The secret of getting ahead is getting started*" (Mark Twain)

![network graph][1]

### Content

The primary source is www.themoviedb.org. If you watch obscure artsy Romanian homemade movies you may find only 95% of your movies referenced... but for anyone else it should be in the 98%+ range.

- Movie details are from the www.themoviedb.org API: movies/details
- Movie crew & casting are from the www.themoviedb.org API: movies/credits
- Both can be joined by id
- They contain all 350k movies, from the end of the 19th century up to August 2017. If you remove short movies from imdb you get a similar number of movies.
- I uploaded the program to retrieve incremental movie details on github: https://github.com/stephanerappeneau/scienceofmovies/tree/master/PycharmProjects/GetAllMovies (you need a dev API key from themoviedb.org though)
- I have tried various supervised (decision tree) / unsupervised (clustering, NLP) approaches described in the discussions; the source code is on github: https://github.com/stephanerappeneau/scienceofmovies
- As a bonus I've uploaded the bio summaries of the top 500 critically-acclaimed directors from wikipedia, for some interesting NLTK analysis

Here is an overview of the available sources that I've tried:

**• Imdb.com free csv dumps** (ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/temporaryaccess/) are badly documented, incomplete, loosely structured and impossible to join/merge. There's an API hosted by Amazon Web Services: 1€ every 100 000 requests. With around 1 million movies, it could become expensive, and the features are bare. So I searched for other sources.

**• www.themoviedb.org** is based on crowdsourcing and has an excellent API, limited to 40 requests every 10 seconds.
It is quite generous, well documented, and enough to sweep the 450 000 movies in a few days. For my purpose, data quality is not significantly worse than imdb, and as the imdb key is also included there's always the possibility to complete my dataset later (I actually did it).

**• www.Boxofficemojo.com** has some interesting budget/revenue figures (which are sorely lacking in both imdb & tmdb), but it actually tracks only a few thousand movies, mainly blockbusters. There are other professional sources that are used by the film industry to get better predictive / marketing insights, but that's beyond my reach for this experiment.

**• www.wikipedia.com** is an interesting source with no real cap on API calls. However, it requires a bit of web scraping, and for movies or directors the layout and quality vary a lot, so I suspected it would take a lot of work to get insights and I gave this source a lower priority.

**• www.google.com** will ban you after a few minutes of web scraping, because their job is to scrape data from others, then sell it, duh.

• It's worth mentioning that there are a **few dumps of Netflix** anonymized user tastes on kaggle, because they've organised a few competitions to improve their recommendation models: https://www.kaggle.com/netflix-inc/netflix-prize-data

• Online databases are largely white anglo-saxon centric, meaning that the Bollywood offer (India is the 2nd biggest producer of movies) is mostly absent from datasets. I'm fine with that, as it's not my cup of tea, plus I lack domain knowledge. The sheer amount of Indian movies would probably skew my results anyway (I don't want to have too many martial-arts-musicals in my recommendations ;-)). I have, however, tremendous respect for the Indian movie industry, so I'd love to collaborate with an Indian cinephile!
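
The figures quoted in this section (40 requests every 10 seconds, roughly 450 000 ids to sweep, details and credits joined by id) can be sanity-checked with a short sketch. It is only an illustration under stated assumptions: the names `sweep_hours` and `join_by_id`, the one-request-per-endpoint-per-movie assumption, and the sample records are all hypothetical, and no call to the real API is made.

```python
from typing import Dict, List

REQUESTS_PER_WINDOW = 40  # rate limit quoted in the text
WINDOW_SECONDS = 10


def sweep_hours(n_movies: int, calls_per_movie: int = 2) -> float:
    """Minimum sweep time in hours when running exactly at the rate limit.

    calls_per_movie=2 assumes one movies/details request and one
    movies/credits request per movie (an assumption for illustration).
    """
    total_calls = n_movies * calls_per_movie
    seconds = total_calls * WINDOW_SECONDS / REQUESTS_PER_WINDOW
    return seconds / 3600


def join_by_id(details: List[dict], credits: List[dict]) -> List[dict]:
    """Inner join of two lists of records on their shared 'id' field."""
    credits_by_id: Dict[int, dict] = {row["id"]: row for row in credits}
    joined = []
    for row in details:
        match = credits_by_id.get(row["id"])
        if match is not None:
            # Merge both records; 'id' is identical on both sides.
            joined.append({**row, **match})
    return joined


if __name__ == "__main__":
    print(f"{sweep_hours(450_000):.1f} hours at the rate limit")

    # Made-up records, only to show the shape of the join.
    details = [{"id": 1, "title": "Movie A"}, {"id": 2, "title": "Movie B"}]
    credits = [{"id": 1, "director": "Director X"}]
    print(join_by_id(details, credits))
```

Under these assumptions the sweep takes 62.5 hours, about 2.6 days, which is consistent with "a few days". The join is an inner join, so a movie without a credits record is dropped.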
![Westerns][2]

### Inspiration

Starting from there, I had multiple problem statements for both supervised / unsupervised machine learning:

- Can I program a tailored recommendation system based on my own criteria?
- What are the characteristics of the movies/directors I like the most?
- What is the probability that I will like my next movie?
- Can I find the data?

One of the objectives of sharing my work here is to find cinephile data scientists who might be interested and, hopefully, contribute or share insights :)

Other interesting leads: use the tagline for NLP/clustering/genre guessing, leverage budget/revenue, link with other data sources using the imdb normalized title, etc.

![Correlation matrix][3]

### Motivation, Disclaimer and Acknowledgements

- I graduated from a French engineering school, majoring in artificial intelligence, but that was 17 years ago, right in the middle of the A.I. winter. Like a lot of white male rocket scientists, I ended up in one of the leading European investment banks, quickly abandoning IT development to specialize in trading/risk project management and internal politics. My recent appointment in the Data Office made me aware of recent breakthroughs in data science, and I thought that developing a side project would be an excellent occasion to learn something new. Plus it would give me some much-needed credibility, which decision makers too often lack when it comes to data science.
- I've worked on some of the features with Cédric Paternotte, a friend of mine who is a professor of philosophy of science at La Sorbonne. Working with someone with a different background seemed a good idea for motivation, creativity and rigor.
- Kudos to the www.themoviedb.org and www.wikipedia.com sites, which really have a great attitude towards open data. This is typically NOT the case of modern big-data companies, which mostly keep data to themselves to try to monetize it.
Such a huge contrast with the imdb or instagram APIs, which generously let you grab your last 3 comments at a miserable rate. Even if 15 years ago this seemed a mandatory path to get services for free, I predict that one day governments will need to break this data monopoly.

*[Disclaimer: I apologize in advance for my engrish (I'm French ^-^), any bad code I've written (there are probably hundreds of ways to do it better and faster), any pseudo-scientific assumption I've made (I'm slowly getting back into statistics and lack senior guidance: one day I regress a non-stationary time series and the day after I discover I shouldn't have), and any incorrect use of machine-learning models]*

![powered by themoviedb.org][4]

[1]: https://img11.hostingpics.net/pics/117765networkgraph.png
[2]: https://img11.hostingpics.net/pics/340226westerns.png
[3]: https://img11.hostingpics.net/pics/977004matrice.png
[4]: https://img11.hostingpics.net/pics/898068408x161poweredbyrectanglegreen.png
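
The first question under Inspiration, a recommendation tailored to personal criteria, can be sketched in its simplest content-based form: rank unseen movies by the average score the viewer gave to other films by the same director. This is a toy illustration of the "politique des auteurs" idea, not the approach used in the github repository; the log format, the names `director_means` and `rank_candidates`, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def director_means(log: List[Tuple[str, str, float]]) -> Dict[str, float]:
    """Average personal score per director from (title, director, score) rows."""
    scores = defaultdict(list)
    for _title, director, score in log:
        scores[director].append(score)
    return {d: sum(s) / len(s) for d, s in scores.items()}


def rank_candidates(
    candidates: List[Tuple[str, str]],
    means: Dict[str, float],
    default: float,
) -> List[str]:
    """Order unseen (title, director) pairs by the director's mean score.

    Directors absent from the log receive `default` (for example the
    overall mean of the log), which is a design assumption of this sketch.
    """
    scored = [(means.get(director, default), title) for title, director in candidates]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [title for _score, title in ranked]


if __name__ == "__main__":
    # Made-up viewing log: (title, director, personal score out of 10).
    log = [
        ("Film 1", "Director A", 9.0),
        ("Film 2", "Director A", 7.0),
        ("Film 3", "Director B", 4.0),
    ]
    means = director_means(log)
    overall = sum(score for _t, _d, score in log) / len(log)

    candidates = [
        ("New B", "Director B"),
        ("New C", "Director C"),  # director never seen before
        ("New A", "Director A"),
    ]
    print(rank_candidates(candidates, means, default=overall))
```

Director is the only feature here; genre, cast or tagline could be added as further features in the same spirit, at the cost of needing a real similarity measure or model.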

Provider:
Kaggle