# MyAnimeList Database 2020
> Recommendation data from 320,000 users and 16,000 anime at myanimelist.net
This dataset contains information about 17,562 anime and the preferences of 325,772 different users. In particular, this dataset contains:
- The anime list per user, including dropped, completed, plan to watch, currently watching and on hold entries.
- Ratings given by users to the anime that they have watched completely.
- Information about each anime, such as genre, stats, studio, etc.
- HTML files with anime information for practicing data scraping. These files contain information such as reviews, synopsis, information about the staff, anime statistics, genre, etc.
Also, the code used to collect the data is available on GitHub: https://github.com/Hernan4444/MyAnimeList-Database.
### Warning: this dataset includes information about anime for adults (hentai).
## Content
**The data was scraped between February 26th and March 20th.**
* The "html" folder contains 1 zip per anime (17,562 different anime). Each zip contains different HTML pages scraped from [MyAnimeList](https://myanimelist.net/). The scraped pages are:
1. Main page
2. Reviews
3. Recommendations
4. Stats
5. Characters & Staff
I uploaded only 2 files as examples to avoid increasing the size of this dataset. All HTML files are available at this link: https://drive.google.com/drive/folders/12ghJk-sWyXXORoLBUpPirK4YdtIaZPV_?usp=sharing
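As a rough illustration of working with one of these per-anime archives, the sketch below reads every HTML page from a zip into memory using only the Python standard library. The helper name `list_html_pages`, the member file names, and the UTF-8 encoding are assumptions made for the example; inspect a real archive to confirm how its pages are named and encoded.

```python
import io
import zipfile


def list_html_pages(zip_source):
    """Return {member name: decoded HTML text} for every .html file in a zip.

    `zip_source` can be a path or a file-like object. UTF-8 is assumed;
    undecodable bytes are replaced rather than raising an error.
    """
    pages = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".html"):
                raw = archive.read(name)
                pages[name] = raw.decode("utf-8", errors="replace")
    return pages


# Build a tiny in-memory archive so the sketch runs without the real data.
# The file names below are made up for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("details.html", "<html><title>Cowboy Bebop</title></html>")
    demo.writestr("stats.html", "<html><body>stats</body></html>")
buffer.seek(0)

demo_pages = list_html_pages(buffer)
print(sorted(demo_pages))
```

With the real data, the same function could be called with the path of one of the downloaded zip files instead of the in-memory buffer.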
* `animelist.csv` has the list of all anime registered by each user with the respective score, watching status and number of episodes watched. This dataset contains 109 million rows, 17,562 different anime and 325,772 different users. The file has the following columns:
1. user_id: non-identifiable, randomly generated user ID.
2. anime_id: MyAnimeList ID of the anime. (e.g. 1)
3. score: score between 1 and 10 given by the user. 0 if the user didn't assign a score. (e.g. 10)
4. watching_status: status ID of this anime in this user's anime list. (e.g. 2)
5. watched_episodes: number of episodes watched by the user. (e.g. 24)
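Because this file has around 109 million rows, it may be more practical to stream it row by row than to load it all at once. The following is a minimal sketch using only the standard library; it assumes the CSV header matches the column names listed above, and the helper name `count_by_status` and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def count_by_status(lines):
    """Count rows per watching_status while streaming through the file."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[int(row["watching_status"])] += 1
    return counts


# Stand-in for open("animelist.csv", newline="", encoding="utf-8").
# The rows are invented; the header is assumed from the column list above.
sample = io.StringIO(
    "user_id,anime_id,score,watching_status,watched_episodes\n"
    "0,1,10,2,26\n"
    "0,20,0,1,4\n"
    "1,1,8,2,26\n"
)
status_counts = count_by_status(sample)
print(dict(status_counts))
```

Only one row is held in memory at a time, so the same loop should work on the full file, although a pass over every row will still take a while.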
* `watching_status.csv` describes every possible status of the `watching_status` column in `animelist.csv`.
* `rating_complete.csv` is a subset of `animelist.csv`. This dataset only considers anime that the user has watched completely (`watching_status==2`) and given a score (`score!=0`). This dataset contains 57 million ratings applied to 16,872 anime by 310,059 users. This file has the following columns:
1. user_id: non-identifiable, randomly generated user ID.
2. anime_id: MyAnimeList ID of the anime that this user has rated.
3. rating: rating that this user has assigned.
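The filter that defines this subset can be sketched as follows. This is only an illustration of the two stated conditions (`watching_status==2` and `score!=0`) applied to rows shaped like `animelist.csv`, not necessarily the code that produced the file; the function name `completed_ratings` and the sample rows are invented.

```python
import csv
import io

COMPLETED = 2  # status ID from the `watching_status==2` condition above


def completed_ratings(lines):
    """Yield (user_id, anime_id, rating) for completed entries with a score."""
    for row in csv.DictReader(lines):
        score = int(row["score"])
        if int(row["watching_status"]) == COMPLETED and score != 0:
            yield int(row["user_id"]), int(row["anime_id"]), score


# Invented rows standing in for animelist.csv.
sample = io.StringIO(
    "user_id,anime_id,score,watching_status,watched_episodes\n"
    "0,1,10,2,26\n"   # status 2 and scored: kept
    "0,20,0,1,4\n"    # other status, no score: dropped
    "1,1,0,2,26\n"    # status 2 but unscored: dropped
    "1,5,7,3,2\n"     # scored but status is not 2: dropped
)
ratings = list(completed_ratings(sample))
print(ratings)
```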
* `anime.csv` contains general information about every anime (17,562 different anime) such as genre, stats, studio, etc. This file has the following columns:
1. MAL_ID: MyAnimeList ID of the anime. (e.g. 1)
2. Name: full name of the anime. (e.g. Cowboy Bebop)
3. Score: average score of the anime given by all users in the MyAnimeList database. (e.g. 8.78)
4. Genres: comma-separated list of genres for this anime. (e.g. Action, Adventure, Comedy, Drama, Sci-Fi, Space)
5. English name: full name of the anime in English. (e.g. Cowboy Bebop)
6. Japanese name: full name of the anime in Japanese. (e.g. カウボーイビバップ)
7. Type: TV, movie, OVA, etc. (e.g. TV)
8. Episodes: number of episodes. (e.g. 26)
9. Aired: broadcast date. (e.g. Apr 3, 1998 to Apr 24, 1999)
10. Premiered: season premiere. (e.g. Spring 1998)
11. Producers: comma-separated list of producers. (e.g. Bandai Visual)
12. Licensors: comma-separated list of licensors. (e.g. Funimation, Bandai Entertainment)
13. Studios: comma-separated list of studios. (e.g. Sunrise)
14. Source: Manga, Light novel, Book, etc. (e.g. Original)
15. Duration: duration of the anime per episode. (e.g. 24 min. per ep.)
16. Rating: age rating. (e.g. R - 17+ (violence & profanity))
17. Ranked: position based on the score. (e.g. 28)
18. Popularity: position based on the number of users who have added the anime to their list. (e.g. 39)
19. Members: number of community members that are in this anime's "group". (e.g. 1251960)
20. Favorites: number of users who have the anime as "favorites". (e.g. 61971)
21. Watching: number of users who are watching the anime. (e.g. 105808)
22. Completed: number of users who have completed the anime. (e.g. 718161)
23. On-Hold: number of users who have the anime on hold. (e.g. 71513)
24. Dropped: number of users who have dropped the anime. (e.g. 26678)
25. Plan to Watch: number of users who plan to watch the anime. (e.g. 329800)
26. Score-10: number of users who scored 10. (e.g. 229170)
27. Score-9: number of users who scored 9. (e.g. 182126)
28. Score-8: number of users who scored 8. (e.g. 131625)
29. Score-7: number of users who scored 7. (e.g. 62330)
30. Score-6: number of users who scored 6. (e.g. 20688)
31. Score-5: number of users who scored 5. (e.g. 8904)
32. Score-4: number of users who scored 4. (e.g. 3184)
33. Score-3: number of users who scored 3. (e.g. 1357)
34. Score-2: number of users who scored 2. (e.g. 741)
35. Score-1: number of users who scored 1. (e.g. 1580)
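Two of the column shapes above lend themselves to a short example: the comma-separated list columns (Genres, Producers, Licensors, Studios) and the per-score vote counts. The sketch below splits a list cell and computes a plain arithmetic mean from the `Score-1` to `Score-10` counts. The helper names are invented, the demo row reuses the example values from the column list, and treating non-numeric cells as zero votes is an assumption about how missing values might appear. The plain mean is not guaranteed to match the `Score` column, which may be computed differently.

```python
def split_list(cell):
    """Turn a comma-separated cell such as the Genres column into a list.

    Names that themselves contain a comma would be split incorrectly.
    """
    return [item.strip() for item in cell.split(",") if item.strip()]


def mean_from_score_counts(row):
    """Plain arithmetic mean of the Score-1 ... Score-10 vote counts."""
    total_votes = 0
    weighted_sum = 0
    for score in range(1, 11):
        cell = row.get(f"Score-{score}", "")
        try:
            votes = int(float(cell))
        except ValueError:
            votes = 0  # assumption: non-numeric cells mean "no data"
        total_votes += votes
        weighted_sum += score * votes
    return weighted_sum / total_votes if total_votes else None


# Demo row built from the example values in the column list above.
demo_row = {
    "Genres": "Action, Adventure, Comedy, Drama, Sci-Fi, Space",
    "Score-10": "229170", "Score-9": "182126", "Score-8": "131625",
    "Score-7": "62330", "Score-6": "20688", "Score-5": "8904",
    "Score-4": "3184", "Score-3": "1357", "Score-2": "741",
    "Score-1": "1580",
}
genres = split_list(demo_row["Genres"])
mean_score = mean_from_score_counts(demo_row)
print(genres, round(mean_score, 2))
```

In practice each `row` could come from `csv.DictReader` over `anime.csv`, assuming the header matches the column names listed above.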
## Acknowledgements
Thanks to:
1. [MyAnimeList](https://myanimelist.net/) for providing anime data.
2. [Jikan API](https://jikan.docs.apiary.io/) for providing users' preferences.
3. Pontificia Universidad Católica de Chile for providing servers to run the code.
## Inspiration
1. Have HTML files available to practice scraping without the delay of each request.
2. Experiment with different types of recommender systems, for instance collaborative filtering or approaches based on context such as stats, genre, seiyuu, reviews, synopsis, etc.
3. Use this information to build a better anime recommender system.
4. Identify which features allow us to build the best anime recommender system.
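To make the collaborative filtering idea more concrete, here is a minimal sketch of item-based cosine similarity over `(user_id, anime_id, rating)` triples shaped like `rating_complete.csv`. It is one of many possible approaches rather than a recommended method. The function name and sample ratings are invented, each user is assumed to rate an anime at most once, and the pairwise loop is quadratic in the length of each user's list, so it would need sampling or a sparse-matrix library to run on the full 57 million ratings.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(ratings):
    """Cosine similarity between anime, based on users who rated both.

    `ratings` is an iterable of (user_id, anime_id, rating) triples.
    Returns {(anime_a, anime_b): similarity} with anime_a < anime_b.
    """
    by_user = defaultdict(dict)
    norms = defaultdict(float)  # sum of squared ratings per anime
    for user_id, anime_id, rating in ratings:
        by_user[user_id][anime_id] = rating
        norms[anime_id] += rating * rating

    dots = defaultdict(float)  # dot product per anime pair
    for user_ratings in by_user.values():
        items = sorted(user_ratings)
        for i, a in enumerate(items):
            for b in items[i + 1:]:
                dots[(a, b)] += user_ratings[a] * user_ratings[b]

    return {
        (a, b): dot / (sqrt(norms[a]) * sqrt(norms[b]))
        for (a, b), dot in dots.items()
    }


# Invented ratings: users 0 and 1 agree on anime 1 and 5,
# while user 2 disliked anime 1 and liked anime 20.
sample_ratings = [
    (0, 1, 10), (0, 5, 8),
    (1, 1, 9), (1, 5, 7),
    (2, 1, 2), (2, 20, 9),
]
sims = item_similarities(sample_ratings)
print(sorted(sims.items(), key=lambda kv: -kv[1]))
```

Anime pairs with higher similarity could then be suggested to users who rated one of the two highly.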
## Ideas for the future
1. Build the same dataset for manga and novels.