TheGreatRambler/mm2_level_comments

Name: TheGreatRambler/mm2_level_comments
Creator: TheGreatRambler
Published: 2022-11-11 08:06:48
License: No description available

Source: Hugging Face. Updated: 2022-11-11. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/TheGreatRambler/mm2_level_comments


Resource description:

---
language:
- multilingual
license:
- cc-by-nc-sa-4.0
multilinguality:
- multilingual
size_categories:
- 10M<n<100M
source_datasets:
- original
task_categories:
- other
- object-detection
- text-retrieval
- token-classification
- text-generation
task_ids: []
pretty_name: Mario Maker 2 level comments
tags:
- text-mining
---

# Mario Maker 2 level comments

Part of the [Mario Maker 2 Dataset Collection](https://tgrcode.com/posts/mario_maker_2_datasets)

## Dataset Description

The Mario Maker 2 level comment dataset consists of 31.9 million level comments from Nintendo's online service, totaling around 20GB of data. The dataset was created using the self-hosted [Mario Maker 2 api](https://tgrcode.com/posts/mario_maker_2_api) over the course of 1 month in February 2022.

### How to use it

The Mario Maker 2 level comment dataset is very large, so for most use cases the streaming API of `datasets` is recommended. You can load and iterate through the dataset with the following code:

```python
from datasets import load_dataset

# Stream records instead of downloading the full ~20GB up front
ds = load_dataset("TheGreatRambler/mm2_level_comments", streaming=True, split="train")
print(next(iter(ds)))

# OUTPUT:
# {
#  'data_id': 3000006,
#  'comment_id': '20200430072710528979_302de3722145c7a2_2dc6c6',
#  'type': 2,
#  'pid': '3471680967096518562',
#  'posted': 1561652887,
#  'clear_required': 0,
#  'text': '',
#  'reaction_image_id': 10,
#  'custom_image': [some binary data],
#  'has_beaten': 0,
#  'x': 557,
#  'y': 64,
#  'reaction_face': 0,
#  'unk8': 0,
#  'unk10': 0,
#  'unk12': 0,
#  'unk14': [some binary data],
#  'unk17': 0
# }
```

Comments can be one of three types: text, reaction image or custom image. The `type` field can be used with the enum below to identify the kind of comment. Custom images are binary PNGs.

You can also download the full dataset. Note that this will download ~20GB:

```python
from datasets import load_dataset

ds = load_dataset("TheGreatRambler/mm2_level_comments", split="train")
```

## Data Structure

### Data Instances

```python
{
 'data_id': 3000006,
 'comment_id': '20200430072710528979_302de3722145c7a2_2dc6c6',
 'type': 2,
 'pid': '3471680967096518562',
 'posted': 1561652887,
 'clear_required': 0,
 'text': '',
 'reaction_image_id': 10,
 'custom_image': [some binary data],
 'has_beaten': 0,
 'x': 557,
 'y': 64,
 'reaction_face': 0,
 'unk8': 0,
 'unk10': 0,
 'unk12': 0,
 'unk14': [some binary data],
 'unk17': 0
}
```

### Data Fields

|Field|Type|Description|
|---|---|---|
|data_id|int|The data ID of the level this comment appears on|
|comment_id|string|Comment ID|
|type|int|Type of comment, enum below|
|pid|string|Player ID of the comment creator|
|posted|int|UTC timestamp of when this comment was created|
|clear_required|bool|Whether this comment requires a clear to view|
|text|string|If the comment type is text, the text of the comment|
|reaction_image_id|int|If this comment is a reaction image, the id of the reaction image, enum below|
|custom_image|bytes|If this comment is a custom drawing, the custom drawing as a PNG binary|
|has_beaten|int|Whether the user had beaten the level when they created the comment|
|x|int|The X position of the comment in game|
|y|int|The Y position of the comment in game|
|reaction_face|int|The reaction face of the mii of this user, enum below|
|unk8|int|Unknown|
|unk10|int|Unknown|
|unk12|int|Unknown|
|unk14|bytes|Unknown|
|unk17|int|Unknown|

### Data Splits

The dataset only contains a train split.

## Enums

The dataset contains some enum integer fields. The dictionaries below can be used to convert them back to their string equivalents:

```python
CommentType = {
	0: "Custom Image",
	1: "Text",
	2: "Reaction Image"
}

CommentReactionImage = {
	0: "Nice!",
	1: "Good stuff!",
	2: "So tough...",
	3: "EASY",
	4: "Seriously?!",
	5: "Wow!",
	6: "Cool idea!",
	7: "SPEEDRUN!",
	8: "How?!",
	9: "Be careful!",
	10: "So close!",
	11: "Beat it!"
}

CommentReactionFace = {
	0: "Normal",
	16: "Wink",
	1: "Happy",
	4: "Surprised",
	18: "Scared",
	3: "Confused"
}
```

## Dataset Creation

The dataset was created over a little more than a month in February 2022 using the self-hosted [Mario Maker 2 api](https://tgrcode.com/posts/mario_maker_2_api). As requests made to Nintendo's servers require authentication, the process had to be done with the utmost care, limiting download speed so as not to overload the API and risk a ban. There are no intentions to create an updated release of this dataset.

## Considerations for Using the Data

The dataset consists of comments from many different Mario Maker 2 players globally, and as such their text could contain harmful language. Harmful depictions could also be present in the custom images.
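As an illustration of how the enum dictionaries from the Enums section might be applied, the sketch below decodes a single record into a readable form. The helper name `describe_comment` and the returned dictionary layout are hypothetical and not part of the dataset or the `datasets` library; the sample record is the one shown in Data Instances, with the binary fields replaced by empty bytes.

```python
from datetime import datetime, timezone

# Enum tables copied from the dataset card.
COMMENT_TYPE = {0: "Custom Image", 1: "Text", 2: "Reaction Image"}
COMMENT_REACTION_IMAGE = {
    0: "Nice!", 1: "Good stuff!", 2: "So tough...", 3: "EASY",
    4: "Seriously?!", 5: "Wow!", 6: "Cool idea!", 7: "SPEEDRUN!",
    8: "How?!", 9: "Be careful!", 10: "So close!", 11: "Beat it!",
}
COMMENT_REACTION_FACE = {
    0: "Normal", 16: "Wink", 1: "Happy",
    4: "Surprised", 18: "Scared", 3: "Confused",
}


def describe_comment(record):
    """Hypothetical helper: map the integer enum fields of one record to strings."""
    kind = COMMENT_TYPE.get(record["type"], "Unknown")
    decoded = {
        "data_id": record["data_id"],
        "type": kind,
        # `posted` is documented as a UTC timestamp (seconds since the epoch).
        "posted": datetime.fromtimestamp(record["posted"], tz=timezone.utc).isoformat(),
        "reaction_face": COMMENT_REACTION_FACE.get(record["reaction_face"], "Unknown"),
    }
    if kind == "Reaction Image":
        decoded["reaction_image"] = COMMENT_REACTION_IMAGE.get(
            record["reaction_image_id"], "Unknown"
        )
    elif kind == "Text":
        decoded["text"] = record["text"]
    return decoded


# Sample record from the dataset card (binary fields stubbed out as empty bytes).
sample = {
    "data_id": 3000006,
    "comment_id": "20200430072710528979_302de3722145c7a2_2dc6c6",
    "type": 2, "pid": "3471680967096518562", "posted": 1561652887,
    "clear_required": 0, "text": "", "reaction_image_id": 10,
    "custom_image": b"", "has_beaten": 0, "x": 557, "y": 64,
    "reaction_face": 0, "unk8": 0, "unk10": 0, "unk12": 0,
    "unk14": b"", "unk17": 0,
}
decoded = describe_comment(sample)
print(decoded)
```

Unknown enum values fall back to `"Unknown"` rather than raising, since the card does not state that the listed values are exhaustive.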

Provided by:

TheGreatRambler

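Summary of original information

Dataset overview

Basic information

Name: Mario Maker 2 level comments
Language: multilingual
License: cc-by-nc-sa-4.0
Multilinguality: multilingual
Size: 10M<n<100M
Source: original
Task categories: other, object detection, text retrieval, token classification, text generation
Tags: text mining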

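Dataset description

Content: 31.9 million level comments from Nintendo's online service, about 20GB of data in total.
Collection period: February 2022, lasting about 1 month.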

Data structure

Data instance (field types): { data_id: int, comment_id: string, type: int, pid: string, posted: int, clear_required: bool, text: string, reaction_image_id: int, custom_image: bytes, has_beaten: int, x: int, y: int, reaction_face: int, unk8: int, unk10: int, unk12: int, unk14: bytes, unk17: int }

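Data fields:

Field	Type	Description
data_id	int	Data ID of the level the comment appears on
comment_id	string	Comment ID
type	int	Comment type
pid	string	Player ID of the comment creator
posted	int	UTC timestamp of when the comment was created
clear_required	bool	Whether a clear is required to view the comment
text	string	Text of the comment (text comments)
reaction_image_id	int	Reaction image ID
custom_image	bytes	Custom image content (PNG binary)
has_beaten	int	Whether the user had beaten the level when creating the comment
x, y	int	Position of the comment in game
reaction_face	int	Reaction face of the user's Mii
unk8, unk10, unk12, unk14, unk17	int/bytes	Unknown fields (unk14 is bytes, the others are int)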
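Since `custom_image` is documented as a PNG binary, one way to extract drawings is to write the bytes straight to disk. The following is a minimal sketch under stated assumptions: the helper name `save_custom_image` and the file naming scheme are hypothetical, and because the card does not say what `custom_image` holds for non-drawing comments, the sketch checks both the `type` field and the standard 8-byte PNG signature before writing.

```python
from pathlib import Path

# Standard 8-byte signature that begins every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def save_custom_image(record, out_dir):
    """Hypothetical helper: write a custom-drawing comment to <out_dir>/<comment_id>.png.

    Returns the written path, or None if the record is not a custom image
    (type 0) or its payload does not look like a PNG.
    """
    if record["type"] != 0:  # 0 == "Custom Image" in the CommentType enum
        return None
    data = bytes(record.get("custom_image") or b"")
    if not data.startswith(PNG_SIGNATURE):
        return None
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{record['comment_id']}.png"
    path.write_bytes(data)
    return path
```

Given the content warning on the card, extracted images may warrant review before being displayed or redistributed.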

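Data usage

Loading: using the streaming API of the datasets library to load and iterate is recommended.
Download: the full dataset can be downloaded, about 20GB.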
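To illustrate why streaming suits a dataset of this size, the sketch below pulls only the first few non-empty text comments from any iterable of records and then stops, so nothing beyond what is consumed needs to be fetched. The function name `first_text_comments` is hypothetical; the commented usage lines assume the `datasets` package is installed and network access is available.

```python
from itertools import islice


def first_text_comments(records, limit=5):
    """Hypothetical helper: return up to `limit` non-empty text comments.

    `records` can be any iterable of comment dicts, including a streamed
    dataset, because it is consumed lazily and only as far as needed.
    """
    # type == 1 is "Text" in the CommentType enum.
    text_only = (r for r in records if r["type"] == 1 and r["text"])
    return list(islice(text_only, limit))


# Usage against the hosted dataset (assumes `datasets` is installed):
# from datasets import load_dataset
# ds = load_dataset("TheGreatRambler/mm2_level_comments", streaming=True, split="train")
# for comment in first_text_comments(ds, limit=5):
#     print(comment["text"])
```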

Dataset creation

Collection method: data was collected using the self-hosted Mario Maker 2 API.
Notes: collection had to be done carefully to avoid overloading Nintendo's servers.

Considerations for use

Content risk: comments may contain harmful language, and custom images may contain inappropriate content.
