TheGreatRambler/mm2_level_comments
Source: Hugging Face. Updated 2022-11-11; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/TheGreatRambler/mm2_level_comments
---
language:
- multilingual
license:
- cc-by-nc-sa-4.0
multilinguality:
- multilingual
size_categories:
- 10M<n<100M
source_datasets:
- original
task_categories:
- other
- object-detection
- text-retrieval
- token-classification
- text-generation
task_ids: []
pretty_name: Mario Maker 2 level comments
tags:
- text-mining
---
# Mario Maker 2 level comments
Part of the [Mario Maker 2 Dataset Collection](https://tgrcode.com/posts/mario_maker_2_datasets)
## Dataset Description
The Mario Maker 2 level comment dataset consists of 31.9 million level comments from Nintendo's online service, totaling around 20GB of data. The dataset was created using the self-hosted [Mario Maker 2 API](https://tgrcode.com/posts/mario_maker_2_api) over the course of one month in February 2022.
### How to use it
The Mario Maker 2 level comment dataset is very large, so for most use cases it is recommended to use the streaming API of `datasets`. You can load and iterate through the dataset with the following code:
```python
from datasets import load_dataset

# Stream records instead of downloading the full dataset first
ds = load_dataset("TheGreatRambler/mm2_level_comments", streaming=True, split="train")
print(next(iter(ds)))

# OUTPUT:
# {
#     'data_id': 3000006,
#     'comment_id': '20200430072710528979_302de3722145c7a2_2dc6c6',
#     'type': 2,
#     'pid': '3471680967096518562',
#     'posted': 1561652887,
#     'clear_required': 0,
#     'text': '',
#     'reaction_image_id': 10,
#     'custom_image': [some binary data],
#     'has_beaten': 0,
#     'x': 557,
#     'y': 64,
#     'reaction_face': 0,
#     'unk8': 0,
#     'unk10': 0,
#     'unk12': 0,
#     'unk14': [some binary data],
#     'unk17': 0
# }
```
Comments can be one of three types: text, reaction image or custom image. `type` can be used with the enum below to identify different kinds of comments. Custom images are binary PNGs.
You can also download the full dataset. Note that this will download ~20GB:
```python
from datasets import load_dataset
ds = load_dataset("TheGreatRambler/mm2_level_comments", split="train")
```
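As an illustration of working with the stream, the sketch below keeps only non-empty text comments. `iter_text_comments` is a hypothetical helper, not part of `datasets`; it accepts any iterable of record dictionaries, so it is demonstrated here on small hand-written records rather than the real stream. The value `1` for text comments is taken from the `CommentType` enum in the Enums section.

```python
from itertools import islice

TEXT_COMMENT = 1  # CommentType value for "Text", per the Enums section


def iter_text_comments(records):
    """Yield (data_id, text) for every text comment with non-empty text."""
    for record in records:
        if record["type"] == TEXT_COMMENT and record["text"]:
            yield record["data_id"], record["text"]


# Hand-written stand-ins for streamed records (only the fields used here);
# the second and third records are invented for illustration.
sample_records = [
    {"data_id": 3000006, "type": 2, "text": ""},
    {"data_id": 3000006, "type": 1, "text": "Nice level!"},
    {"data_id": 3000007, "type": 0, "text": ""},
]

# islice stops after the requested number of matches.
first_texts = list(islice(iter_text_comments(sample_records), 10))
print(first_texts)
```

With the real dataset, the streamed `ds` from the first example can be passed in place of `sample_records`; `islice` then stops reading once enough matches have been collected.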
## Data Structure
### Data Instances
```python
{
    'data_id': 3000006,
    'comment_id': '20200430072710528979_302de3722145c7a2_2dc6c6',
    'type': 2,
    'pid': '3471680967096518562',
    'posted': 1561652887,
    'clear_required': 0,
    'text': '',
    'reaction_image_id': 10,
    'custom_image': [some binary data],
    'has_beaten': 0,
    'x': 557,
    'y': 64,
    'reaction_face': 0,
    'unk8': 0,
    'unk10': 0,
    'unk12': 0,
    'unk14': [some binary data],
    'unk17': 0
}
```
### Data Fields
|Field|Type|Description|
|---|---|---|
|data_id|int|The data ID of the level this comment appears on|
|comment_id|string|Comment ID|
|type|int|Type of comment, enum below|
|pid|string|Player ID of the comment creator|
|posted|int|UTC timestamp of when this comment was created|
|clear_required|bool|Whether this comment requires a clear to view|
|text|string|If the comment type is text, the text of the comment|
|reaction_image_id|int|If this comment is a reaction image, the id of the reaction image, enum below|
|custom_image|bytes|If this comment is a custom drawing, the custom drawing as a PNG binary|
|has_beaten|int|Whether the user had beaten the level when they created the comment|
|x|int|The X position of the comment in game|
|y|int|The Y position of the comment in game|
|reaction_face|int|The reaction face of this user's Mii, enum below|
|unk8|int|Unknown|
|unk10|int|Unknown|
|unk12|int|Unknown|
|unk14|bytes|Unknown|
|unk17|int|Unknown|
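
Two of the fields above usually need a small conversion before use. The following is a minimal sketch, assuming `posted` is a Unix timestamp in seconds (the sample value is consistent with that) and that `custom_image` holds complete PNG file bytes when present. `posted_to_datetime` and `looks_like_png` are hypothetical helper names introduced for illustration; what `custom_image` contains for non-drawing comments is not documented here, so the check simply returns `False` for anything that does not start with the PNG signature.

```python
from datetime import datetime, timezone

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def posted_to_datetime(posted):
    """Convert the `posted` Unix timestamp (seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(posted, tz=timezone.utc)


def looks_like_png(data):
    """Return True if `data` is bytes-like and starts with the PNG signature."""
    return isinstance(data, (bytes, bytearray)) and bytes(data[:8]) == PNG_SIGNATURE


# `posted` value taken from the sample record above.
print(posted_to_datetime(1561652887).isoformat())  # 2019-06-27T16:28:07+00:00

# Synthetic byte strings, not real dataset content.
print(looks_like_png(PNG_SIGNATURE + b"rest of file"))  # True
print(looks_like_png(b""))  # False
```

Bytes that pass the signature check can be written to a `.png` file or handed to an image library for decoding.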
### Data Splits
The dataset only contains a train split.
## Enums
The dataset contains some integer fields that encode enums. The mappings below can be used to convert them back to their string equivalents:
```python
CommentType = {
    0: "Custom Image",
    1: "Text",
    2: "Reaction Image"
}

CommentReactionImage = {
    0: "Nice!",
    1: "Good stuff!",
    2: "So tough...",
    3: "EASY",
    4: "Seriously?!",
    5: "Wow!",
    6: "Cool idea!",
    7: "SPEEDRUN!",
    8: "How?!",
    9: "Be careful!",
    10: "So close!",
    11: "Beat it!"
}

CommentReactionFace = {
    0: "Normal",
    16: "Wink",
    1: "Happy",
    4: "Surprised",
    18: "Scared",
    3: "Confused"
}
```
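As a usage sketch, the mappings can be applied to a record like the sample shown under Data Instances. `describe_comment` is a hypothetical helper introduced here for illustration; the enum tables are copied from the block above so the example runs on its own. It uses `.get` with an `"Unknown"` fallback because this card does not state whether the tables cover every value present in the data.

```python
CommentType = {0: "Custom Image", 1: "Text", 2: "Reaction Image"}

CommentReactionImage = {
    0: "Nice!", 1: "Good stuff!", 2: "So tough...", 3: "EASY",
    4: "Seriously?!", 5: "Wow!", 6: "Cool idea!", 7: "SPEEDRUN!",
    8: "How?!", 9: "Be careful!", 10: "So close!", 11: "Beat it!",
}

CommentReactionFace = {
    0: "Normal", 16: "Wink", 1: "Happy",
    4: "Surprised", 18: "Scared", 3: "Confused",
}


def describe_comment(record):
    """Translate the enum integer fields of one record into readable labels."""
    kind = CommentType.get(record["type"], "Unknown")
    # reaction_image_id is only meaningful for reaction-image comments.
    reaction = None
    if kind == "Reaction Image":
        reaction = CommentReactionImage.get(record["reaction_image_id"], "Unknown")
    face = CommentReactionFace.get(record["reaction_face"], "Unknown")
    return {"type": kind, "reaction_image": reaction, "reaction_face": face}


# Enum fields from the sample record in the Data Instances section.
sample = {"type": 2, "reaction_image_id": 10, "reaction_face": 0}
print(describe_comment(sample))
```

For the sample record this yields a "Reaction Image" comment with the "So close!" reaction and a "Normal" face.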
<!-- TODO create detailed statistics -->
## Dataset Creation
The dataset was created over a little more than a month in February 2022 using the self-hosted [Mario Maker 2 API](https://tgrcode.com/posts/mario_maker_2_api). Because requests made to Nintendo's servers require authentication, the process had to be done with the utmost care, limiting download speed so as not to overload the API and risk a ban. There are no plans to create an updated release of this dataset.
## Considerations for Using the Data
The dataset consists of comments from many different Mario Maker 2 players globally and as such their text could contain harmful language. Harmful depictions could also be present in the custom images.
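Provider: TheGreatRambler
## Summary of Original Information
### Dataset Overview
Basic information:
- Name: Mario Maker 2 level comments
- Language: multilingual
- License: cc-by-nc-sa-4.0
- Multilinguality: multilingual
- Size: 10M<n<100M
- Source: original data
- Task categories: other, object detection, text retrieval, token classification, text generation
- Tags: text mining
### Dataset Description
- Content: 31.9 million level comments from Nintendo's online service, totaling around 20GB of data.
- Collection period: February 2022, lasting about one month.
### Data Structure
- Data instance: `{ data_id: int, comment_id: string, type: int, pid: string, posted: int, clear_required: bool, text: string, reaction_image_id: int, custom_image: bytes, has_beaten: int, x: int, y: int, reaction_face: int, unk8: int, unk10: int, unk12: int, unk14: bytes, unk17: int }`
- Data fields:

|Field|Type|Description|
|---|---|---|
|data_id|int|Level data ID|
|comment_id|string|Comment ID|
|type|int|Comment type|
|pid|string|Player ID of the comment creator|
|posted|int|UTC timestamp of when the comment was created|
|clear_required|bool|Whether a clear is required to view the comment|
|text|string|Text comment content|
|reaction_image_id|int|Reaction image ID|
|custom_image|bytes|Custom image content|
|has_beaten|int|Whether the user had beaten the level when creating the comment|
|x, y|int|Position of the comment in game|
|reaction_face|int|Reaction face of the user's Mii|
|unk8, unk10, unk12, unk14, unk17|int/bytes|Unknown fields|
### Data Usage
- Loading: the streaming API of the `datasets` library is recommended for loading and iterating.
- Download: the full dataset can be downloaded (about 20GB).
### Dataset Creation
- Collection method: data was collected using the self-hosted Mario Maker 2 API.
- Note: care was needed during collection to avoid overloading Nintendo's servers.
### Considerations for Use
- Content risk: comments may contain harmful language, and custom images may contain inappropriate content.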



