VTuber 1B: Live Chat and Moderation Statistics
Source: www.kaggle.com (updated 2022-08-04, indexed 2025-03-25)
Download link: https://www.kaggle.com/uetchy/vtuber-livechat
**VTuber 1B** is a dataset for large-scale academic research, collecting over a billion live chats, superchats, and moderation events (bans/deletions) from virtual YouTubers' live streams.
See [GitHub](https://github.com/sigvt/vtuber-livechat-dataset) and join the `#livechat-dataset` channel on [SIGVT Discord](https://sigvt.org/discord) for discussions.
> We also offer [❤️‍🩹 Sensai](https://github.com/sigvt/sensai-dataset), a live chat dataset specifically made for building ML models for spam detection / toxic chat classification.
## Provenance
- **Source:** YouTube live chat events collected by our [Honeybee](https://github.com/sigvt/honeybee) cluster. [Holodex](https://holodex.net) is a stream index provider for Honeybee which covers Hololive, Nijisanji, 774inc, etc.
- **Temporal Coverage:**
  - Chats: from 2021-01-15
  - Super chats: from 2021-03-16
- **Update Frequency:**
  - At least once every 6 months
## Research Ideas
- Toxic Chat Classification
- Spam Detection
- Demographic Visualization
- Superchat Analysis
- Training neural language models
See public notebooks built on [VTuber 1B](https://www.kaggle.com/uetchy/vtuber-livechat/code) and [VTuber 1B Elements](https://www.kaggle.com/uetchy/vtuber-livechat-elements/code) for ideas.
## Data Collection
We employed the [Honeybee](https://github.com/sigvt/honeybee) cluster to collect real-time live chat events across major VTubers' live streams. All sensitive data, such as author names and author profile images, are omitted from the dataset, and author channel IDs are anonymized with the SHA-1 hashing algorithm and an undisclosed salt.
## Editions
### VTuber 1B Elements
[Kaggle Datasets](https://www.kaggle.com/uetchy/vtuber-livechat-elements) (2 MB)
VTuber 1B Elements is most suitable for statistical visualizations and exploratory data analysis.
| filename | summary |
| --------------------- | --------------------- |
| `channels.csv` | Channel index |
| `chat_stats.csv` | Chat statistics |
| `superchat_stats.csv` | Super Chat statistics |
### VTuber 1B
[Kaggle Datasets](https://www.kaggle.com/uetchy/vtuber-livechat)
VTuber 1B is most suitable for frequency analysis. This edition includes only the essential columns, which reduces the dataset size and makes it faster for Kaggle Kernels to load.
| filename | summary |
| -------------------------- | ---------------------------------- |
| `chats_%Y-%m.parquet` | Live chat events (> 1,000,000,000) |
| `superchats_%Y-%m.parquet` | Super chat events (> 4,000,000) |
| `deletion_events.parquet` | Deletion events |
| `ban_events.parquet` | Ban events |
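The `%Y-%m` placeholder in the monthly file names appears to follow `strftime` conventions (four-digit year, two-digit month, as in `chats_2021-03.parquet`). A minimal sketch for listing the expected file names over a range of months; the helper name `monthly_files` is hypothetical and not part of the dataset tooling:

```python
from datetime import date
from typing import List


def monthly_files(prefix: str, start: date, end: date) -> List[str]:
    """List `<prefix>_%Y-%m.parquet` names for every month from start to end, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # date.__format__ accepts strftime directives such as %Y and %m
        names.append(f"{prefix}_{date(year, month, 1):%Y-%m}.parquet")
        # Advance one month, rolling over to January of the next year
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


chat_files = monthly_files("chats", date(2021, 11, 1), date(2022, 2, 1))
```

Whether a file actually exists for a given month depends on the dataset's temporal coverage, so check the result against the directory listing before loading.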
## Dataset Breakdown
> Ban and deletion are equivalent to `markChatItemsByAuthorAsDeletedAction` and `markChatItemAsDeletedAction` respectively.
### Chats (`chats_%Y-%m.parquet`)
| column | type | description | in standard version |
| --------------- | ---------------- | --------------------------- | --------------------- |
| timestamp | string | ISO 8601 UTC timestamp | limited accuracy |
| id | string | chat id | N/A |
| authorName | string | author name | N/A |
| authorChannelId | string | author channel id | anonymized |
| body | string | chat message | N/A |
| bodyLength | number | chat message length | standard version only |
| membership | string | membership status | N/A |
| isMember | nullable boolean | is member (null if unknown) | standard version only |
| isModerator | boolean | is channel moderator | N/A |
| isVerified | boolean | is verified account | N/A |
| videoId | string | source video id | |
| channelId | string | source channel id | |
#### Membership status
| value | duration |
| ---------- | ------------------------- |
| unknown | Indistinguishable |
| non-member | 0 |
| new | < 1 month |
| 1 month | >= 1 month, < 2 months |
| 2 months | >= 2 months, < 6 months |
| 6 months | >= 6 months, < 12 months |
| 1 year | >= 12 months, < 24 months |
| 2 years | >= 24 months |
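The table above can be turned into a small lookup, for example to bucket chats by the minimum membership tenure. This is only a sketch: the names `MIN_MONTHS`, `min_membership_months`, and `is_member` are hypothetical, and deriving a nullable member flag from `membership` this way is an assumption about how the two columns relate, not the dataset's documented procedure.

```python
from typing import Optional

# Lower bound of each membership bucket in months, taken from the table above.
# 'unknown' is deliberately absent because its duration is indistinguishable.
MIN_MONTHS = {
    "non-member": 0,
    "new": 0,
    "1 month": 1,
    "2 months": 2,
    "6 months": 6,
    "1 year": 12,
    "2 years": 24,
}


def min_membership_months(status: str) -> Optional[int]:
    """Return the minimum tenure in months, or None for 'unknown' or unrecognized values."""
    return MIN_MONTHS.get(status)


def is_member(status: str) -> Optional[bool]:
    """Return True/False when membership is known, None when it is not."""
    if status not in MIN_MONTHS:
        return None
    return status != "non-member"
```

With a DataFrame, the same mapping could be applied through `Series.map`, for instance `chats['membership'].map(MIN_MONTHS)`.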
#### Pandas usage
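When reading a CSV export with `read_csv`, set `keep_default_na` to `False` and `na_values` to `''`; otherwise, a chat message such as `NA` would be incorrectly parsed as a NaN value. Parquet files store missing values explicitly, so `read_parquet` needs no such options:
```python
import pandas as pd

chats = pd.read_parquet('../input/vtuber-livechat/chats_2021-03.parquet')
```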
### Superchats (`superchats_%Y-%m.parquet`)
| column | type | description | in standard version |
| --------------- | --------------- | ---------------------------- | ------------------- |
| timestamp | string | ISO 8601 UTC timestamp | limited accuracy |
| id | string | chat id | N/A |
| authorName | string | author name | N/A |
| authorChannelId | string | author channel id | anonymized |
| body | nullable string | chat message | N/A |
| amount | number | purchased amount | |
| currency | string | three-letter currency symbol | |
| color | string | color | N/A |
| significance | number | significance | |
| videoId | string | source video id | N/A |
| channelId | string | source channel id | |
#### Color and Significance
| color | significance | purchase amount (¥) | purchase amount ($) | max. message length |
| --------- | ------------ | ------------------- | ------------------- | ------------------- |
| blue | 1 | ¥ 100 - 199 | $ 1.00 - 1.99 | 0 |
| lightblue | 2 | ¥ 200 - 499 | $ 2.00 - 4.99 | 50 |
| green | 3 | ¥ 500 - 999 | $ 5.00 - 9.99 | 150 |
| yellow | 4 | ¥ 1000 - 1999 | $ 10.00 - 19.99 | 200 |
| orange | 5 | ¥ 2000 - 4999 | $ 20.00 - 49.99 | 225 |
| magenta | 6 | ¥ 5000 - 9999 | $ 50.00 - 99.99 | 250 |
| red | 7 | ¥ 10000 - 50000 | $ 100.00 - 500.00 | 270 - 350 |
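As an illustration of how the tiers above partition the yen amounts, the following sketch looks up `significance` and `color` for a JPY amount. The function name `tier_for_jpy` is hypothetical, only the ¥ column is covered, and returning `None` outside the ¥100 to ¥50000 range is an assumption made for the example.

```python
from bisect import bisect_right
from typing import Optional, Tuple

# Lower bound (in JPY) of each tier, in ascending order, copied from the table above
JPY_LOWER_BOUNDS = [100, 200, 500, 1000, 2000, 5000, 10000]
COLORS = ["blue", "lightblue", "green", "yellow", "orange", "magenta", "red"]
JPY_MAX = 50000


def tier_for_jpy(amount: float) -> Optional[Tuple[int, str]]:
    """Return (significance, color) for a JPY amount, or None if it is outside the table."""
    if amount < JPY_LOWER_BOUNDS[0] or amount > JPY_MAX:
        return None
    # Number of lower bounds that are <= amount equals the 1-based significance
    significance = bisect_right(JPY_LOWER_BOUNDS, amount)
    return significance, COLORS[significance - 1]
```

For example, `tier_for_jpy(1500)` falls in the ¥1000 to ¥1999 row and yields `(4, 'yellow')`.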
#### Pandas usage
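```python
import pandas as pd
from glob import iglob

# Load every monthly superchat file and stack them into a single DataFrame
sc = pd.concat(
    [
        pd.read_parquet(f)
        for f in iglob('../input/vtuber-livechat/superchats_*.parquet')
    ],
    ignore_index=False)
# Order rows chronologically; `by` accepts a column label or an index level name
sc.sort_values('timestamp', inplace=True)
```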
### Deletion Events (`deletion_events.parquet`)
| column | type | description | in standard version |
| --------- | ------- | ---------------------------- | ------------------- |
| timestamp | string | UTC timestamp | |
| id | string | chat id | |
| retracted | boolean | is deleted by author oneself | |
| videoId | string | source video id | |
| channelId | string | source channel id | |
#### Pandas usage
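Add a `deleted_by_mod` column to the `chats` DataFrame:
```python
import pandas as pd

chats = pd.read_parquet('../input/vtuber-livechat/chats_2021-03.parquet')
delet = pd.read_parquet('../input/vtuber-livechat/deletion_events.parquet')
# Keep moderator deletions only, one row per chat id so the merge cannot duplicate chats
delet = delet[delet['retracted'] == 0].drop_duplicates('id').copy()
delet['deleted_by_mod'] = True
chats = pd.merge(chats, delet[['id', 'deleted_by_mod']], on='id', how='left')
# Chats without a matching deletion get NaN; notna() yields a plain boolean column
chats['deleted_by_mod'] = chats['deleted_by_mod'].notna()
```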
### Ban Events (`ban_events.parquet`)
Here, **ban** means either placing a user in timeout or permanently hiding the user's comments on the channel's current and future live streams. The two actions are grouped together because the data extracted from the `markChatItemsByAuthorAsDeletedAction` event does not distinguish between them.
| column | type | description | in standard version |
| --------------- | ------ | ----------------- | ------------------- |
| timestamp | string | UTC timestamp | |
| authorChannelId | string | channel id | anonymized |
| videoId | string | source video id | |
| channelId | string | source channel id | |
#### Pandas usage
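Add a `banned` column to the `chats` DataFrame:
```python
import pandas as pd

chats = pd.read_parquet('../input/vtuber-livechat/chats_2021-03.parquet')
ban = pd.read_parquet('../input/vtuber-livechat/ban_events.parquet')
# Keep only the join keys, one row per (author, video), so the merge adds no duplicate rows or columns
ban = ban[['authorChannelId', 'videoId']].drop_duplicates()
ban['banned'] = True
chats = pd.merge(chats, ban, on=['authorChannelId', 'videoId'], how='left')
# Authors without a matching ban get NaN; notna() yields a plain boolean column
chats['banned'] = chats['banned'].notna()
```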
## Considerations
### Anonymization
`id` and `authorChannelId` are anonymized by SHA-1 hashing algorithm with a pinch of undisclosed salt.
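To illustrate what salted SHA-1 anonymization means in practice, here is a minimal sketch using Python's `hashlib`. The salt value and the way it is combined with the identifier (prepended here) are assumptions for the example; the dataset's actual salt is undisclosed, so these hashes will not match the published ones.

```python
import hashlib


def anonymize(identifier: str, salt: bytes) -> str:
    """Return the hex SHA-1 digest of salt + identifier (illustrative scheme only)."""
    return hashlib.sha1(salt + identifier.encode("utf-8")).hexdigest()


# Hypothetical salt and channel id, for demonstration only
example_salt = b"not-the-real-salt"
hashed = anonymize("UCxxxxxxxxxxxxxxxxxxxxxx", example_salt)
```

Because the same input and salt always give the same digest, an anonymized `authorChannelId` can still be used as a join key across the chat, superchat, and ban tables, while the original channel id cannot be looked up without knowing the salt.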
### Handling Custom Emojis
All custom emojis are replaced with a Unicode replacement character � (`U+FFFD`).
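For editions that include the message `body`, the replacement character can be counted or stripped before text analysis. A small sketch; the helper names are hypothetical, and the count is an upper bound on custom emojis because `U+FFFD` can also result from undecodable input:

```python
REPLACEMENT_CHAR = "\ufffd"  # Unicode replacement character


def count_custom_emojis(body: str) -> int:
    """Count occurrences of U+FFFD, each standing in for one custom emoji."""
    return body.count(REPLACEMENT_CHAR)


def strip_custom_emojis(body: str) -> str:
    """Remove every U+FFFD placeholder from the message."""
    return body.replace(REPLACEMENT_CHAR, "")


sample = "kusa \ufffd\ufffd www"
emoji_count = count_custom_emojis(sample)
cleaned = strip_custom_emojis(sample)
```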
### Redundant Ban and Deletion Events
Bans and deletions issued by multiple moderators against the same user or chat are logged separately. For simplicity, you can safely ignore all but the earliest event recorded for each user or chat.
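One way to keep only the earliest event per key is to sort by `timestamp` and drop later duplicates. The sketch below uses toy data and a hypothetical helper `first_events`; it assumes timestamps are either datetimes or uniformly formatted ISO 8601 strings, so that sorting them yields chronological order.

```python
import pandas as pd
from typing import List


def first_events(events: pd.DataFrame, key: List[str]) -> pd.DataFrame:
    """Keep the earliest event for each combination of key columns."""
    return (events.sort_values("timestamp", kind="stable")
                  .drop_duplicates(subset=key, keep="first")
                  .reset_index(drop=True))


# Toy ban events: author 'a' was banned twice on the same video by different moderators
bans = pd.DataFrame({
    "timestamp": ["2021-03-01T00:00:05Z", "2021-03-01T00:00:01Z", "2021-03-01T00:00:09Z"],
    "authorChannelId": ["a", "a", "b"],
    "videoId": ["v1", "v1", "v1"],
    "channelId": ["c1", "c1", "c1"],
})
unique_bans = first_events(bans, ["authorChannelId", "videoId"])
```

For deletion events the key would be `['id']` instead, since each deletion targets a single chat.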
## Citation
```latex
@misc{vtuber-livechat-dataset,
  author={Yasuaki Uechi},
  title={VTuber 1B: Large-scale Live Chat and Moderation Events Dataset},
  year={2022},
  month={2},
  version={37},
  url={https://sigvt.org/vtuber-1b}
}
```
## License
- Code: [MIT License](https://github.com/sigvt/vtuber-livechat-dataset/blob/master/LICENSE)
- Dataset: [ODC Public Domain Dedication and Licence (PDDL)](https://opendatacommons.org/licenses/pddl/1-0/index.html)
Provider: Kaggle