
holodata/sensai

Hugging Face · Updated 2021-11-01 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/holodata/sensai
Resource description:
# ❤️‍🩹 Sensai: Toxic Chat Dataset

Sensai is a toxic chat dataset consisting of live chats from Virtual YouTubers' live streams.

Download the dataset from [Kaggle Datasets](https://www.kaggle.com/uetchy/sensai) and join the `#livechat-dataset` channel on [holodata Discord](https://holodata.org/discord) for discussions.

## Provenance

- **Source:** YouTube Live Chat events (all streams covered by [Holodex](https://holodex.net), including Hololive, Nijisanji, 774inc, etc.)
- **Temporal Coverage:** From 2021-01-15T05:15:33Z
- **Update Frequency:** At least once per month

## Research Ideas

- Toxic Chat Classification
- Spam Detection
- Sentence Transformer for Live Chats

See [public notebooks](https://www.kaggle.com/uetchy/sensai/code) for ideas.

## Files

| filename                  | summary                                                        | size     |
| ------------------------- | -------------------------------------------------------------- | -------- |
| `chats_flagged_%Y-%m.csv` | Chats flagged as either deleted or banned by mods (3,100,000+) | ~ 400 MB |
| `chats_nonflag_%Y-%m.csv` | Non-flagged chats (3,100,000+)                                 | ~ 300 MB |

To keep the dataset balanced, the non-flagged chats (`chats_nonflag`) are randomly sampled so that their number matches the number of flagged chats (`chats_flagged`).

Ban and deletion are equivalent to `markChatItemsByAuthorAsDeletedAction` and `markChatItemAsDeletedAction` respectively.
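
The balancing step described above can be reproduced on the consumer side when the two file groups are merged into one training table. The following is a minimal sketch, not the maintainers' pipeline: the `label` column, the `build_balanced` helper, and the `seed` parameter are illustrative assumptions and are not part of the dataset.

```python
import pandas as pd


def build_balanced(flagged: pd.DataFrame, nonflag: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Combine flagged and non-flagged chats into one labeled, balanced frame.

    The `label` column (1 = flagged, 0 = non-flagged) is a hypothetical
    addition for classification; the published CSV files do not contain it.
    """
    # Use the size of the smaller class so both classes end up equal.
    n = min(len(flagged), len(nonflag))
    pos = flagged.sample(n=n, random_state=seed).assign(label=1)
    neg = nonflag.sample(n=n, random_state=seed).assign(label=0)
    # Concatenate, then shuffle so the classes are interleaved.
    combined = pd.concat([pos, neg], ignore_index=True)
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)
```

Because the published files are already sampled to equal size, the downsampling is usually a no-op; it only matters when a subset of months is loaded and the counts drift apart.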
## Dataset Breakdown

### Chats (`chats_%Y-%m.csv`)

| column          | type   | description                  |
| --------------- | ------ | ---------------------------- |
| body            | string | chat message                 |
| membership      | string | membership status            |
| authorChannelId | string | anonymized author channel id |
| channelId       | string | source channel id            |

#### Membership status

| value             | duration                  |
| ----------------- | ------------------------- |
| unknown           | Indistinguishable         |
| non-member        | 0                         |
| less than 1 month | < 1 month                 |
| 1 month           | >= 1 month, < 2 months    |
| 2 months          | >= 2 months, < 6 months   |
| 6 months          | >= 6 months, < 12 months  |
| 1 year            | >= 12 months, < 24 months |
| 2 years           | >= 24 months              |

#### Pandas usage

Set `keep_default_na` to `False` and `na_values` to `''` in `read_csv`. Otherwise, a chat message such as `NA` would incorrectly be treated as a NaN value.

```python
import pandas as pd
from glob import iglob

# Load every monthly flagged-chat file into a single DataFrame.
# keep_default_na=False stops pandas from turning bodies like "NA" into NaN.
flagged = pd.concat([
    pd.read_csv(f, na_values='', keep_default_na=False)
    for f in iglob('../input/sensai/chats_flagged_*.csv')
], ignore_index=True)
```

## Consideration

### Anonymization

`authorChannelId` values are anonymized with the SHA-1 hashing algorithm and a pinch of undisclosed salt.

### Handling Custom Emojis

All custom emojis are replaced with the Unicode replacement character `U+FFFD`.

## Citation

```latex
@misc{sensai-dataset,
  author={Yasuaki Uechi},
  title={Sensai: Toxic Chat Dataset},
  year={2021},
  month={8},
  version={31},
  url={https://github.com/holodata/sensai-dataset}
}
```

## License

- Code: [MIT License](https://github.com/holodata/sensai-dataset/blob/master/LICENSE)
- Dataset: [ODC Public Domain Dedication and Licence (PDDL)](https://opendatacommons.org/licenses/pddl/1-0/index.html)
Provider:
holodata
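Summary of original information

Dataset overview

Dataset name

Sensai: Toxic Chat Dataset

Dataset source

  • Source data: YouTube Live Chat events covering all live streams tracked by Holodex, including Hololive, Nijisanji, 774inc, and others.
  • Temporal coverage: from 2021-01-15T05:15:33Z.
  • Update frequency: at least once per month.

Dataset contents

File details

| Filename | Summary | Size |
| --- | --- | --- |
| chats_flagged_%Y-%m.csv | Chats flagged as deleted or banned by moderators (3,100,000+) | ~400 MB |
| chats_nonflag_%Y-%m.csv | Non-flagged chats (3,100,000+) | ~300 MB |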

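Dataset structure

Chats (chats_%Y-%m.csv)

| Column | Type | Description |
| --- | --- | --- |
| body | string | Chat message |
| membership | string | Membership status |
| authorChannelId | string | Anonymized author channel ID |
| channelId | string | Source channel ID |

Membership status

| Value | Duration |
| --- | --- |
| unknown | Indistinguishable |
| non-member | 0 |
| less than 1 month | < 1 month |
| 1 month | >= 1 month, < 2 months |
| 2 months | >= 2 months, < 6 months |
| 6 months | >= 6 months, < 12 months |
| 1 year | >= 12 months, < 24 months |
| 2 years | >= 24 months |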
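
Since the membership values are ordered duration buckets, they can be turned into a numeric feature. The sketch below shows one way to do this; the rank values, the `MEMBERSHIP_RANK` table, and the `membership_rank` helper are assumptions made for illustration, and `unknown` is mapped to -1 only as a placeholder convention.

```python
import pandas as pd

# Hypothetical ordinal encoding of the membership buckets listed above.
# -1 marks "unknown" (indistinguishable); larger values mean longer membership.
MEMBERSHIP_RANK = {
    "unknown": -1,
    "non-member": 0,
    "less than 1 month": 1,
    "1 month": 2,
    "2 months": 3,
    "6 months": 4,
    "1 year": 5,
    "2 years": 6,
}


def membership_rank(membership: pd.Series) -> pd.Series:
    """Map membership strings to ordinal ranks; unlisted values become NaN."""
    return membership.map(MEMBERSHIP_RANK)
```

An ordinal rank is used here rather than the lower bound in months because `non-member` and `less than 1 month` would otherwise both collapse to 0.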

Dataset processing

Anonymization

  • authorChannelId is anonymized with the SHA-1 hashing algorithm plus an undisclosed salt.
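
To illustrate what salted SHA-1 anonymization looks like in general, here is a minimal sketch. The actual salt is undisclosed and the way the dataset combines salt and ID is not documented, so the prefix concatenation, the `anonymize_channel_id` helper, and the example values are all assumptions.

```python
import hashlib


def anonymize_channel_id(channel_id: str, salt: str) -> str:
    """Return a hex SHA-1 digest of salt + channel_id.

    Assumption: the salt is prepended to the ID. The real dataset may combine
    them differently, so these digests will not match published values.
    """
    return hashlib.sha1((salt + channel_id).encode("utf-8")).hexdigest()


# The same input always yields the same digest, so chats can still be grouped
# by author even though the original channel ID is hidden.
example = anonymize_channel_id("hypothetical-channel-id", salt="hypothetical-salt")
```

The practical consequence for analysis is that `authorChannelId` remains a stable key for per-author aggregation but cannot be joined back to real YouTube channels.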

Handling custom emojis

  • All custom emojis are replaced with the Unicode replacement character U+FFFD.
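
Because every custom emoji becomes the same placeholder character, it can be counted or stripped before tokenization. The helpers below are a small sketch; `count_custom_emojis` and `strip_custom_emojis` are hypothetical names. Note that U+FFFD can also appear for other reasons, such as decoding errors, so the count is only an approximation of emoji usage.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, used in place of custom emojis


def count_custom_emojis(body: str) -> int:
    """Count replacement characters in a chat message."""
    return body.count(REPLACEMENT_CHAR)


def strip_custom_emojis(body: str) -> str:
    """Remove replacement characters and collapse the leftover whitespace."""
    return " ".join(body.replace(REPLACEMENT_CHAR, " ").split())
```

For example, a message made up entirely of custom emojis strips down to an empty string, which may be worth filtering out or keeping as a spam signal, depending on the task.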

License
