Douyin Cyberbullying Comment Dataset
National Basic Science Data Center, indexed 2026-01-30
Download link:
https://nbsdc.cn/general/dataDetail?id=683dea39195d261233189834&type=1
Resource description:
The Douyin Cyberbullying Comment Dataset is designed to support the construction and validation of information dissemination systems. It totals 125 MB and is stored in CSV format. All comments come from the Douyin platform and were collected by targeted crawling with a self-developed web crawler, in strict compliance with Douyin's terms of use and applicable laws and regulations. Collection focuses on comments that exhibit cyberbullying, including but not limited to insults, personal attacks, and malicious slander. To keep the data diverse and representative, the comments span videos on different topics, different user groups, and different posting periods. Each sample includes key fields such as the comment text, posting time, commenter ID, and the ID of the corresponding video. The comment text is cleaned by removing non-text content such as HTML tags, special symbols, and emojis, and then preprocessed with word segmentation and stopword removal to ease later text analysis and feature extraction. The processed data supports research on the propagation mechanisms, feature identification, and intervention strategies of cyberbullying.
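The cleaning and preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the dataset authors' actual pipeline: the function names (`clean_comment`, `preprocess`), the character ranges treated as "text", and the whitespace tokenizer are all assumptions made for the example. For real Chinese word segmentation, a dedicated segmenter (for example `jieba.lcut` from the third-party jieba package) could be passed in as the `tokenize` argument.

```python
import html
import re
from typing import Callable, Iterable, List

# Anything that looks like an HTML tag, e.g. <b> or </span>.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: only CJK ideographs, ASCII letters and digits count as text.
# Every other run of characters (punctuation, symbols, emojis, spaces)
# is collapsed into a single space.
NON_TEXT_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def clean_comment(raw: str) -> str:
    """Strip HTML tags, HTML entities, special symbols and emojis."""
    text = TAG_RE.sub(" ", raw)       # drop tags
    text = html.unescape(text)        # turn &amp; etc. into plain characters
    text = NON_TEXT_RE.sub(" ", text)  # drop symbols and emojis
    return text.strip()


def whitespace_tokenize(text: str) -> List[str]:
    """Placeholder tokenizer; it does NOT segment Chinese words."""
    return text.split()


def preprocess(
    raw: str,
    stopwords: Iterable[str],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> List[str]:
    """Clean a comment, tokenize it, and remove stopwords."""
    stop = set(stopwords)
    return [tok for tok in tokenize(clean_comment(raw)) if tok not in stop]


if __name__ == "__main__":
    sample = "<b>你 真 笨</b> 😡!!! &amp; go away"
    print(clean_comment(sample))
    print(preprocess(sample, stopwords={"真", "go"}))
```

Keeping the tokenizer as a parameter separates the cleaning rules from the segmentation method, so the same function can be reused with whichever segmenter and stopword list a study requires.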
Provider:
Beijing Institute of Technology
Collected summary
Dataset introduction

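Background and challenges
Background overview
The Douyin Cyberbullying Comment Dataset is a CSV-format collection of cyberbullying comments from the Douyin platform, 125 MB in total. It covers insults, personal attacks, and similar remarks, and is intended to support research on the propagation mechanisms of cyberbullying and on intervention strategies. The data has been cleaned and preprocessed, and includes key fields such as the comment text and posting time.
The above content was collected and summarized by 遇见数据集.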



