A dataset of 26,935 comments on YouTube videos related to Hanfu

Source: DataCite Commons | Updated: 2025-04-14 | Indexed: 2025-04-16
Download link:
https://data.mendeley.com/datasets/5p8yvmcf5t
Description:
The research data were extracted using Python and the YouTube Data API provided by Google. A total of 1,100 Hanfu-related video samples were collected using the keyword "Chinese Hanfu". This keyword was selected after preliminary research because it was found to be the most direct and widely used term relevant to the focus of this study. In addition, an analysis of search trend charts on YouTube indicated that "Chinese Hanfu" has experienced a notable surge in search volume, which may be associated with China’s visa-free entry policy introduced for multiple countries in 2024.

To balance data quality and collection efficiency, an exponential-logarithmic sampling formula was employed. The number of comments sampled (C) was calculated according to the following equation:

C = 50 × log(Video Views) + 10

To address the large variance in video popularity, the logarithm of video views was applied to compress the influence of high view counts. A coefficient of 50 and a constant of 10 were used to adjust the overall sampling scale and to ensure adequate comment coverage for videos with lower view counts. Specifically, for videos with fewer than 1,000 views, all comments were collected. For videos with more than 1,000 views, the number of sampled comments was determined using the formula above. Compared with simple random sampling, this approach helps reduce data redundancy while balancing the representativeness of both high-traffic and niche videos, providing a robust and efficient foundation for trend analysis.

All selected videos were published before 31 December 2024, the end date of data collection. No fixed start date was set, as the purpose of this study was to support subsequent time-series analyses of keyword trends. To capture long-term dynamics and patterns, videos were sampled across an extended time frame. This design ensured both the breadth and continuity of the dataset, facilitating the exploration of long-term behavioral patterns and structural regularities across a broad temporal spectrum.

The following ten categories of metadata were retrieved using a web crawler: video URL, upload date, title, textual description, number of views, number of comments, content of each comment, number of likes per comment, timestamp of each comment, and total number of comments per video. In total, 26,935 valid comments were collected.
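
The sampling rule in the description can be sketched in Python as follows. This is an illustration only, not the authors' code. The description does not state the base of the logarithm, how fractional results were rounded, or how a video with exactly 1,000 views was handled. The sketch assumes base 10, rounding to the nearest integer, and applying the formula at 1,000 views and above. The function name comment_quota and its parameters are hypothetical.

```python
import math


def comment_quota(views, total_comments, log=math.log10, threshold=1000):
    """Return how many comments to sample from one video.

    Assumptions (not stated in the dataset description):
    - the logarithm is base 10 (pass log=math.log for the natural log),
    - the result is rounded to the nearest integer,
    - the quota never exceeds the comments actually available.
    """
    if views < threshold:
        # Low-view videos: keep every comment.
        return total_comments
    quota = 50 * log(views) + 10
    return min(total_comments, round(quota))


# A video with 1,000,000 views and 5,000 comments.
print(comment_quota(1_000_000, 5000))
# A video with 500 views and 42 comments keeps all 42.
print(comment_quota(500, 42))
```

Under the base-10 assumption, a video with one million views gets 50 × 6 + 10 = 310 sampled comments, while a natural logarithm would give roughly 701. The choice of base therefore changes the sampling scale noticeably and should be confirmed against the dataset itself.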

Provider:
Mendeley Data
Created:
2025-04-14