A dataset of 26,935 comments on YouTube videos related to Hanfu
Source: DataCite Commons | Updated: 2025-04-14 | Indexed: 2025-04-16
Download link:
https://data.mendeley.com/datasets/5p8yvmcf5t/1
Description:
The research data were extracted using Python and the YouTube Data API provided by Google. A total of 1,100 Hanfu-related video samples were collected using the keyword "Chinese Hanfu". This keyword was selected based on preliminary research and careful consideration, as it was found to be the most direct and widely used term relevant to the focus of this study. Additionally, an analysis of search trend charts on YouTube indicated that "Chinese Hanfu" has experienced a notable surge in search volume, which may be associated with China’s visa-free entry policy introduced for multiple countries in 2024.
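The keyword search described above could be issued against the YouTube Data API v3 `search` endpoint. The following is only a minimal sketch of how such a request URL might be built, not the authors' script: the function name `build_search_url`, the page size of 50, and the `publishedBefore` timestamp (one reading of the 31 December 2024 cutoff) are assumptions for illustration. No request is sent, and a valid API key would be needed to use the URL.

```python
from urllib.parse import urlencode

# Public endpoint of the YouTube Data API v3 search resource.
SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(api_key, query="Chinese Hanfu",
                     published_before="2024-12-31T00:00:00Z",
                     page_token=None):
    """Build (but do not send) a search request URL.

    Hypothetical helper: the cutoff timestamp and page size are
    assumptions, not values stated in the dataset description.
    """
    params = {
        "part": "snippet",          # return basic video metadata
        "q": query,                 # search keyword
        "type": "video",            # restrict results to videos
        "maxResults": 50,           # largest page size the endpoint accepts
        "publishedBefore": published_before,  # RFC 3339 timestamp
        "key": api_key,
    }
    if page_token:
        # Token taken from a previous response to fetch the next page.
        params["pageToken"] = page_token
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"
```

Collecting more videos than fit on one page would mean repeating the request with the `nextPageToken` value returned in each response.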
To balance data quality and collection efficiency, an exponential-logarithmic sampling formula was employed. The number of comments sampled (C) was calculated according to the following equation:
C = 50 × log(Video Views) + 10
To address the large variance in video popularity, the logarithm of video views was applied to compress the influence of high view counts. A coefficient of 50 and a constant of 10 were used to adjust the overall sampling scale and to ensure adequate comment coverage for videos with lower view counts. Specifically, for videos with fewer than 1,000 views, all comments were collected. For videos with more than 1,000 views, the number of sampled comments was determined using the specified formula. Compared to simple random sampling, this approach helps reduce data redundancy, while balancing the representativeness of both high-traffic and niche videos, thereby providing a robust and efficient foundation for trend analysis.
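The sampling rule above can be illustrated with a short sketch. The description does not state the logarithm base, the rounding rule, or how a video with exactly 1,000 views is handled, so this example assumes base 10, rounding to the nearest integer, and applying the formula from 1,000 views upward. The function name `comments_to_sample` and the cap at the number of available comments are likewise assumptions.

```python
import math


def comments_to_sample(views: int, available: int) -> int:
    """Return how many comments to sample for one video.

    Assumptions (not stated in the source): base-10 logarithm,
    rounding to the nearest integer, formula applied at >= 1,000 views,
    and the result capped at the number of comments that exist.
    """
    if views < 1000:
        # Low-view videos: keep every comment.
        return available
    target = 50 * math.log10(views) + 10
    return min(available, round(target))


# Under these assumptions, a video with 1,000,000 views and 5,000
# comments yields 50 * 6 + 10 = 310 sampled comments.
print(comments_to_sample(1_000_000, 5000))
```

With a natural logarithm instead of base 10, the same video would yield roughly 701 comments, so the choice of base materially changes the sample size.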
All selected videos were published before 31 December 2024, the end date of data collection. No fixed start date was set, as the purpose of this study was to support subsequent time-series analyses of keyword trends. To capture long-term dynamics and patterns, videos were sampled across an extended time frame. This design ensured both the breadth and continuity of the dataset, facilitating the exploration of long-term behavioral patterns and structural regularities across a broad temporal spectrum.
The following ten categories of metadata were retrieved using a web crawler: video URL, upload date, title, textual description, number of views, number of comments, content of each comment, number of likes per comment, timestamp of each comment, and total number of comments per video. In total, 26,935 valid comments were collected.
Provider:
Mendeley Data
Created:
2025-04-14



