
(Non)smoking comments classified by arguments, gender and age

Source: NIAID Data Ecosystem. Indexed: 2026-05-02
Download link:
https://zenodo.org/record/14782952
Description:
Data collection. Methods of classification

We collected the comments between March and August 2024 from the most popular Russian-language YouTube videos in which smoking was discussed. Videos were selected on three main criteria: 1) relevance of the video title (the topic of smoking); 2) language (Russian); 3) popularity (number of views). Our goal was to cover all of the most popular videos, since they drew a large number of people into the discussions. The search for relevant videos was carried out directly on the YouTube platform. In total, we collected 204 videos (see Sheet ‘YouTube Video 204’). The final database includes more than 165 thousand comments (see Sheet ‘All Comments 165th’).

Sentiment was classified using Romanov’s method (Romanov A.S. Methodology for identifying the author of text information for solving cybersecurity problems. Abstract of the dissertation for the degree of Doctor of Technical Sciences. Tomsk, 2024 (Романов А.С. Методология идентификации автора текстовой информации для решения задач кибербезопасности. Автореферат диссертации на соискание ученой степени доктора технических наук. Томск, 2024)).

Using the generative LLM gemma2-9b-it, we classified more than 58 thousand comments by the presence and type of argument to quit or not to quit smoking (see Sheet ‘Argument 58ths’). This 58-thousand sample is the part of the 165 thousand comments that comes from the most populated videos. For more information on the classification of arguments, see Kalabikhina, I.E., Kazbekova, Z.G., & Zubova, E.A. (2024). Arguments of social media users regarding quitting smoking (based on machine learning methods). Management Issues, 18(5), 48–67 (Калабихина, И.Е., Казбекова, З.Г., & Зубова, Е.А. (2024). Доводы пользователей социальных медиа по поводу отказа от табакокурения (на основе методов машинного обучения). Вопросы управления, 18(5), 48–67).

Finally, using the same generative LLM (gemma2-9b-it), we classified the comments that contained an argument to quit or not to quit smoking by the gender and age of the comment author. This sample consists of 5.5 thousand classified comments containing an argument (see Sheet ‘Gender&Age 5.5ths’).

Prompt example. The best prompt for gender classification (84%, in Russian):

Определи пол авторов комментариев. Представь результаты в виде таблицы из 2 столбцов: первый - объяснение выбора, второй - пол автора (или мужской, или женский, или невозможно определить). В первую очередь обращай внимание на окончания глаголов, указывающие на принадлежность к мужскому или женскому полу. Пример: "Я сделала это" - автор этого комментария женщина, это видно по форме глагола. "Я сделал это" - автор этого комментария - мужчина.

English translation of the prompt: Determine the gender of the comment authors. Present the results as a table with 2 columns: the first is the explanation of the choice, the second is the author's gender (male, female, or impossible to determine). First of all, pay attention to verb endings that indicate male or female gender. Example: "Я сделала это" ("I did it", feminine verb form): the author of this comment is a woman, as the verb form shows. "Я сделал это" ("I did it", masculine verb form): the author of this comment is a man.

Data format and structure

The database includes data on (non)smoking comments in .xls format.

The variable ‘Sentiment’ takes the following values (see Sheet ‘All Comments 165th’):
“NEGATIVE”
“neutral”
“POSITIVE”

The variable ‘Argument type’ takes the following values (see Sheet ‘Argument 58ths’):
“1” – The comment contains neither an argument to quit smoking nor an argument not to quit smoking.
“2” – The comment contains an argument to quit smoking because of the harm it causes to the smoker's health.
“3” – The comment contains an argument to quit smoking because of the high cost of cigarettes.
“4” – The comment contains an argument to quit smoking for reasons other than caring for one's own health and saving money.
“5” – The comment contains an argument not to quit smoking because of fear of gaining excess weight.
“6” – The comment contains an argument not to quit smoking for reasons other than fear of gaining excess weight.
“0” – Classification error.

The variables ‘gender’ and ‘age’ are found in Sheet ‘Gender&Age 5.5ths’.

The variable ‘gender’ takes the following values:
“1” – The comment was written by a man.
“2” – The comment was written by a woman.
“3” – The author’s gender cannot be identified.
“0” – Classification error.

The variable ‘age’ takes the following values:
“1” – The comment was written by a person under 18 years old.
“2” – The comment was written by a person aged 19-34.
“3” – The comment was written by a person aged 35+.
“0” – Classification error.
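
The numeric codes listed above can be turned back into readable labels before analysis. The following is a minimal sketch, not part of the dataset or the authors' tooling. The code books are transcribed from the description, with shortened labels. The assumption that a single row carries ‘Argument type’, ‘gender’ and ‘age’ together is ours, as are the helper names `decode` and `crosstab` and the sample rows.

```python
from collections import Counter

# Code books transcribed from the dataset description (labels shortened).
ARGUMENT_TYPE = {
    0: "classification error",
    1: "no argument",
    2: "quit: harm to health",
    3: "quit: cost of cigarettes",
    4: "quit: other reasons",
    5: "do not quit: fear of weight gain",
    6: "do not quit: other reasons",
}
GENDER = {0: "classification error", 1: "man", 2: "woman", 3: "cannot be identified"}
AGE = {0: "classification error", 1: "under 18", 2: "19-34", 3: "35+"}


def decode(row):
    """Replace the numeric codes in one row with readable labels.

    `row` is a dict keyed by 'Argument type', 'gender' and 'age'
    (assumed column names). Spreadsheet cells may arrive as int, float
    or str, so each value is normalised with int(float(...)).
    Codes missing from a code book stay visible as 'unknown code <n>'.
    """
    def label(book, value):
        code = int(float(value))
        return book.get(code, f"unknown code {code}")

    return {
        "argument": label(ARGUMENT_TYPE, row["Argument type"]),
        "gender": label(GENDER, row["gender"]),
        "age": label(AGE, row["age"]),
    }


def crosstab(rows):
    """Count (argument, gender) pairs, skipping classification errors."""
    counts = Counter()
    for row in rows:
        decoded = decode(row)
        if "classification error" in (decoded["argument"], decoded["gender"]):
            continue
        counts[(decoded["argument"], decoded["gender"])] += 1
    return counts


# Hypothetical rows for illustration only; not taken from the dataset.
sample = [
    {"Argument type": 2, "gender": 1, "age": 3},
    {"Argument type": "2", "gender": 1.0, "age": 2},
    {"Argument type": 5, "gender": 2, "age": 2},
    {"Argument type": 3, "gender": 0, "age": 1},
]
table = crosstab(sample)
```

Rows coded “0” are dropped from the tally rather than counted as a category, since the description marks them as classification errors. Reading the .xls sheets themselves is left out of the sketch, because the exact column layout of each sheet is not given in the description.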
Created:
2025-01-31