Detecting-Warning-Labels-on-E-Cigarette-Content

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/15191021

Resource description:

Detecting Warning Labels on E-Cigarette Content Across Social Media Platforms

Introduction

This repository contains scripts for collecting data from TikTok and YouTube, processing it, and feeding it to a rule-based classifier. The pipeline consists of several steps: video downloading, screenshot extraction, OCR processing, language detection, classification, and statistical analysis.

Technical requirements

Before proceeding, ensure that you have the following:

- Python 3.x
- An Oracle Cloud instance
- A Box account with API access
- Basic command-line knowledge

Required Dependencies

Install the following Python libraries before running the scripts:

    pip install opencv-python pandas numpy pytesseract langdetect requests boxsdk

Warning Label Detection Workflow

Figure: Framework for collecting and extracting text from YouTube and TikTok videos, detailing the language detection process and the development of a rule-based classifier for warning labels.

Box Download

1. Sign up for a free Box account.
2. Open the developer console, go to the app console, and click "Create New App".
3. Choose "Custom App", select OAuth 2.0 as the authentication type, and create the app.
4. In the Configuration tab of the new app, select "App access only".
5. Enable "Make API calls using the as-user header" and "Generate user access tokens".
6. Copy the client ID, client secret, and developer token. The developer token expires after 60 minutes, so it must be regenerated every 60 minutes.
7. Get the folder ID from Box.
8. On the Oracle instance, write a script that downloads all the videos (get_videos.py).

Image Processing

Figure: Image processing procedures for each social media platform.

Screenshots

The script takes a screenshot every second up to a maximum time of 79 seconds.
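The screenshot step described above can be sketched roughly as follows. This is a minimal illustration using OpenCV, not the repository's actual script: the function names, the output file naming, and the choice to start at 0 seconds and include the 79-second mark are all assumptions.

```python
import os


def screenshot_timestamps(duration_s, interval_s=1.0, max_time_s=79.0):
    """Return the timestamps (in seconds) at which to grab a frame.

    Assumption: sampling starts at 0 s and stops at the video's end
    or at max_time_s, whichever comes first.
    """
    limit = min(duration_s, max_time_s)
    count = int(limit // interval_s) + 1
    return [i * interval_s for i in range(count)]


def extract_screenshots(video_path, out_dir, interval_s=1.0, max_time_s=79.0):
    """Save one PNG per sampled timestamp and return the saved paths."""
    import cv2  # imported here so the helper above works without OpenCV

    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Cannot open video: {video_path}")

    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    duration_s = frame_count / fps if fps > 0 else 0.0

    saved = []
    for t in screenshot_timestamps(duration_s, interval_s, max_time_s):
        # Seek by time (milliseconds), then decode the frame at that position.
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)
        ok, frame = cap.read()
        if not ok:
            break
        path = os.path.join(out_dir, f"frame_{int(t):03d}.png")
        cv2.imwrite(path, frame)
        saved.append(path)

    cap.release()
    return saved
```

Under these assumptions, a video longer than 79 seconds yields 80 screenshots (0 through 79 seconds), while a shorter video yields one screenshot per full second of its length plus the frame at 0 seconds.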
Oracle Vision

Oracle Vision takes the screenshots as images and converts them into text (ocr.py). A script called remove_null.py then removes the rows that contain no text, and a script called remove_textdup.py removes the duplicate texts from the output text file.

Language Detection

We write a program that takes all the text extracted from the images in the CSV file, detects its language, and writes the results to a file called extracted_lang_output.txt (lang_detector.py).

In the next step, the script lang_score.py parses the JSON-formatted text file (extracted_lang_output.txt) and applies a threshold of 90%: if the language detection score is higher than 90%, the prediction is considered valid and is written to a CSV file (language_score.csv) together with the language name and the text.

We then remove the duplicates with a script called remove_duplicates.py, and keep the English text rows by running:

    cat unique_lang_score.csv | grep English > warnings.txt

Classifier

We write a classifier script that checks whether each text row satisfies the classifier conditions (classifier.py).

Condition Example Images

- An example YouTube post that contains a warning label meeting Condition 1
- Warning label from TikTok that fulfills Conditions 1 and 2
- Example of a warning label from YouTube that fulfills Conditions 1 and 2 (brand names blurred for anonymity)

Conditions

- parser.py checks whether a text row belongs to Condition 1 or Condition 2.
- count.py counts how many rows meet Condition 1 and how many meet Condition 2.
- overlap.py checks how many rows meet both conditions.

Video Length

- All_video_length.py calculates the length of every video.
- handle.py normalizes the video durations (format: minutes and seconds).
- average.py calculates the average of all video lengths.
- std.py calculates the standard deviation of the video lengths.
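The video length statistics above (normalizing durations, then computing the average and the standard deviation) can be illustrated with a short sketch. It is not the repository's code: the "M:SS" string format for normalized durations, the function names, and the use of the sample (rather than population) standard deviation are assumptions made for this example.

```python
import statistics


def to_seconds(duration):
    """Convert a duration string in the assumed 'M:SS' format to seconds."""
    minutes, seconds = duration.strip().split(":")
    return int(minutes) * 60 + int(seconds)


def format_mmss(seconds):
    """Format a number of seconds as 'M:SS' (minutes and seconds)."""
    total = int(round(seconds))
    return f"{total // 60}:{total % 60:02d}"


def summarize(durations):
    """Return (mean, sample standard deviation) of the durations in seconds."""
    values = [to_seconds(d) for d in durations]
    # statistics.stdev needs at least two values; it divides by n - 1.
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    # Hypothetical durations, for illustration only.
    mean_s, std_s = summarize(["0:30", "1:00", "1:30"])
    print("mean:", format_mmss(mean_s), "std (s):", std_s)
```

If the population standard deviation is what is wanted instead, `statistics.pstdev` can be swapped in for `statistics.stdev`.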

Created:

2025-04-10
