Detecting-Warning-Labels-on-E-Cigarette-Content
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/15191021
Resource description:
Detecting Warning Labels on E-Cigarette Content Across Social Media Platforms
Introduction
This repository contains scripts for collecting data from TikTok and YouTube, processing it, and feeding it to a rule-based classifier. The pipeline consists of several steps: video downloading, screenshot extraction, OCR processing, language detection, classification, and statistical analysis.
Technical requirements
Before proceeding, ensure that you have the following:
Python 3.x
An Oracle cloud instance
A Box account with API access
Basic command-line knowledge
Required Dependencies
Install the following Python libraries before running the scripts:
pip install opencv-python pandas numpy pytesseract langdetect requests boxsdk
Warning Label Detection Workflow
The figure illustrates the framework for collecting and extracting text from YouTube and TikTok videos, detailing the language detection process and the development of a rule-based classifier for warning labels.
Box Download
First, we sign up for a free Box account.
We open the developer console, go to the app console, and click Create New App.
We then choose Custom App.
We select OAuth 2.0 as the authentication type and create the app.
In the Configuration tab of the app we just created, we select App Access Only.
We then enable "Make API calls using the as-user header" and "Generate user access tokens".
We then copy the client ID, client secret, and developer token.
The developer token expires after 60 minutes, so it has to be regenerated every 60 minutes.
We now get the folder ID from Box.
On the Oracle instance, we write a script (get_videos.py) that downloads all the videos.
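The download step could be sketched as follows. This is a minimal illustration using the boxsdk package, not the contents of get_videos.py; the function names make_client and download_folder and the flat-folder layout are assumptions made for the example.

```python
import os


def make_client(client_id, client_secret, developer_token):
    """Build a Box client from the credentials collected in the steps above."""
    # Imported lazily so this module still loads where boxsdk is not installed.
    from boxsdk import Client, OAuth2

    auth = OAuth2(
        client_id=client_id,
        client_secret=client_secret,
        access_token=developer_token,  # developer tokens expire after 60 minutes
    )
    return Client(auth)


def download_folder(client, folder_id, dest_dir):
    """Download every file directly inside a Box folder; return the saved paths.

    Assumption: the videos sit at the top level of one folder, so subfolders
    are skipped rather than walked.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for item in client.folder(folder_id=folder_id).get_items():
        if item.type != "file":
            continue  # skip subfolders and web links
        path = os.path.join(dest_dir, item.name)
        with open(path, "wb") as fh:
            client.file(item.id).download_to(fh)
        saved.append(path)
    return saved
```

A typical call would be download_folder(make_client(client_id, client_secret, token), folder_id, "videos"), with the token refreshed whenever a run lasts longer than an hour.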
Image Processing
Image processing procedures for each social media platform
Screenshots
The script takes one screenshot per second of video, up to a maximum of 79 seconds.
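One way to implement this sampling with OpenCV is sketched below. It is an illustration under stated assumptions, not the repository's script: the function names, the output file naming, and the choice to include both second 0 and second 79 are ours.

```python
def screenshot_times(duration_s, max_seconds=79, step_s=1):
    """Return the whole-second timestamps at which to grab a frame.

    Assumption: sampling starts at second 0 and includes the cap itself,
    so a long video yields frames at 0, 1, ..., 79.
    """
    last = min(int(duration_s), max_seconds)
    return list(range(0, last + 1, step_s))


def extract_screenshots(video_path, out_dir, max_seconds=79):
    """Save one PNG per sampled second of the video; return the saved paths."""
    import os

    import cv2  # provided by the opencv-python package

    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    duration_s = frame_count / fps if fps > 0 else 0

    saved = []
    for t in screenshot_times(duration_s, max_seconds):
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)  # seek to second t
        ok, frame = cap.read()
        if not ok:
            break  # ran past the last decodable frame
        path = os.path.join(out_dir, f"frame_{t:03d}.png")
        cv2.imwrite(path, frame)
        saved.append(path)
    cap.release()
    return saved
```

Keeping the timestamp logic in its own function makes the 79-second cap easy to check without decoding any video.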
Oracle Vision
The OCR script (ocr.py) takes the screenshot images and converts them into text.
We then write a script (remove_null.py) that removes the rows containing no text.
We then write a script (remove_textdup.py) that removes duplicate texts from the output text file.
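The two clean-up steps can be sketched with the standard csv module. The column name "text", the CSV layout, and the decision to ignore case and spacing when comparing texts are assumptions for this example; the repository's scripts may differ on all three.

```python
import csv


def drop_empty_rows(rows, text_field="text"):
    """Keep only rows whose text field has non-whitespace content."""
    return [r for r in rows if (r.get(text_field) or "").strip()]


def drop_duplicate_texts(rows, text_field="text"):
    """Keep the first row for each distinct text, preserving input order."""
    seen = set()
    unique = []
    for r in rows:
        # Assumption: texts differing only in case or spacing are duplicates.
        key = " ".join(r[text_field].split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def clean_csv(src, dst, text_field="text"):
    """Read CSV rows from file object src, write cleaned rows to dst.

    Returns the number of rows written (excluding the header).
    """
    reader = csv.DictReader(src)
    rows = drop_duplicate_texts(drop_empty_rows(reader, text_field), text_field)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [])
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)
```

Removing empty rows first means the deduplication step never has to handle missing text.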
Language Detection
We write a program (lang_detector.py) that takes all the text extracted from the images in the CSV file, detects its language, and writes the results to a file called extracted_lang_output.txt.
In the next step, the script lang_score.py parses the JSON-formatted text file (extracted_lang_output.txt) and applies a threshold of 90%: if the language detection score is higher than 90%, the prediction is considered valid and is written to a CSV file (language_score.csv) together with the language name and the text.
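The thresholding step might look like the sketch below. The source does not describe the layout of extracted_lang_output.txt, so the sketch assumes one JSON object per line with the hypothetical keys "text", "language", and "score" (a value between 0 and 1).

```python
import csv
import json


def filter_confident(lines, threshold=0.90):
    """Return (language, score, text) for records scoring above the threshold.

    Assumption: each non-empty line is a JSON object with the keys
    "text", "language", and "score".
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if rec["score"] > threshold:  # strictly higher than 90%
            rows.append((rec["language"], rec["score"], rec["text"]))
    return rows


def write_language_scores(rows, out):
    """Write the kept rows, with a header, to the file object out."""
    writer = csv.writer(out)
    writer.writerow(["language", "score", "text"])
    writer.writerows(rows)
```

With these helpers, filter_confident can be applied to the open extracted_lang_output.txt file and its result passed to write_language_scores along with a handle on language_score.csv.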
We then remove the duplicates with a script called remove_duplicates.py.
We then extract the English text rows by running the command:
cat unique_lang_score.csv | grep English > warnings.txt
Classifier
We write a classifier script to check if each text row satisfies the classifier conditions. (classifier.py)
Condition Example Images
An example YouTube post which contains a warning label that meets Condition 1
Warning Label from TikTok that fulfills Condition 1 and 2
Example of Warning Label from YouTube that fulfills Condition 1 and 2 (brand names blurred for anonymity)
Conditions
We write a script that checks if the text row belongs to condition 1 or condition 2. (parser.py)
We write a script that counts how many rows meet condition 1 and how many meet condition 2. (count.py)
We write a script that checks how many of the conditions overlap. (overlap.py)
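The labeling, counting, and overlap steps can be sketched together. This README does not spell out the two conditions, so they are passed in as functions; the two example predicates below are hypothetical placeholders, not the repository's actual rules.

```python
def label_rows(texts, condition_1, condition_2):
    """Return one (text, meets_1, meets_2) tuple per input text."""
    return [(t, bool(condition_1(t)), bool(condition_2(t))) for t in texts]


def summarize(labeled):
    """Count rows meeting each condition and rows meeting both."""
    count_1 = sum(1 for _, meets_1, _ in labeled if meets_1)
    count_2 = sum(1 for _, _, meets_2 in labeled if meets_2)
    overlap = sum(1 for _, meets_1, meets_2 in labeled if meets_1 and meets_2)
    return {"condition_1": count_1, "condition_2": count_2, "overlap": overlap}


# Hypothetical placeholder rules, used only to make the sketch runnable.
def example_condition_1(text):
    return "warning" in text.lower()


def example_condition_2(text):
    return "nicotine" in text.lower()
```

Passing the conditions as parameters keeps the counting logic independent of how the rules are eventually defined.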
Video Length
We write a script that calculates the length of all videos. (All_video_length.py)
We write a script that normalizes the video durations to a minutes-and-seconds format. (handle.py)
We write a script that calculates the average of all video lengths. (average.py)
We write a script that calculates the standard deviation of the video lengths. (std.py)
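The duration statistics can be sketched with the standard library alone. The M:SS string format and the use of the sample (rather than population) standard deviation are assumptions; the source does not say which the scripts use.

```python
import statistics


def format_duration(seconds):
    """Format a duration in seconds as M:SS (minutes and seconds)."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"


def parse_duration(text):
    """Parse an M:SS string back into a number of seconds."""
    minutes, secs = text.strip().split(":")
    return int(minutes) * 60 + int(secs)


def duration_stats(durations_s):
    """Return (mean, sample standard deviation) of durations in seconds.

    Assumption: the sample standard deviation is wanted; swap in
    statistics.pstdev for the population version.
    """
    return statistics.mean(durations_s), statistics.stdev(durations_s)
```

Computing the statistics on raw seconds and formatting only for display avoids rounding errors from the M:SS representation.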
Created:
2025-04-10



