Panama City, Panama road traffic incidents 2014-2022 - Social Media Dataset (in Spanish)

Mendeley Data | Updated 2024-03-27 | Indexed 2024-06-28
Download link:
https://data.mendeley.com/datasets/tmwrd45m7x
Resource description:
The raw data set is composed of 200,410 tweets in Spanish from the road traffic social reporting account @traficocpanama (1_raw_data_200410.csv). The tweets were collected between January 2014 and May 2022 using the Python programming language with the Selenium and Tweepy modules. The raw data set was first processed by keeping tweets with at least 3 words and then removing stop words (see stop-words.csv), which brought the number of tweets to 192,707 (2_preliminar_data_192707.csv). The second cut-off used a machine learning classification model to select tweets related to: 1) Accidents (in Spanish: Choques, accidentes, colisiones, vuelcos, atropellos; in English: Crashes, accidents, collisions, overturns, run-overs); 2) Obstacles (in Spanish: Tranques, huelgas, motines, paros, protestas, trabajos en vía, cierres; in English: Traffic jams, strikes, riots, stoppages, protests, road works, closures); 3) Dangers (in Spanish: Incendios, inundaciones, lluvias fuertes; in English: Fires, floods, heavy rains). This brought the number of tweets to 120,000 (3_sample_class_120000.csv). Finally, a machine learning incident categorization model was trained on 51,000 tweets across the categories accident, obstacle, and danger (4_sample_categ_51000.csv). This data set is intended for academic use and research in natural language processing (NLP) in Spanish, especially road traffic incident detection.
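
The first processing step described above (keep tweets with at least 3 words, then remove stop words) could be sketched roughly as follows. This is only an illustration: the whitespace tokenization, the `preprocess` function name, and the tiny sample and stop-word lists are assumptions, not the authors' code. The actual stop words are in stop-words.csv, whose layout is not described here.

```python
def preprocess(tweets, stop_words, min_words=3):
    """Keep tweets with at least `min_words` words, then drop stop words.

    Assumption: words are split on whitespace; the dataset description
    does not say how the authors tokenized the text.
    """
    stop = {w.lower() for w in stop_words}
    cleaned = []
    for text in tweets:
        words = text.split()
        if len(words) < min_words:
            continue  # first cut-off: discard tweets that are too short
        kept = [w for w in words if w.lower() not in stop]
        cleaned.append(" ".join(kept))
    return cleaned


# Hypothetical sample tweets and stop words, for illustration only.
sample = [
    "Choque en la Via España",
    "Tranque fuerte",
    "Inundación en la Tumba Muerto",
]
stops = ["en", "la", "el", "de"]

for line in preprocess(sample, stops):
    print(line)
```

Applying the length filter before stop-word removal follows the order given in the description; reversing the two steps would discard more tweets.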

Created:
2024-01-23