
Dataset of traffic accidents reported on Twitter Bogotá Colombia

Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/5548474
Description:
1 Classification Dataset

The dataset for the classification model contains 3,804 tweets: 1,902 are related to traffic accident reports (TA, positive class) and 1,902 are unrelated (NTA, negative class). To train the tweet classification model, a collaborative labeling strategy was designed in which 30 people labeled data according to the instructions given. Each participant evaluated a tweet and manually classified it into one of three categories: traffic accident related, unrelated, or don't know/no response. Each tweet was evaluated by 3 participants. The label was selected by voting: all 3 participants had to agree on it, otherwise the tweet was excluded from training. This process took a month and required the development and deployment of a web application.

2 NER Dataset (Named Entity Recognition)

To train the entity recognition model, a sample was taken from the filtered tweets produced by the previous classification phase. In total, 1,340 tweets were extracted, of which 800 (almost 60% of the sample) are from "unofficial" users. These tweets are user reports on traffic incidents that occurred in Bogotá from October 2018 to July 2019, together with other tweets that contain location references, such as reports on the state of road infrastructure; some tweets from 2016 and 2017 were also included. Although these additional posts were not related to accidents per se, they were selected because they contained location information. The purpose was to train a model that recognizes these entities, since a classifier of accident-related tweets had already been created. The dataset was split, reserving 1,072 tweets for training and 268 for evaluation.

The dataset was manually labeled in the IOB (Inside-Outside-Beginning) format, using the labeling tool called Brat Annotation Tools. The labels defined are Location, which refers to the location of the report, and Time, which refers to the time or date of the incident. Accordingly, 5 labels were generated: B-loc, I-loc, B-time, I-time and O. The O label refers to Others.

3 Traffic accident Twitter geolocation

A dataset of 26,362 traffic accident tweets with the coordinates of the incident and the date of publication.
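
The two labeling rules described above (unanimous agreement among 3 annotators, and IOB tagging with the labels B-loc, I-loc, B-time, I-time and O) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the short label codes "TA", "NTA" and "DK", the token-span representation, and the example sentence are all assumptions made for illustration, and the example tweet is invented rather than taken from the dataset. The sketch also assumes that a unanimous "don't know" vote is excluded, which the description does not state explicitly.

```python
# Hypothetical label codes; the description names the categories only in words.
TA, NTA, DONT_KNOW = "TA", "NTA", "DK"


def consensus_label(votes):
    """Return the agreed label if all 3 annotators chose the same class.

    Returns None when the tweet should be excluded from training.
    Assumption: a unanimous "don't know" is also excluded.
    """
    if len(votes) != 3:
        raise ValueError("expected exactly 3 votes per tweet")
    first = votes[0]
    if first in (TA, NTA) and all(v == first for v in votes):
        return first
    return None


def to_iob(tokens, spans):
    """Convert entity spans to IOB tags.

    spans is a list of (start, end, entity) with token indices,
    end exclusive, and entity either "loc" or "time".
    """
    tags = ["O"] * len(tokens)
    for start, end, entity in spans:
        tags[start] = f"B-{entity}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity}"          # remaining tokens of the entity
    return tags


# Invented example tweet, not part of the dataset.
tokens = ["Choque", "en", "la", "Calle", "80", "esta", "mañana"]
spans = [(3, 5, "loc"), (5, 7, "time")]
tags = to_iob(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Under this rule a tweet voted TA, TA, NTA is dropped rather than resolved by majority, which matches the description's requirement that all 3 participants agree.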
Created:
2021-10-05