Cross-Lingual Dataset of Crisis-Related Social Media
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/7714014
Description:
The cross-lingual natural disaster dataset includes public tweets collected using Twitter’s public API, filtering by location-related keywords and date, without using any additional filtering (e.g., we did not restrict the query to specific languages). We considered five disaster events between January 2020 and February 2021 that received substantial news coverage internationally.
All messages include a “language” field computed by Twitter using a language detection model developed specifically for tweets. We counted the number of messages per language in each event. Three of the top languages were common to all the studied events: English (ISO 639-1 code: en), Spanish (es), and French (fr). Additionally, we found several hundred messages for each event in other languages, including Catalan (ca), Tagalog (tl), Croatian (hr), German (de), Japanese (ja), Indonesian (id), and Portuguese (pt).
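The per-language counts described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the record layout (dicts with hypothetical `event` and `lang` keys), the helper names, and the sample rows are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def count_languages(messages):
    """Count messages per language within each event.

    `messages` is an iterable of dicts with hypothetical keys
    "event" and "lang" (an ISO 639-1 code such as "en").
    """
    counts = defaultdict(Counter)
    for msg in messages:
        counts[msg["event"]][msg["lang"]] += 1
    return counts


def common_top_languages(counts, k=3):
    """Return the languages that rank in the top k of every event."""
    top_sets = [
        {lang for lang, _ in per_event.most_common(k)}
        for per_event in counts.values()
    ]
    return set.intersection(*top_sets) if top_sets else set()


# Made-up sample records, only to show the expected shape of the input.
sample = (
    [{"event": "event_a", "lang": "en"}] * 3
    + [{"event": "event_a", "lang": "es"}] * 2
    + [{"event": "event_a", "lang": "ja"}]
    + [{"event": "event_b", "lang": "en"}] * 2
    + [{"event": "event_b", "lang": "es"}] * 2
    + [{"event": "event_b", "lang": "pt"}]
)

counts = count_languages(sample)
shared = common_top_languages(counts, k=2)
```

With the real data, `k=3` would correspond to the check that English, Spanish, and French appear among the top languages of every event.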
After collecting the data, we labelled the tweets, or their English translations, that contained potentially informative factual content. We call this group of tweets “informative messages.” Next, we used crowdsourcing to sort these messages into informational categories: three different workers labelled each of the approximately 5,700 informative messages across languages. The target categories were based on an ontology from TREC-IS 2018, in which we grouped some low-level categories into higher-level ones.
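The description does not say how the three worker labels per message were combined into a single category. One common option is a strict majority vote, sketched below as an assumption rather than as the authors' procedure; the function name and the category strings are hypothetical.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by a strict majority of workers, else None.

    Assumption: a message with no majority (for example, three different
    labels) is left unresolved rather than assigned a category.
    """
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None


# Hypothetical category names, for illustration only.
agreed = majority_label(["caution", "caution", "donation"])
unresolved = majority_label(["caution", "donation", "report"])
```

Returning `None` for ties keeps disagreements visible, so they can be reviewed or dropped explicitly instead of being resolved arbitrarily.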
Created:
2023-03-10



