quanml0703/CrisisMMD_v2

Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/quanml0703/CrisisMMD_v2
Description:
---
license: cc-by-nc-sa-4.0
task_categories:
- image-classification
language:
- en
tags:
- Disaster
- Crisis Informatics
pretty_name: 'CrisisMMD: Multimodal Twitter Datasets from Natural Disasters'
size_categories:
- 10K<n<100K
configs:
- config_name: humanitarian
  data_files:
  - split: train
    path: humanitarian/train_image_text_label.csv
  features:
    image:
      dtype: image
    tweet_text:
      dtype: string
    label:
      dtype: string
---

# CrisisMMD: Multimodal Twitter Datasets from Natural Disasters

The **CrisisMMD** multimodal Twitter dataset consists of several thousand manually annotated tweets and images collected during seven major natural disasters in 2017, including earthquakes, hurricanes, wildfires, and floods. The dataset includes three types of annotations, one per disaster response task listed below.

This Hugging Face repository hosts version 2.0 of the CrisisMMD dataset. Further information is given below.

### Disaster Response Tasks

1. **Task 1: Informative vs Not Informative**
   - Informative
   - Not informative
   - "Don't know or can't judge" → **Removed in version 2.0**
2. **Task 2: Humanitarian Categories**
   - Affected individuals
   - Infrastructure and utility damage
   - Injured or dead people
   - Missing or found people
   - Rescue, volunteering, or donation effort
   - Vehicle damage
   - Other relevant information
   - "Not relevant or can't judge" → **Updated to "Not humanitarian" in version 2.0**
3. **Task 3: Damage Severity Assessment**
   - Severe damage
   - Mild damage
   - Little or no damage
   - "Don't know or can't judge"

## Dataset Details

The keywords used for collecting tweets, along with the start and end dates for each event, are outlined in the following table.

| Crisis Name | Keywords | Start Date | End Date |
|-------------|----------|------------|----------|
| [Hurricane Irma](https://en.wikipedia.org/wiki/Hurricane_Irma) | Hurricane Irma, Irma storm, Storm Irma, etc. | September 6, 2017 | September 21, 2017 |
| [Hurricane Harvey](https://en.wikipedia.org/wiki/Hurricane_Harvey) | Hurricane Harvey, Tornado, etc. | August 25, 2017 | September 20, 2017 |
| [Hurricane Maria](https://en.wikipedia.org/wiki/Hurricane_Maria) | Hurricane Maria, Maria Storm, etc. | September 20, 2017 | November 13, 2017 |
| [California wildfires](https://en.wikipedia.org/wiki/List_of_California_wildfires) | California fire, USA Wildfire, etc. | October 10, 2017 | October 27, 2017 |

### Event-wise data distribution

For each event, we collected tweets and their associated images, then filtered and sampled them for annotation.

## [**Data distribution from the CrisisMMD version v1.0**](https://crisisnlp.qcri.org/data/crisismmd/CrisisMMD_v1.0.tar.gz)

| Crisis Name | # Tweets | # Images | # Filtered Tweets | # Sampled Tweets | # Sampled Images |
|-------------|----------|----------|-------------------|------------------|------------------|
| Hurricane Irma | 3,517,280 | 176,972 | 5,739 | 4,041 | 4,525 |
| Hurricane Harvey | 6,664,349 | 321,435 | 19,967 | 4,000 | 4,443 |
| Hurricane Maria | 2,953,322 | 52,231 | 6,597 | 4,000 | 4,562 |
| California wildfires | 455,311 | 10,130 | 1,488 | 1,486 | 1,589 |
| Mexico earthquake | 383,341 | 7,111 | 1,241 | 1,239 | 1,382 |
| Iraq-Iran earthquake | 207,729 | 6,307 | 501 | 499 | 600 |
| Sri Lanka floods | 41,809 | 2,108 | 870 | 832 | 1,025 |
| **Total** | **14,223,141** | **576,294** | **36,403** | **16,097** | **18,126** |

## Data preparation for multimodal baseline

For the multimodal baseline experiments, we first combined the tweet texts and images from all events. This produced 24 duplicate entries (identified by tweet ID, text, and associated images). We checked these duplicates manually and kept the ones that were annotated properly. We changed the label "Not relevant or can't judge" to "Not humanitarian". In addition, because the annotations include the label "Don't know or can't judge", we removed the entries carrying it for the classification experiments. In total, this preprocessing filtered out 39 tweets and their 44 associated images. The resulting dataset consists of 16,058 tweet texts and 18,082 images, as shown in the following tables. This version of the dataset is released as version 2.0 and is available for download.

## [**Data distribution from the CrisisMMD version v2.0**](https://crisisnlp.qcri.org/data/crisismmd/CrisisMMD_v2.0.tar.gz)

In this version, the "Not relevant or can't judge" label has been mapped to "Not humanitarian" for the humanitarian task. Additionally, the "Not informative" label from the informative task has also been mapped to "Not humanitarian" for the humanitarian task. Duplicate entries from different events have been removed.

### Informativeness

| | Text | Image |
|---|------|-------|
| Informative | 11,509 | 9,374 |
| Not informative | 4,549 | 8,708 |
| **Total** | 16,058 | 18,082 |

### Humanitarian

| | Text | Image |
|---|------|-------|
| Affected individuals | 472 | 562 |
| Infrastructure and utility damage | 1,210 | 3,624 |
| Injured or dead people | 486 | 110 |
| Missing or found people | 40 | 14 |
| Not humanitarian | 4,549 | 8,708 |
| Other relevant information | 5,954 | 2,529 |
| Rescue, volunteering, or donation effort | 3,293 | 2,231 |
| Vehicle damage | 54 | 304 |
| **Total** | 16,058 | 18,082 |

### Damage Severity

| | Text | Image |
|---|------|-------|
| Little or no damage | - | 475 |
| Mild damage | - | 839 |
| Severe damage | - | 2,212 |
| **Total** | - | 3,526 |

## Downloads (Alternate options)

- **CrisisMMD dataset version v2.0**: [Download labeled images and tweets (~1.8GB)](https://crisisnlp.qcri.org/data/crisismmd/CrisisMMD_v2.0.tar.gz)
- **Datasplit**: [Annotations Download](https://crisisnlp.qcri.org/data/crisismmd/crisismmd_datasplit_all.zip)
- **Datasplit for multimodal baseline with agreed labels**: [Annotations Download](https://crisisnlp.qcri.org/data/crisismmd/crisismmd_datasplit_agreed_label.zip)

## Citation

**Please cite the following papers if you use any of these resources in your research.**

1. [Ferda Ofli](https://sites.google.com/site/ferdaofli/), [Firoj Alam](https://firojalam.one/), and [Muhammad Imran](http://mimran.me/), [**Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response**](https://arxiv.org/abs/2004.11838), In Proceedings of the 17th International Conference on Information Systems for Crisis Response and Management (ISCRAM), 2020, USA.
2. [Firoj Alam](https://firojalam.one/), [Ferda Ofli](https://sites.google.com/site/ferdaofli/), and [Muhammad Imran](http://mimran.me/), [**CrisisMMD: Multimodal Twitter Datasets from Natural Disasters**](https://arxiv.org/pdf/1805.00713.pdf), In Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA.

```
@InProceedings{crisismmd2018icwsm,
  author = {Alam, Firoj and Ofli, Ferda and Imran, Muhammad},
  title = {{CrisisMMD}: Multimodal Twitter Datasets from Natural Disasters},
  booktitle = {Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM)},
  year = {2018},
  month = {June},
  date = {23-28},
  location = {USA}
}

@inproceedings{multimodalbaseline2020,
  Author = {Ferda Ofli and Firoj Alam and Muhammad Imran},
  Booktitle = {17th International Conference on Information Systems for Crisis Response and Management},
  Keywords = {Multimodal deep learning, Multimedia content, Natural disasters, Crisis Computing, Social media},
  Month = {May},
  Organization = {ISCRAM},
  Publisher = {ISCRAM},
  Title = {Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response},
  Year = {2020}
}
```
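
## Example: v2.0 label mapping (illustrative)

The relabelling described under "Data preparation for multimodal baseline" can be sketched roughly as follows. This is a minimal illustration, not the authors' preprocessing script: the row keys `informative` and `humanitarian` are hypothetical, the label strings are copied from the prose of this card and may differ from the strings in the released annotation files, and applying the "Don't know or can't judge" filter to the informativeness label is one reading of the description.

```python
from typing import Iterable, Iterator

# Label strings taken from the card's prose (assumption: the released
# annotation files may spell them differently).
DONT_KNOW = "Don't know or can't judge"
NOT_INFORMATIVE = "Not informative"
NOT_RELEVANT = "Not relevant or can't judge"
NOT_HUMANITARIAN = "Not humanitarian"


def to_v2_humanitarian(rows: Iterable[dict]) -> Iterator[dict]:
    """Yield rows relabelled for the v2.0 humanitarian task.

    Each row is a dict with the hypothetical keys "informative" and
    "humanitarian" holding the v1.0 labels.
    """
    for row in rows:
        # Entries nobody could judge are dropped for classification.
        if row["informative"] == DONT_KNOW:
            continue
        label = row["humanitarian"]
        # Both "Not informative" and "Not relevant or can't judge"
        # collapse into the single "Not humanitarian" class.
        if row["informative"] == NOT_INFORMATIVE or label == NOT_RELEVANT:
            label = NOT_HUMANITARIAN
        yield {**row, "humanitarian": label}


# Made-up sample rows, for illustration only.
sample = [
    {"informative": "Informative", "humanitarian": "Vehicle damage"},
    {"informative": "Not informative", "humanitarian": "Not relevant or can't judge"},
    {"informative": "Informative", "humanitarian": "Not relevant or can't judge"},
    {"informative": "Don't know or can't judge", "humanitarian": "Vehicle damage"},
]
relabelled = list(to_v2_humanitarian(sample))
print([row["humanitarian"] for row in relabelled])
```

Under this mapping the "Not humanitarian" class and the "Not informative" class cover the same entries, which matches the identical counts (4,549 texts and 8,708 images) reported for both in the v2.0 tables.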

Provided by:
quanml0703