CrisisMMD
Source: ModelScope community (updated 2025-11-27, indexed 2024-08-31)
Download link: https://modelscope.cn/datasets/QCRI/CrisisMMD
Resource overview:
# CrisisMMD: Multimodal Twitter Datasets from Natural Disasters
The **CrisisMMD** multimodal Twitter dataset consists of several thousand manually annotated tweets and images collected during seven major natural disasters in 2017, including earthquakes, hurricanes, wildfires, and floods. Version 2.0 of the dataset is hosted on HuggingFace; further information is given below.
The dataset includes three types of annotations:
### Disaster Response Tasks
1. **Task 1: Informative vs Not Informative**
- Informative
- Not informative
- "Don't know or can't judge" → **Removed in version 2.0**
2. **Task 2: Humanitarian Categories**
- Affected individuals
- Infrastructure and utility damage
- Injured or dead people
- Missing or found people
- Rescue, volunteering, or donation effort
- Vehicle damage
- Other relevant information
- "Not relevant or can't judge" → **Updated to "Not humanitarian" in version 2.0**
3. **Task 3: Damage Severity Assessment**
- Severe damage
- Mild damage
- Little or no damage
- "Don't know or can't judge"
## Dataset Details
The keywords used for collecting tweets, along with the start and end dates for each event, are outlined in the following table.
| Crisis Name | Keywords | Start Date | End Date |
|--------------------|------------------------------------------------|-------------------|-------------------|
| [Hurricane Irma](https://en.wikipedia.org/wiki/Hurricane_Irma) | Hurricane Irma, Irma storm, Storm Irma, etc. | September 6, 2017 | September 21, 2017 |
| [Hurricane Harvey](https://en.wikipedia.org/wiki/Hurricane_Harvey) | Hurricane Harvey, Tornado, etc. | August 25, 2017 | September 20, 2017|
| [Hurricane Maria](https://en.wikipedia.org/wiki/Hurricane_Maria) | Hurricane Maria, Maria Storm, etc. | September 20, 2017| November 13, 2017 |
| [California wildfires](https://en.wikipedia.org/wiki/List_of_California_wildfires) | California fire, USA Wildfire, etc. | October 10, 2017 | October 27, 2017 |
### Event-wise data distribution
For each event, we collected tweets and their associated images, then filtered and sampled them for annotation.
## [**Data distribution in CrisisMMD v1.0**](https://crisisnlp.qcri.org/data/crisismmd/CrisisMMD_v1.0.tar.gz)
| Crisis Name | # Tweets | # Images | # Filtered Tweets | # Sampled Tweets | # Sampled Images |
|------------------------|-------------|------------|-------------------|------------------|------------------|
| Hurricane Irma | 3,517,280 | 176,972 | 5,739 | 4,041 | 4,525 |
| Hurricane Harvey | 6,664,349 | 321,435 | 19,967 | 4,000 | 4,443 |
| Hurricane Maria | 2,953,322 | 52,231 | 6,597 | 4,000 | 4,562 |
| California wildfires | 455,311 | 10,130 | 1,488 | 1,486 | 1,589 |
| Mexico earthquake | 383,341 | 7,111 | 1,241 | 1,239 | 1,382 |
| Iraq-Iran earthquake | 207,729 | 6,307 | 501 | 499 | 600 |
| Sri Lanka floods | 41,809 | 2,108 | 870 | 832 | 1,025 |
| **Total** | **14,223,141** | **576,294** | **36,403** | **16,097** | **18,126** |
## Data preparation for multimodal baseline
For the multimodal baseline experiments, we first combined the tweet texts and images from all events. This produced 24 duplicate entries (tweet IDs with their text and associated images). We manually checked these duplicates and kept the entries that were annotated properly. We changed the label "Not relevant or can't judge" to "Not humanitarian". In addition, because the annotations include the label "Don't know or can't judge", we removed the entries carrying it from the classification experiments. In total, this preprocessing filtered out 39 tweets and their 44 associated images (16,097 − 39 = 16,058 tweets; 18,126 − 44 = 18,082 images). The resulting dataset consists of 16,058 tweet texts and 18,082 images, as shown in the following tables. This version of the dataset is released as version 2.0 and is available for download.
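The preprocessing steps described above (merge all events, resolve duplicates, relabel, drop unjudgeable items) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields `tweet_id`, `image_id`, and `label`, the label strings, and the function name `prepare` are hypothetical and may not match the released annotation files. The authors resolved duplicates by manual inspection, which the sketch approximates by keeping the first occurrence.

```python
from typing import Iterable

# Hypothetical label strings; the released files may spell them differently.
RELABEL = {"not_relevant_or_cant_judge": "not_humanitarian"}
DROP = {"dont_know_or_cant_judge"}


def prepare(events: Iterable[list[dict]]) -> list[dict]:
    """Merge per-event records, skip duplicates, relabel, and filter."""
    seen: set[tuple[str, str]] = set()
    merged: list[dict] = []
    for records in events:
        for rec in records:
            key = (rec["tweet_id"], rec["image_id"])
            if key in seen:
                # Stand-in for the manual duplicate check: keep the first copy.
                continue
            seen.add(key)
            label = RELABEL.get(rec["label"], rec["label"])
            if label in DROP:
                # Unjudgeable items are excluded from the classification data.
                continue
            merged.append({**rec, "label": label})
    return merged
```

Calling `prepare([irma_records, harvey_records])` on two hypothetical per-event lists returns a single list in which each (tweet, image) pair appears once and no record carries a dropped label.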
## [**Data distribution in CrisisMMD v2.0**](https://crisisnlp.qcri.org/data/crisismmd/CrisisMMD_v2.0.tar.gz)
In this version, the "Not relevant or can't judge" label has been mapped to "Not humanitarian" for the humanitarian task. The "Not informative" label from the informativeness task has also been mapped to "Not humanitarian" for the humanitarian task. Duplicate entries from different events have been removed.
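The cross-task mapping described above can be expressed as a small function. The sketch below assumes hypothetical label strings and a hypothetical function name `humanitarian_v2`; it is not taken from the authors' code. It also assumes that an item judged not informative may have no humanitarian label at all, hence the optional argument.

```python
from typing import Optional

# Hypothetical label strings used only for this illustration.
NOT_INFORMATIVE = "not_informative"
NOT_RELEVANT = "not_relevant_or_cant_judge"
NOT_HUMANITARIAN = "not_humanitarian"


def humanitarian_v2(informative: str, humanitarian: Optional[str]) -> Optional[str]:
    """Return the version 2.0 humanitarian label for one annotated item."""
    if informative == NOT_INFORMATIVE or humanitarian == NOT_RELEVANT:
        # Both source labels collapse into the single v2.0 class.
        return NOT_HUMANITARIAN
    return humanitarian
```

Under this mapping every "Not informative" item becomes "Not humanitarian", which agrees with the tables in this section: both classes contain 4,549 texts and 8,708 images.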
### Informativeness
| | Text | Image |
|---------------|--------|--------|
| Informative | 11,509 | 9,374 |
| Not informative | 4,549 | 8,708 |
| **Total** | 16,058 | 18,082 |
### Humanitarian
| | Text | Image |
|-------------------------------|--------|-------|
| Affected individuals | 472 | 562 |
| Infrastructure and utility damage | 1,210 | 3,624 |
| Injured or dead people | 486 | 110 |
| Missing or found people | 40 | 14 |
| Not humanitarian | 4,549 | 8,708 |
| Other relevant information | 5,954 | 2,529 |
| Rescue, volunteering, or donation effort | 3,293 | 2,231 |
| Vehicle damage | 54 | 304 |
| **Total** | 16,058 | 18,082 |
### Damage Severity
| | Text | Image |
|-----------------|------|-------|
| Little or no damage | - | 475 |
| Mild damage | - | 839 |
| Severe damage | - | 2,212 |
| **Total** | - | 3,526 |
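The damage severity classes are noticeably imbalanced. As a quick illustration, the following sketch computes each class's share of the labeled images from the counts in the table above; the names `DAMAGE_IMAGE_COUNTS` and `class_shares` are introduced here for the example only.

```python
# Image counts copied from the Damage Severity table above.
DAMAGE_IMAGE_COUNTS = {
    "Little or no damage": 475,
    "Mild damage": 839,
    "Severe damage": 2212,
}


def class_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each class's fraction of the total count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = class_shares(DAMAGE_IMAGE_COUNTS)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

"Severe damage" accounts for roughly 63% of the 3,526 labeled images, so anyone training on this task may want to account for the skew, for example by reporting per-class metrics.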
## Downloads (Alternate options)
- **CrisisMMD dataset version v2.0**: [Download labeled images and tweets (~1.8GB)](https://crisisnlp.qcri.org/data/crisismmd/CrisisMMD_v2.0.tar.gz)
- **Datasplit**: [Annotations Download](https://crisisnlp.qcri.org/data/crisismmd/crisismmd_datasplit_all.zip)
- **Datasplit for multimodal baseline with agreed labels**: [Annotations Download](https://crisisnlp.qcri.org/data/crisismmd/crisismmd_datasplit_agreed_label.zip)
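For readers who prefer to fetch the archive programmatically, here is a minimal sketch using only the Python standard library. The URL is the version 2.0 link listed above; the function names `extract_archive` and `fetch_and_extract` and the destination directory are assumptions made for this example. The archive is about 1.8GB, so the download is skipped when the file already exists locally.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the download list above.
V2_URL = "https://crisisnlp.qcri.org/data/crisismmd/CrisisMMD_v2.0.tar.gz"


def extract_archive(archive: Path, dest: Path) -> list[str]:
    """Unpack a .tar.gz archive into dest and return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


def fetch_and_extract(url: str, dest: Path) -> list[str]:
    """Download url into dest unless already present, then unpack it."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))
    return extract_archive(archive, dest)
```

Calling `fetch_and_extract(V2_URL, Path("data"))` would download the archive into a local `data` directory and unpack it there. The layout of the extracted files is not described in this card, so inspect the returned member names before relying on specific paths.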
## Citation
**Please cite the following papers if you use any of these resources in your research.**
1. [Ferda Ofli](https://sites.google.com/site/ferdaofli/), [Firoj Alam](https://firojalam.one/), and [Muhammad Imran](http://mimran.me/), [**Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response**](https://arxiv.org/abs/2004.11838), In Proceedings of the 17th International Conference on Information Systems for Crisis Response and Management (ISCRAM), 2020, USA.
2. [Firoj Alam](https://firojalam.one/), [Ferda Ofli](https://sites.google.com/site/ferdaofli/), and [Muhammad Imran](http://mimran.me/), [**CrisisMMD: Multimodal Twitter Datasets from Natural Disasters**](https://arxiv.org/pdf/1805.00713.pdf), In Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM), 2018, Stanford, California, USA.
```
@InProceedings{crisismmd2018icwsm,
author = {Alam, Firoj and Ofli, Ferda and Imran, Muhammad},
title = {{CrisisMMD}: Multimodal Twitter Datasets from Natural Disasters},
booktitle = {Proceedings of the 12th International AAAI Conference on Web and Social Media (ICWSM)},
year = {2018},
month = {June},
date = {23-28},
location = {USA}
}
@inproceedings{multimodalbaseline2020,
Author = {Ferda Ofli and Firoj Alam and Muhammad Imran},
Booktitle = {17th International Conference on Information Systems for Crisis Response and Management},
Keywords = {Multimodal deep learning, Multimedia content, Natural disasters, Crisis Computing, Social media},
Month = {May},
Organization = {ISCRAM},
Publisher = {ISCRAM},
Title = {Analysis of Social Media Data using Multimodal Deep Learning for Disaster Response},
Year = {2020}
}
```
Provider: maas
Created: 2025-06-17



