CrisisBench-english

ModelScope community | Updated: 2025-12-05 | Indexed: 2025-06-21
Download link:
https://modelscope.cn/datasets/QCRI/CrisisBench-english
Resource description:
# [CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing](https://ojs.aaai.org/index.php/ICWSM/article/view/18115/17918)

The crisis benchmark dataset consists of data from several different sources: CrisisLex ([CrisisLexT26](http://crisislex.org/data-collections.html#CrisisLexT26), [CrisisLexT6](http://crisislex.org/data-collections.html#CrisisLexT6)), [CrisisNLP](https://crisisnlp.qcri.org/lrec2016/lrec2016.html), [SWDM2013](http://mimran.me/papers/imran_shady_carlos_fernando_patrick_practical_2013.pdf), [ISCRAM13](http://mimran.me/papers/imran_shady_carlos_fernando_patrick_iscram2013.pdf), Disaster Response Data (DRD), [Disasters on Social Media (DSM)](https://data.world/crowdflower/disasters-on-social-media), [CrisisMMD](https://crisisnlp.qcri.org/crisismmd), and data from [AIDR](http://aidr.qcri.org/). The purpose of this work was to map the class labels across these sources, remove duplicates, and provide benchmark results for the community.

## Dataset

This is the English-language subset of the full CrisisBench dataset. For the full collection, see the [CrisisBench Collection](https://huggingface.co/collections/QCRI/crisisbench-672c4b82bcc344d504d775fc).

## Data format

Each JSON object contains the following fields:

* **id:** Tweet ID assigned by Twitter.
* **event:** Event name associated with the respective dataset.
* **source:** Source of the dataset.
* **text:** Tweet text.
* **lang:** Language tag obtained either from Twitter or from the Google Language Detection API.
* **lang_conf:** Confidence score obtained from the Google Language Detection API. In some cases the value is "NA", indicating that the language tag was obtained from Twitter rather than from the API.
* **class_label:** Class label assigned to the tweet text.
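As a rough illustration of how records with the fields listed above might be read, the sketch below parses a few JSON objects and converts `lang_conf` to a number, treating "NA" as missing. It assumes one JSON object per line and a `lang_conf` stored as a string or number; neither is stated in the dataset card. The helper names (`parse_lang_conf`, `iter_records`) are hypothetical, and the two sample records are invented for the example rather than taken from the dataset.

```python
import json
from typing import Iterator, Optional

# Field names taken from the "Data format" list in the dataset card.
EXPECTED_FIELDS = ("id", "event", "source", "text", "lang", "lang_conf", "class_label")


def parse_lang_conf(value) -> Optional[float]:
    """Return the confidence as a float, or None when it is "NA" or missing."""
    if value is None or str(value).strip().upper() == "NA":
        return None
    return float(value)


def iter_records(lines) -> Iterator[dict]:
    """Yield one dict per non-empty line (assumption: one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [name for name in EXPECTED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
        record["lang_conf"] = parse_lang_conf(record["lang_conf"])
        yield record


# Two made-up records for illustration only; they are not taken from the dataset.
SAMPLE_LINES = [
    '{"id": "1", "event": "example_event", "source": "example_source", '
    '"text": "Bridge closed due to flooding", "lang": "en", '
    '"lang_conf": "0.98", "class_label": "example_label"}',
    '{"id": "2", "event": "example_event", "source": "example_source", '
    '"text": "Thinking of everyone affected", "lang": "en", '
    '"lang_conf": "NA", "class_label": "example_label"}',
]

records = list(iter_records(SAMPLE_LINES))
for record in records:
    print(record["id"], record["lang_conf"], record["class_label"])
```

To read an actual file, an open file handle can be passed to `iter_records` in place of `SAMPLE_LINES`, provided the file really does hold one object per line; a file containing a single JSON array would need `json.load` instead.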
## Downloads (Alternate Links)

Labeled data and other resources:

- **Crisis dataset version v1.0:** [Download](https://crisisnlp.qcri.org/data/crisis_datasets_benchmarks/crisis_datasets_benchmarks_v1.0.tar.gz)
- **Alternate download link:** [Dataverse](https://doi.org/10.7910/DVN/G98BQG)

## Experimental Scripts

Source code to run the experiments is available at [https://github.com/firojalam/crisis_datasets_benchmarks](https://github.com/firojalam/crisis_datasets_benchmarks).

## License

This version of the dataset is distributed under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)**. The full license text can be found in the accompanying `licenses_by-nc-sa_4.0_legalcode.txt` file.

## Citation

If you use this data in your research, please consider citing the following paper:

[1] Firoj Alam, Hassan Sajjad, Muhammad Imran and Ferda Ofli, CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing, In ICWSM, 2021. [Paper](https://ojs.aaai.org/index.php/ICWSM/article/view/18115/17918)

```
@inproceedings{firojalamcrisisbenchmark2020,
  Author = {Firoj Alam, Hassan Sajjad, Muhammad Imran, Ferda Ofli},
  Keywords = {Social Media, Crisis Computing, Tweet Text Classification, Disaster Response},
  Booktitle = {15th International Conference on Web and Social Media (ICWSM)},
  Title = {{CrisisBench:} Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing},
  Year = {2021}
}
```

and the following associated papers:

* Muhammad Imran, Prasenjit Mitra, Carlos Castillo. Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), 2016, Slovenia.
* A. Olteanu, S. Vieweg, C. Castillo. 2015. What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In Proceedings of the ACM 2015 Conference on Computer Supported Cooperative Work and Social Computing (CSCW '15). ACM, Vancouver, BC, Canada.
* A. Olteanu, C. Castillo, F. Diaz, S. Vieweg. 2014. CrisisLex: A Lexicon for Collecting and Filtering Microblogged Communications in Crises. In Proceedings of the AAAI Conference on Weblogs and Social Media (ICWSM'14). AAAI Press, Ann Arbor, MI, USA.
* Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier. Practical Extraction of Disaster-Relevant Information from Social Media. In Social Web for Disaster Management (SWDM'13) - Co-located with WWW, May 2013, Rio de Janeiro, Brazil.
* Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier. Extracting Information Nuggets from Disaster-Related Messages in Social Media. In Proceedings of the 10th International Conference on Information Systems for Crisis Response and Management (ISCRAM), May 2013, Baden-Baden, Germany.

```
@inproceedings{imran2016lrec,
  author = {Muhammad Imran and Prasenjit Mitra and Carlos Castillo},
  title = {Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages},
  booktitle = {Proc. of the LREC, 2016},
  year = {2016},
  month = {5},
  publisher = {ELRA},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
}

@inproceedings{olteanu2015expect,
  title={What to expect when the unexpected happens: Social media communications across crises},
  author={Olteanu, Alexandra and Vieweg, Sarah and Castillo, Carlos},
  booktitle={Proc. of the 18th ACM Conference on Computer Supported Cooperative Work \& Social Computing},
  pages={994--1009},
  year={2015},
  organization={ACM}
}

@inproceedings{olteanu2014crisislex,
  title={CrisisLex: A Lexicon for Collecting and Filtering Microblogged Communications in Crises.},
  author={Olteanu, Alexandra and Castillo, Carlos and Diaz, Fernando and Vieweg, Sarah},
  booktitle = "Proc. of the 8th ICWSM, 2014",
  publisher = "AAAI press",
  year={2014}
}

@inproceedings{imran2013practical,
  title={Practical extraction of disaster-relevant information from social media},
  author={Imran, Muhammad and Elbassuoni, Shady and Castillo, Carlos and Diaz, Fernando and Meier, Patrick},
  booktitle={Proc. of the 22nd WWW},
  pages={1021--1024},
  year={2013},
  organization={ACM}
}

@inproceedings{imran2013extracting,
  title={Extracting information nuggets from disaster-related messages in social media},
  author={Imran, Muhammad and Elbassuoni, Shady Mamoon and Castillo, Carlos and Diaz, Fernando and Meier, Patrick},
  booktitle={Proc. of the 12th ISCRAM},
  year={2013}
}
```

Provider:
maas
Created:
2025-06-17