CrisisBench-all-lang

ModelScope community · Updated 2025-12-05 · Indexed 2025-06-21
Download link:
https://modelscope.cn/datasets/QCRI/CrisisBench-all-lang
Resource description:
# [CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing](https://ojs.aaai.org/index.php/ICWSM/article/view/18115/17918)

The crisis benchmark dataset consists of data from several sources: CrisisLex ([CrisisLexT26](http://crisislex.org/data-collections.html#CrisisLexT26), [CrisisLexT6](http://crisislex.org/data-collections.html#CrisisLexT6)), [CrisisNLP](https://crisisnlp.qcri.org/lrec2016/lrec2016.html), [SWDM2013](http://mimran.me/papers/imran_shady_carlos_fernando_patrick_practical_2013.pdf), [ISCRAM13](http://mimran.me/papers/imran_shady_carlos_fernando_patrick_iscram2013.pdf), Disaster Response Data (DRD), [Disasters on Social Media (DSM)](https://data.world/crowdflower/disasters-on-social-media), [CrisisMMD](https://crisisnlp.qcri.org/crisismmd), and data from [AIDR](http://aidr.qcri.org/). The purpose of this work was to map the class labels, remove duplicates, and provide benchmark results for the community.

## Dataset

This is the multi-language set of the full CrisisBench dataset. See also the [CrisisBench Collection](https://huggingface.co/collections/QCRI/crisisbench-672c4b82bcc344d504d775fc).

## Data format

Each JSON object contains the following fields:

* **id:** Tweet ID from Twitter.
* **event:** Event name associated with the respective dataset.
* **source:** Source of the dataset.
* **text:** Tweet text.
* **lang:** Language tag obtained either from Twitter or from the Google Language Detection API.
* **lang_conf:** Confidence score obtained from the Google Language Detection API. A value of "NA" indicates that the language tag came from Twitter rather than the API.
* **class_label:** Class label assigned to the tweet text.
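As an illustration of the field list above, the following is a minimal sketch of reading such records in Python. It assumes a JSON Lines layout (one object per line), which the description does not confirm, and it accepts `lang_conf` as either a number or the string "NA". The two sample records, including their event, source, and label values, are made up for the example and are not taken from the dataset.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, Optional

# Hypothetical records that follow the documented field names.
# All values here are invented placeholders, not real dataset rows.
SAMPLE_LINES = [
    '{"id": "1", "event": "example_event", "source": "example_source", '
    '"text": "Bridge closed after flooding", "lang": "en", '
    '"lang_conf": "NA", "class_label": "example_label_a"}',
    '{"id": "2", "event": "example_event", "source": "example_source", '
    '"text": "Se necesita agua potable", "lang": "es", '
    '"lang_conf": 0.97, "class_label": "example_label_b"}',
]


def parse_conf(value) -> Optional[float]:
    """Return lang_conf as a float, or None when it is missing or "NA"."""
    if value is None or str(value).strip().upper() == "NA":
        return None
    return float(value)


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per non-empty line and normalise lang_conf."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        record["lang_conf"] = parse_conf(record.get("lang_conf"))
        yield record


records = list(read_records(SAMPLE_LINES))

# Count tweets per language tag.
lang_counts = Counter(r["lang"] for r in records)

# IDs whose language tag came from Twitter (lang_conf was "NA").
from_twitter = [r["id"] for r in records if r["lang_conf"] is None]

print(lang_counts)
print(from_twitter)
```

To read a real file, pass an open file handle to `read_records` instead of `SAMPLE_LINES`. If the distributed files turn out to use a different layout (for example a single JSON array), `json.load` on the whole file would be the closer fit.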
## **Downloads (Alternate Links)**

Labeled data and other resources:

- **Crisis dataset version v1.0:** [Download](https://crisisnlp.qcri.org/data/crisis_datasets_benchmarks/crisis_datasets_benchmarks_v1.0.tar.gz)
- **Alternate download link:** [Dataverse](https://doi.org/10.7910/DVN/G98BQG)

## Experimental Scripts

Source code to run the experiments is available at [https://github.com/firojalam/crisis_datasets_benchmarks](https://github.com/firojalam/crisis_datasets_benchmarks).

## License

This version of the dataset is distributed under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)**. The full license text can be found in the accompanying `licenses_by-nc-sa_4.0_legalcode.txt` file.

## Citation

If you use this data in your research, please consider citing the following paper:

[1] Firoj Alam, Hassan Sajjad, Muhammad Imran and Ferda Ofli. CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing. In ICWSM, 2021. [Paper](https://ojs.aaai.org/index.php/ICWSM/article/view/18115/17918)

```
@inproceedings{firojalamcrisisbenchmark2020,
  Author = {Firoj Alam and Hassan Sajjad and Muhammad Imran and Ferda Ofli},
  Keywords = {Social Media, Crisis Computing, Tweet Text Classification, Disaster Response},
  Booktitle = {15th International Conference on Web and Social Media (ICWSM)},
  Title = {{CrisisBench:} Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing},
  Year = {2021}
}
```

and the following associated papers:

* Muhammad Imran, Prasenjit Mitra, Carlos Castillo. Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages. In Proceedings of the 10th Language Resources and Evaluation Conference (LREC), 2016, Slovenia.
* A. Olteanu, S. Vieweg, C. Castillo. 2015. What to Expect When the Unexpected Happens: Social Media Communications Across Crises. In Proceedings of the ACM 2015 Conference on Computer Supported Cooperative Work and Social Computing (CSCW '15). ACM, Vancouver, BC, Canada.
* A. Olteanu, C. Castillo, F. Diaz, S. Vieweg. 2014. CrisisLex: A Lexicon for Collecting and Filtering Microblogged Communications in Crises. In Proceedings of the AAAI Conference on Weblogs and Social Media (ICWSM'14). AAAI Press, Ann Arbor, MI, USA.
* Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier. Practical Extraction of Disaster-Relevant Information from Social Media. In Social Web for Disaster Management (SWDM'13), co-located with WWW, May 2013, Rio de Janeiro, Brazil.
* Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz and Patrick Meier. Extracting Information Nuggets from Disaster-Related Messages in Social Media. In Proceedings of the 10th International Conference on Information Systems for Crisis Response and Management (ISCRAM), May 2013, Baden-Baden, Germany.

```
@inproceedings{imran2016lrec,
  author = {Muhammad Imran and Prasenjit Mitra and Carlos Castillo},
  title = {Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages},
  booktitle = {Proc. of the LREC, 2016},
  year = {2016},
  month = {5},
  publisher = {ELRA},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
}

@inproceedings{olteanu2015expect,
  title = {What to expect when the unexpected happens: Social media communications across crises},
  author = {Olteanu, Alexandra and Vieweg, Sarah and Castillo, Carlos},
  booktitle = {Proc. of the 18th ACM Conference on Computer Supported Cooperative Work \& Social Computing},
  pages = {994--1009},
  year = {2015},
  organization = {ACM}
}

@inproceedings{olteanu2014crisislex,
  title = {CrisisLex: A Lexicon for Collecting and Filtering Microblogged Communications in Crises.},
  author = {Olteanu, Alexandra and Castillo, Carlos and Diaz, Fernando and Vieweg, Sarah},
  booktitle = "Proc. of the 8th ICWSM, 2014",
  publisher = "AAAI press",
  year = {2014}
}

@inproceedings{imran2013practical,
  title = {Practical extraction of disaster-relevant information from social media},
  author = {Imran, Muhammad and Elbassuoni, Shady and Castillo, Carlos and Diaz, Fernando and Meier, Patrick},
  booktitle = {Proc. of the 22nd WWW},
  pages = {1021--1024},
  year = {2013},
  organization = {ACM}
}

@inproceedings{imran2013extracting,
  title = {Extracting information nuggets from disaster-related messages in social media},
  author = {Imran, Muhammad and Elbassuoni, Shady Mamoon and Castillo, Carlos and Diaz, Fernando and Meier, Patrick},
  booktitle = {Proc. of the 10th ISCRAM},
  year = {2013}
}
```

Provider: maas
Created: 2025-06-17