
Humanitarian Assistance and Disaster Relief (HA/DR) Articles and Lexicon

NIAID Data Ecosystem, indexed 2026-03-10
Download link:
https://doi.org/10.7910/DVN/TGOPRU
Resource description:
ReliefWeb HA/DR Article Corpus

This corpus consists of ~504K newswire articles harvested from ReliefWeb.int, an aggregator of HA/DR news articles and analysis sponsored by the United Nations Office for the Coordination of Humanitarian Affairs (OCHA). The corpus contains over 300M words in total. Documents are primarily in English (85%), with some in French (9%) and Spanish (6%). The documents are natively annotated for disaster type and 'theme'; see the ReliefWeb Taxonomy for descriptions of each. Approximately 28% of articles are marked for one or more disaster types and a disaster name (e.g., "Myanmar: Tropical Cyclone Nargis - May 2008"), and just under half (45%) are annotated for a theme.

Data Citation

The corpus and lexicon were constructed by Leidos Corp. under funding from the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O), program: Low Resource Languages for Emergent Incidents (LORELEI), issued by DARPA/I2O under Contract No. HR0011-15-C-0114. The data was originally distributed privately to performers within that program.

Any usage of the dataset should cite the following paper describing its construction: Littell, P., Tian, T., Xu, R. et al. (2018) The ARIEL-CMU situation frame detection pipeline for LoReHLT16: a model translation approach. Machine Translation 32: 105. https://doi.org/10.1007/s10590-017-9205-3.

The data was originally released upon publication of the following paper: Gallagher, Ryan J., et al. "Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge." Transactions of the Association for Computational Linguistics (2017).

Corpus Format

The data are in JSON format; each article consists of the following fields.

* `id`: A unique id.
* `title`: The original article title.
* `text`: The body text of the article.
* `date_created`: Date created on ReliefWeb (ISO 8601).
* `country_name`: The primary country of the disaster event (i.e., the country in which the event occurred).
* `country_location`: The geographical coordinates of the affected country.
* `disaster_name`: A list of short descriptions of the disaster events described in the article.
* `disaster_type`: A list of types, according to the ReliefWeb taxonomy.
* `glide`: A list of disaster event GLIDE numbers.
* `theme`: A list of relief topics, according to the ReliefWeb taxonomy.
* `source_name`: The name of the original publishing organization.
* `source_type`: The organization type of the publisher (e.g., media, government, NGO).
* `href`: The ReliefWeb API URL of the article.

HA/DR Topic Lexicon

This lexicon contains ~34K English-language terms (words and multi-word expressions) semantically relevant to the HA/DR topic taxonomy devised by DARPA and the LORELEI evaluation team. The lexicon is intended to support lexical transfer from high-resource languages (e.g., English) to low-resource languages, particularly for topic modeling and elicitation of domain-specific translations.

Format

The lexicon is formatted as a single JSON file. The entries are sorted by topic, relevance, frequency, and distance. Each entry contains the following fields.

* `topic`: The HA/DR topic to which the term belongs, e.g., Violent Civil Unrest, Water, etc.
* `term`: A word or multi-word expression related to the topic.
* `seed`: Boolean; whether or not the term was originally identified by an HA/DR expert as highly relevant to the topic.
* `len`: The length of the term, in words.
* `dist`: The cosine distance of the term to the topic, averaged over five vector space models.
* `relevance`: The three-auditor average of the term's relevance to the topic on a 5-point Likert scale.
* `freq`: The frequency of the term in the ReliefWeb corpus.
* `example`: A sentence from the ReliefWeb corpus containing the term, if available. NB: While the sentence is likely to relate to the topic, it is not guaranteed to; it may only be generically HA/DR relevant.
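As a rough illustration of how the two record layouts might be used, the sketch below parses a few made-up records and selects articles by disaster type and lexicon terms by topic and relevance. The sample values, the function names `articles_with_disaster_type` and `top_terms`, the relevance threshold, and the assumption that each file parses to a JSON array of objects are all hypothetical; the description does not state the on-disk layout. Fields whose value format is not described (`country_location`, `href`) are left out of the samples.

```python
import json

# Made-up sample records that follow the field lists above; not real data.
ARTICLES_JSON = """[
  {"id": "a1", "title": "Cyclone update", "text": "Flooding continues.",
   "date_created": "2008-05-06T00:00:00+00:00", "country_name": "Myanmar",
   "disaster_name": ["Myanmar: Tropical Cyclone Nargis - May 2008"],
   "disaster_type": ["Tropical Cyclone"], "glide": [], "theme": ["Health"],
   "source_name": "Example Agency", "source_type": "NGO"},
  {"id": "a2", "title": "Funding report", "text": "Annual summary.",
   "date_created": "2010-01-15T00:00:00+00:00", "country_name": "Haiti",
   "disaster_name": [], "disaster_type": [], "glide": [], "theme": [],
   "source_name": "Example Ministry", "source_type": "Government"}
]"""

LEXICON_JSON = """[
  {"topic": "Water", "term": "clean water", "seed": true, "len": 2,
   "dist": 0.21, "relevance": 4.7, "freq": 1200, "example": null},
  {"topic": "Water", "term": "borehole", "seed": false, "len": 1,
   "dist": 0.35, "relevance": 4.0, "freq": 300, "example": null},
  {"topic": "Water", "term": "river", "seed": false, "len": 1,
   "dist": 0.52, "relevance": 2.3, "freq": 900, "example": null}
]"""


def articles_with_disaster_type(articles, disaster_type):
    """Return the articles whose `disaster_type` list contains the given type."""
    return [a for a in articles if disaster_type in a.get("disaster_type", [])]


def top_terms(lexicon, topic, min_relevance=3.0):
    """Return a topic's entries at or above a relevance threshold.

    The threshold and the ordering (highest relevance first, then smallest
    distance) are illustrative choices, not part of the dataset description.
    """
    hits = [e for e in lexicon
            if e["topic"] == topic and e["relevance"] >= min_relevance]
    return sorted(hits, key=lambda e: (-e["relevance"], e["dist"]))


articles = json.loads(ARTICLES_JSON)
lexicon = json.loads(LEXICON_JSON)

cyclone_ids = [a["id"] for a in articles_with_disaster_type(articles, "Tropical Cyclone")]
water_terms = [e["term"] for e in top_terms(lexicon, "Water")]

print(cyclone_ids)   # ['a1']
print(water_terms)   # ['clean water', 'borehole']
```

Because most articles carry no disaster type and many have no theme, list-valued fields are read with a default of an empty list rather than assumed to be present.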
Construction

The lexicon was developed with a semi-supervised extraction process:

1) A set of seed terms for each defined topic area was constructed manually with the input of an HA/DR domain expert. Additional terms from CrisisLex's CrisisLexRec and EMTerms lexicons were included in these sets. The seed lists typically contained 40 to 60 terms per topic.

2) For each set of seed terms, a set of candidate terms was generated with a set of word2vec models:
* A word2vec model trained on all HA/DR documents collected by ADRIEL for LORELEI.
* A word2vec model trained on over one billion English-language tweets available on the Internet Archive.
* The pre-trained Google News word2vec vectors.

3) Candidates were filtered to remove commonly occurring given names, surnames, and place names (taken from DBpedia), expanded with WordNet synonyms and hyponyms, and finally filtered according to their semantic distance from seed terms using an ensemble of the word2vec models above and the more traditional vector space models below:
* A singular-value decomposition of dependency path features...
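The final distance-based filtering step can be pictured with a small sketch. The description does not say how the distance from a candidate to the seed terms was computed, so everything here is an assumption made for illustration: each model is a plain dictionary from term to vector, a candidate is compared against the centroid of the seed vectors in each model, the per-model cosine distances are averaged, and candidates above a hypothetical `max_dist` threshold are dropped. The toy vectors are invented.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def ensemble_distance(term, seeds, models):
    """Average, over the models that contain the term, of its cosine
    distance to the centroid of the seed vectors (an assumed definition)."""
    distances = []
    for vectors in models:
        seed_vecs = [vectors[s] for s in seeds if s in vectors]
        if term not in vectors or not seed_vecs:
            continue  # this model cannot score the term
        distances.append(cosine_distance(vectors[term], centroid(seed_vecs)))
    return sum(distances) / len(distances) if distances else None


def filter_candidates(candidates, seeds, models, max_dist):
    """Keep candidates within max_dist of the seeds, closest first."""
    kept = []
    for term in candidates:
        d = ensemble_distance(term, seeds, models)
        if d is not None and d <= max_dist:
            kept.append((term, d))
    return sorted(kept, key=lambda pair: pair[1])


# Two toy "models" with different dimensionality; the vectors are invented.
model_a = {"water": [1.0, 0.0], "well": [0.9, 0.1],
           "borehole": [0.8, 0.2], "protest": [0.0, 1.0]}
model_b = {"water": [0.0, 1.0, 0.0], "well": [0.1, 0.9, 0.0],
           "borehole": [0.2, 0.9, 0.1], "protest": [1.0, 0.0, 0.0]}

kept = filter_candidates(["borehole", "protest"], ["water", "well"],
                         [model_a, model_b], max_dist=0.3)
print([term for term, _ in kept])   # ['borehole']
```

Scoring each model separately and averaging afterwards lets models with different vector dimensions take part in the same ensemble, which matches the lexicon's `dist` field being an average over several vector space models.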
Created:
2018-08-28