Humanitarian Assistance and Disaster Relief (HA/DR) Articles and Lexicon
Source: DataONE. Updated 2018-08-28. Indexed 2024-06-08.
Download link:
https://search.dataone.org/view/sha256:06ebf39beeb36c24937442f95a9c9ae07282654d810143ccf440baad64405e44
Resource description:
ReliefWeb HA/DR Article Corpus

This corpus consists of ~504K newswire texts harvested from ReliefWeb.int, an aggregator of HA/DR news articles and analysis sponsored by the United Nations Office for the Coordination of Humanitarian Affairs (OCHA). The corpus contains over 300M words in total, with documents primarily in English (85%), plus some French (9%) and Spanish (6%). The documents are natively annotated for disaster type and "theme"; see the ReliefWeb Taxonomy for descriptions of each. Approximately 28% of articles are marked for one or more disaster types and a disaster name (e.g., "Myanmar: Tropical Cyclone Nargis - May 2008"), and just under half (45%) are annotated for a theme.

Data Citation

The corpus and lexicon were constructed by Leidos Corp. under funding from the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O), program: Low Resource Languages for Emergent Incidents (LORELEI), issued by DARPA/I2O under Contract No. HR0011-15-C-0114. The data was originally distributed privately to performers within that program. Any usage of the dataset should cite the following paper describing its construction: Littell, P., Tian, T., Xu, R. et al. (2018) The ARIEL-CMU situation frame detection pipeline for LoReHLT16: a model translation approach. Machine Translation 32: 105. https://doi.org/10.1007/s10590-017-9205-3. The data was originally released upon publication of the following paper: Gallagher, Ryan J., et al. "Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge." Transactions of the Association for Computational Linguistics (2017).

Corpus Format

The data are in JSON format; each article consists of the following fields.

`id`: A unique id.
`title`: The original article title.
`text`: The body text of the article.
`date_created`: Date created on ReliefWeb (ISO 8601).
`country_name`: The primary country of the disaster event (i.e., the country in which the event occurred).
`country_location`: The geographical coordinates of the affected country.
`disaster_name`: A list of short descriptions of the disaster events described in the article.
`disaster_type`: A list of types, according to the ReliefWeb taxonomy.
`glide`: A list of disaster event GLIDE numbers.
`theme`: A list of relief topics, according to the ReliefWeb taxonomy.
`source_name`: The name of the original publishing organization.
`source_type`: The organization type of the publisher (e.g., media, government, NGO).
`href`: ReliefWeb API URL of the article.

HA/DR Topic Lexicon

This lexicon contains ~34K English-language terms (words and multi-word expressions) semantically relevant to the HA/DR topic taxonomy devised by DARPA and the LORELEI evaluation team. The lexicon is intended to support lexical transfer from high-resource languages (e.g., English) to low-resource languages, particularly for topic modeling and elicitation of domain-specific translations.

Format

The lexicon is formatted as a single JSON file. The entries are sorted by topic, relevance, frequency, and distance. Each entry contains the following fields.

`topic`: The HA/DR topic to which the term belongs, e.g., Violent Civil Unrest, Water, etc.
`term`: A word or multi-word expression related to the topic.
`seed`: Boolean; whether or not the term was originally identified by an HA/DR expert as highly relevant to the topic.
`len`: The length of the term, in words.
`dist`: The cosine distance of the term to the topic, averaged over five vector space models.
`relevance`: The three-auditor average of the term's relevance to the topic on a 5-point Likert scale.
`freq`: The frequency of the term in the ReliefWeb corpus.
`example`: A sentence from the ReliefWeb corpus containing the term, if available. NB: While the sentence is likely to relate to the topic, it is not guaranteed to; it may only be generically HA/DR relevant.
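As an illustration of the lexicon fields described above, here is a minimal Python sketch that selects the terms for one topic. Several things in it are assumptions rather than facts about the released file: that the top level of the JSON is a list of entry objects, that a missing `example` is stored as `null`, and the sort directions (the description names the sort keys but not their order). The sample values and the helper name `terms_for_topic` are invented for illustration.

```python
import json

# Toy entries in the documented field layout. All values are invented,
# and a top-level JSON list of objects is an assumption.
SAMPLE = json.loads("""
[
  {"topic": "Water", "term": "drinking water", "seed": true, "len": 2,
   "dist": 0.21, "relevance": 4.7, "freq": 1200, "example": null},
  {"topic": "Water", "term": "borehole", "seed": false, "len": 1,
   "dist": 0.35, "relevance": 4.3, "freq": 310, "example": null},
  {"topic": "Sanitation", "term": "sewage", "seed": true, "len": 1,
   "dist": 0.18, "relevance": 5.0, "freq": 450, "example": null}
]
""")


def terms_for_topic(entries, topic, min_relevance=4.0, seeds_only=False):
    """Return the terms of one topic that meet a relevance floor."""
    selected = [
        e for e in entries
        if e["topic"] == topic
        and e["relevance"] >= min_relevance
        and (e["seed"] or not seeds_only)
    ]
    # Assumed ordering: higher relevance first, then higher frequency,
    # then smaller distance to the topic.
    selected.sort(key=lambda e: (-e["relevance"], -e["freq"], e["dist"]))
    return [e["term"] for e in selected]


print(terms_for_topic(SAMPLE, "Water"))
```

With the real file, `SAMPLE` would be replaced by the result of `json.load` on the lexicon, provided its layout matches the assumptions above.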
Construction

The lexicon was developed with a semi-supervised extraction process:

1) A set of seed terms for each defined topic area was constructed manually with the input of an HA/DR domain expert. Additional terms from CrisisLex's CrisisLexRec and EMTerms lexicons were included in these sets. The seed lists typically contained between 40 and 60 terms per topic.
2) For each set of seed terms, a set of candidate terms was generated with a set of word2vec models:
   * A word2vec model trained on all HA/DR documents collected by ADRIEL for LORELEI.
   * A word2vec model trained on over one billion English-language tweets available on the Internet Archive.
   * The pre-trained Google News word2vec vectors.
3) Candidates were filtered to remove commonly occurring given names, surnames, and place names (taken from DBpedia), expanded with WordNet synonyms and hyponyms, and finally filtered according to their semantic distance from seed terms using an ensemble of the word2vec models above and the more traditional vector space models below:
   * A singular-value decomposition of dependency path features constructed from the HA/DR documents with Stanford's CoreNLP dependency parser.
   * A latent semantic indexing model of an English-language thesaurus.
4) From a set of ~15K candidates per topic, 3K "semantically near" terms were selected in this manner for each topic.
5) Finally, a variety of low-level text filters were applied to remove, e.g., non-ASCII terms, terms of 3 or fewer characters, and terms with non-word punctuation.

Auditing

All extracted terms were audited with CrowdFlower.
Contributors were asked to rate each term's relevance to the topic on a five-point Likert scale. The extreme points on the scale were described as indicating a-contextual relevance (e.g., "sewage" is necessarily relevant to Sanitation without any additional context) or irrelevance (e.g., it is difficult to imagine how "bubblegum" would be relevant to Extreme Violence/Terrorism), and the mid-range as indicating contextual dependence (e.g., "water" can be relevant to a discussion of Energy in the context of hydroelectricity plants). Terms receiving an average relevance of 3.5 or lower were dropped from the final lexicon. Overall agreement among participants on the rating scale was 75%. Contributors were required to correctly label a set of 50 researcher-defined sample questions before participating in the auditing; contributors scoring less than 70% were not allowed to participate.
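The distance-based filtering and the audit rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used to build the lexicon: the description does not say how a topic is represented in each vector space, so the sketch assumes one topic vector per model (for example, a centroid of seed-term vectors), and all function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_distance(term_vectors, topic_vectors):
    """Average a term's distance to a topic over several models.

    Both arguments hold one vector per model, in the same model order.
    Representing a topic by a single vector per model is an assumption.
    """
    distances = [cosine_distance(t, s)
                 for t, s in zip(term_vectors, topic_vectors)]
    return sum(distances) / len(distances)


def keep_after_audit(ratings, threshold=3.5):
    """Keep a term only if its mean rating is strictly above the threshold."""
    return sum(ratings) / len(ratings) > threshold


def contributor_qualifies(num_correct, num_questions=50, min_score=0.70):
    """Contributors scoring below 70% on the sample questions are excluded."""
    return num_correct / num_questions >= min_score


# Toy usage: two models, a term aligned with the topic in one of them.
print(mean_distance([[1, 0], [1, 0]], [[1, 0], [0, 1]]))
print(keep_after_audit([4, 4, 3]), contributor_qualifies(34))
```

The strict inequality in `keep_after_audit` follows the statement that terms averaging 3.5 or lower were dropped; the `>=` in `contributor_qualifies` follows the statement that only contributors scoring less than 70% were excluded.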
Created:
2023-11-22



