LORELEI Akan Representative Language Pack

Name: LORELEI Akan Representative Language Pack
Creator: Linguistic Data Consortium
Published: 2021-01-19 15:54:20
License: No description available

DataCite Commons: updated 2021-01-19, indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2021T02

Resource description:

Introduction

LORELEI Akan Representative Language Pack consists of Akan monolingual text, Akan-English parallel text, annotations, supplemental resources and related software tools developed by LDC for the DARPA LORELEI program.

The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.

Data

Akan is spoken mainly in Ghana and Ivory Coast. Data was collected in the following genres: discussion forum, news, reference, social network, and weblogs. Both monolingual text collection and parallel text creation involved a combination of manual and automatic methods.

Data volumes are as follows:

- Over 3.3 million words of Akan monolingual text, all of which were translated into English
- 115,000 Akan words translated from English data
- Approximately 2,300 words are annotated for named entities, full entity including nominals and pronouns, entity linking, simple semantic annotation, and situation frame annotation, and approximately 2,000 words have morphological segmentation annotation

Lexical resources and software tools are also included in this release. The tools recreate original source data from the processed XML material, condition text data that users download from Twitter, apply sentence segmentation to raw text, and support named entity tagging.

Monolingual and parallel text are presented in XML with associated DTDs. Annotation data is presented as tab delimited files or XML. All text is UTF-8 encoded.

The knowledge base for entity linking annotation for this corpus and all LORELEI Representative Language and Incident Language Packs is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10).

Acknowledgement

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.

Samples

Please view the following samples:

- Akan LTF XML
- Akan PSM XML
- English PSM XML
- English LTF XML
- Sentence Alignment (XML)
- Simple Name Entity Annotation (XML)
- Full Name Entity Annotation (XML)
- Semantic Annotation (XML)

Updates

None at this time.

Copyright

Portions © 2002-2007, 2009-2010 Agence France Presse, © 2000 American Broadcasting Company, © 2000 Cable News Network LP, LLLP, © 2008 Central News Agency (Taiwan), © 1989 Dow Jones & Company, Inc., © 2008 Five Colleges, Incorporated, © 2005 Los Angeles Times - Washington Post News Service, Inc., © 2000 National Broadcasting Company, Inc., © 1999, 2005, 2006, 2010 New York Times, © 2017 NY State of Health, © 2000 Public Radio International, © 2003, 2005-2008, 2010 The Associated Press, © 2017 Toronto Community Housing Corporation, © 2011-2017 Watch Tower Bible and Tract Society of Pennsylvania, © 2003, 2005-2008 Xinhua News Agency, © 2021 Trustees of the University of Pennsylvania
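
Usage sketch (not part of the LDC release)

The description above states that annotation data is presented as tab delimited files and that all text is UTF-8 encoded. The following is a minimal sketch of how such a file might be read with Python's standard csv module. The column names (doc_id, mention_text, entity_type, start_char, end_char) and the sample rows are invented for illustration; the actual column layout is defined by the documentation shipped with the corpus and should be checked there.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: the column names and rows below are invented for
# illustration and are NOT taken from the corpus documentation.
SAMPLE_TSV = (
    "doc_id\tmention_text\tentity_type\tstart_char\tend_char\n"
    "DOC_0001\tAccra\tGPE\t10\t14\n"
    "DOC_0001\tKofi\tPER\t30\t33\n"
    "DOC_0002\tGhana\tGPE\t5\t9\n"
)


def read_annotations(stream):
    """Parse tab-delimited rows into dictionaries keyed by header name."""
    # QUOTE_NONE keeps quote characters in the text as literal characters.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def count_by_type(rows, column="entity_type"):
    """Count how many rows carry each value in the given column."""
    return Counter(row[column] for row in rows)


rows = read_annotations(io.StringIO(SAMPLE_TSV))
counts = count_by_type(rows)
print(counts)
```

For a file on disk, the same function could be called with a handle opened as open(path, encoding="utf-8", newline=""), which matches the stated UTF-8 encoding and leaves line-ending handling to the csv module.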

Provider:

Linguistic Data Consortium

Created:

2021-01-08
