LORELEI Hindi Representative Language Pack

Name: LORELEI Hindi Representative Language Pack
Creator: Linguistic Data Consortium
Published: 2025-09-10 20:27:06
License: No description available

DataCite Commons: updated 2025-09-10, indexed 2026-05-03

Download link:

https://catalog.ldc.upenn.edu/LDC2025T12

Resource description:

<h3>Introduction</h3> <p>LORELEI Hindi Representative Language Pack (LDC2025T12) consists of Hindi monolingual text, Hindi-English parallel text, annotations, supplemental resources and related software tools developed by the Linguistic Data Consortium for the DARPA LORELEI program.</p> <p>The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.</p> <h3>Data</h3> <p>Hindi is spoken mainly in India, where it is an official language, and also by communities in South Africa, the United Arab Emirates, Fiji, Mauritius, Suriname, Nepal, the United Kingdom, and the United States. Data was collected in the following genres: discussion forum, news, reference, social network, and weblogs. 
Both monolingual text collection and parallel text creation involved a combination of manual and automatic methods.</p> <p>Data volumes are as follows:</p> <ul> <li>26 million words of Hindi monolingual text, 363,000 words of which were translated into English</li> <li>1.07 million words of found Hindi-English parallel text</li> <li>118,000 Hindi words translated from English data</li> </ul> <p>Approximately 103,000 words were annotated for simple named entities; over 25,000 words were annotated for full entity (including nominals and pronouns), entity linking, simple semantics, and situation frames.</p> <p>Lexical resources and software tools are also included in this release. The tools recreate original source data from the processed XML material, condition text data that users download from Twitter/X, apply sentence segmentation to raw text, and support named entity tagging.</p> <p>Monolingual and parallel text are presented in XML with associated DTDs. Annotation data is presented as tab-delimited files or XML. All text is UTF-8 encoded.</p> <p>The knowledge base for entity linking annotation for this corpus and all LORELEI Representative Language and Incident Language Packs is available separately as <a href="../../../LDC2020T10">LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10)</a>.</p> <h3>Sponsorship</h3> <p>This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.</p>
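
The description above says annotation data is delivered as tab-delimited, UTF-8 encoded files. As a rough illustration of reading such a file, the sketch below parses tab-separated rows and counts entity types. The column layout, the sample rows, and the function names are all assumptions made for this example; none of them come from the LDC2025T12 documentation, so the release's own README should be treated as the authority on the real format.

```python
import csv
import io
from collections import Counter

# ASSUMPTION: this column layout is hypothetical and is not taken from the
# LDC2025T12 documentation; consult the release's README for the real one.
HYPOTHETICAL_COLUMNS = ["doc_id", "mention_text", "start", "end", "entity_type"]

# Made-up rows standing in for a tab-delimited annotation file.
SAMPLE_TSV = (
    "doc_001\tदिल्ली\t0\t5\tGPE\n"
    "doc_001\tभारत\t10\t13\tGPE\n"
    "doc_002\tगांधी\t4\t8\tPER\n"
)


def read_annotations(stream):
    """Parse tab-delimited rows into dicts, converting offsets to int."""
    reader = csv.DictReader(
        stream,
        fieldnames=HYPOTHETICAL_COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in text as literals
    )
    rows = []
    for row in reader:
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


def count_entity_types(rows):
    """Return a Counter mapping each entity type to its number of mentions."""
    return Counter(row["entity_type"] for row in rows)


rows = read_annotations(io.StringIO(SAMPLE_TSV))
counts = count_entity_types(rows)
for entity_type, n in sorted(counts.items()):
    print(entity_type, n)
```

For a file on disk, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`, which matches the UTF-8 encoding stated in the description.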

Provider:

Linguistic Data Consortium

Created:

2025-09-10
