Chinese <-> English Name Entity Lists v 1.0

Name: Chinese English Name Entity Lists v 1.0
Creator: Linguistic Data Consortium
Published: 2021-07-01 16:18:09
License: No description available

DataCite Commons; updated 2021-07-01; indexed 2025-04-16

Download link:

https://catalog.ldc.upenn.edu/LDC2005T34


Resource description:

<h3>Introduction</h3> This file contains documentation for Chinese English Name Entity Lists, v.1.0, Linguistic Data Consortium (LDC) catalog number LDC2005T34 and ISBN 1-58563-368-2. These Chinese-English bi-directional name entity lists are compiled from Xinhua News Agency newswire texts. Not every irregularity in the original source has been detected and normalized. Some Chinese characters are not encoded in the source, and brackets are used to describe their composition. Except for the person name lists, most such instances were left untouched in the created lists. An effort was made to replace GB-encoded characters (such as Roman numerals) in the English translations with ASCII characters. However, no attempt has been made to do the opposite for Chinese names. The use of slashes as delimiters presents another problem, because some names may have internal slashes. Initially, double quotes ("") were used to enclose a name with an internal slash to avoid confusion, without realizing that there is just one " in ASCII (as opposed to a pair of enclosing " in GB). Later it was decided to use &slash; instead. In future releases, some lists will be changed for greater consistency. Finally, most of the English names in the source are in lower case throughout. An effort was made to capitalize the initial letter (and possibly some middle ones) for person names, but not for any other kind of name, as most other names have multiple words, some of which may be articles or prepositions. The word "English" is somewhat misleading here. Although most of the foreign words are English or can appear in English texts, there are also many non-English words written in the Roman alphabet, some of which have English equivalents while others do not. No effort has been made to eliminate those non-English names where English equivalents are available. The entire set consists of nine pairs of lists. 
The English->Chinese version of each pair was created by reversing the Chinese->English version; both are sorted with the Unix built-in sort function. The contents are as follows: <ul> <li>Place Names, Chinese to English: 276,382</li> <li>Place Names, English to Chinese: 298,993</li> <li>Organization Names, Chinese to English: 30,800</li> <li>Organization Names, English to Chinese: 37,145</li> <li>Corporate Names, Chinese to English: 54,747</li> <li>Corporate Names, English to Chinese: 58,468</li> <li>Press Organization Names, Chinese to English: 29,757</li> <li>Press Organization Names, English to Chinese: 32,922</li> <li>Intl. Organization Names, Chinese to English: 7,040</li> <li>Intl. Organization Names, English to Chinese: 7,040</li> </ul> <h3>Samples</h3> For an example of the data in this publication, please view this <a href="desc/addenda/LDC2005T34.jpg" rel="nofollow">screen capture</a> of the corporate names list. <h3>Additional Licensing Instructions</h3> This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact <a href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a> for information about becoming a member. Portions © 2001 Xinhua News Agency, © 2002, 2005 Trustees of the University of Pennsylvania
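
The reversal and the &slash; escape described above can be illustrated with a minimal Python sketch. It rests on assumptions that the documentation does not confirm: each line is taken to have the layout headword, TAB, then slash-delimited translations (/t1/t2/), and the helper names parse_line, format_line, and reverse_entries are hypothetical. The sample lines are made up for illustration and are not taken from the corpus.

```python
from collections import defaultdict

SLASH_ESCAPE = "&slash;"  # escape for a literal slash, as named in the documentation


def parse_line(line):
    """Split an ASSUMED 'headword<TAB>/t1/t2/' line into (headword, [translations])."""
    head, _, rest = line.rstrip("\n").partition("\t")
    # Empty pieces come from the leading and trailing delimiter slashes.
    translations = [t.replace(SLASH_ESCAPE, "/") for t in rest.split("/") if t.strip()]
    return head.replace(SLASH_ESCAPE, "/"), translations


def format_line(head, translations):
    """Re-escape literal slashes and emit the same assumed layout."""
    def escape(text):
        return text.replace("/", SLASH_ESCAPE)

    return escape(head) + "\t/" + "/".join(escape(t) for t in translations) + "/"


def reverse_entries(lines):
    """Turn source->target lines into sorted target->source lines."""
    reversed_map = defaultdict(list)
    for line in lines:
        head, translations = parse_line(line)
        for translation in translations:
            if head not in reversed_map[translation]:
                reversed_map[translation].append(head)
    # sorted() orders by code point; the Unix sort used for the corpus
    # may order differently depending on locale and encoding.
    return [format_line(key, reversed_map[key]) for key in sorted(reversed_map)]


if __name__ == "__main__":
    sample = [
        "北京\t/Beijing/Peking/",            # illustrative entry, two translations
        "甲&slash;乙公司\t/A&slash;B Co./",   # illustrative entry with an escaped slash
    ]
    for entry in reverse_entries(sample):
        print(entry)
```

Under this assumed layout, a headword with several translations yields several entries after reversal, so the two directions of a pair need not contain the same number of entries.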

Provider:

Linguistic Data Consortium

Created:

2020-11-30

Collected summary


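Background overview

This dataset is a set of bidirectional Chinese-English named entity lists, version 1.0, published by the Linguistic Data Consortium (LDC) and compiled from Xinhua News Agency newswire texts. It contains nine pairs of lists covering place names, organization names, corporate names, press organization names, and international organization names. Each pair provides a Chinese-to-English and an English-to-Chinese mapping, with entry counts ranging from several thousand to several hundred thousand per list. The data retains inconsistencies from the original source (such as character encoding and delimiter problems) and is not fully normalized.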

The summary above was collected and generated by 遇见数据集.
