ParaNames
Source: arXiv. Updated 2022-07-13. Indexed 2024-06-21.
Download link:
https://github.com/bltlab/paranames
Description:
ParaNames is a large-scale multilingual corpus of entity names. It contains 118 million names in 400 languages for 13,669,694 entities, each mapped to a standardized entity type (PER, LOC, or ORG). The dataset was created at the Michtom School of Computer Science at Brandeis University and is intended for multilingual language processing tasks such as name translation and transliteration, named entity recognition, and entity linking. The names are extracted from Wikidata and then filtered and standardized by an automated preprocessing pipeline to improve data quality. ParaNames targets the problem of representing and translating entity names across languages, particularly for lower-resourced languages.
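As a rough illustration of how a resource organized this way might be used for name translation, the sketch below groups names by entity and extracts parallel name pairs for one language pair. The column names (`wikidata_id`, `language`, `label`, `type`), the helper functions, and the toy rows are assumptions made for this example; they are not the confirmed schema or tooling of the ParaNames release, so check the repository before adapting it.

```python
import csv
import io
from collections import defaultdict

# Toy rows for illustration only. The column names are assumptions,
# not the confirmed schema of the ParaNames release.
SAMPLE_TSV = (
    "wikidata_id\tlanguage\tlabel\ttype\n"
    "Q90\ten\tParis\tLOC\n"
    "Q90\tru\tПариж\tLOC\n"
    "Q90\tja\tパリ\tLOC\n"
    "Q937\ten\tAlbert Einstein\tPER\n"
    "Q937\tde\tAlbert Einstein\tPER\n"
)


def group_names(tsv_text):
    """Map each entity ID to its type and a {language: name} dict."""
    entities = defaultdict(lambda: {"type": None, "names": {}})
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        entity = entities[row["wikidata_id"]]
        entity["type"] = row["type"]
        entity["names"][row["language"]] = row["label"]
    return dict(entities)


def name_pairs(entities, src, tgt):
    """Collect (source, target) name pairs for entities named in both languages."""
    return [
        (e["names"][src], e["names"][tgt])
        for e in entities.values()
        if src in e["names"] and tgt in e["names"]
    ]


entities = group_names(SAMPLE_TSV)
pairs = name_pairs(entities, "en", "ru")
```

Pairs built this way could serve as training data for a transliteration model, while the `type` field allows filtering to a single entity type such as PER.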
Provided by:
Michtom School of Computer Science, Brandeis University
Created:
2022-03-01



