
imvladikon/paranames

Hugging Face · Updated 2023-01-13 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/imvladikon/paranames
Resource description:
<img src="data/paranames_banner.png"></img>

# ParaNames: A multilingual resource for parallel names

This repository contains releases for the ParaNames corpus, consisting of parallel names of over 12 million named entities in over 400 languages.

ParaNames was introduced in [Sälevä, J. and Lignos, C., 2022. ParaNames: A Massively Multilingual Entity Name Corpus. arXiv preprint arXiv:2202.14035](https://arxiv.org/abs/2202.14035).

Please cite as:

```
@article{saleva2022paranames,
  title={ParaNames: A Massively Multilingual Entity Name Corpus},
  author={S{\"a}lev{\"a}, Jonne and Lignos, Constantine},
  journal={arXiv preprint arXiv:2202.14035},
  year={2022}
}
```

See the [Releases page](https://github.com/bltlab/paranames/releases) for the downloadable release.

# Using the data release

## Release format

The corpus is released as a gzipped TSV file which is produced by the pipeline included in this repository.

## Release notes

### Repeated entities

In current releases, any entity that is associated with multiple named entity types (PER, LOC, ORG) in the Wikidata type hierarchy will appear multiple times in the output, once with each type. This affects less than 3% of the entities in the data.

If you want a unique set of entities, you should deduplicate the data using the `wikidata_id` field.

If you only want to use entities that are associated with a single named entity type, you should remove any `wikidata_id` that appears in multiple rows.

# Using the code

First, install the following non-Python dependencies:

- MongoDB
- [xsv](https://github.com/BurntSushi/xsv)
- ICU support for your computer (e.g. `libicu-dev`)

Next, install ParaNames and its Python dependencies by running `pip install -e .`.

It is recommended that you use a Conda environment for package management.

## Creating the ParaNames corpus

To create a corpus following our approach, follow the steps below:

1. Download the latest Wikidata dump from the [Wikimedia page](https://dumps.wikimedia.org/wikidatawiki/entities/) and extract it. Note that this may take up several TB of disk space.
2. Use `recipes/paranames_pipeline.sh`, which ingests the Wikidata JSON to MongoDB and then dumps and postprocesses it to our final TSV resource.

The call to `recipes/paranames_pipeline.sh` works as follows:

```
recipes/paranames_pipeline.sh <path_to_extracted_json_dump> <output_folder> <n_workers>
```

Set the number of workers based on the number of CPUs your machine has. By default, only 1 CPU is used.

The output folder will contain one subfolder per language, inside of which `paranames_<language_code>.tsv` can be found. The entire resource is located in `<output_folder>/combined/paranames.tsv`.

### Notes

ParaNames offers several options for customization:

- If your MongoDB instance uses a non-standard port, you should change the value of [`mongodb_port`](https://github.com/bltlab/paranames/blob/main/recipes/paranames_pipeline.sh#L13) accordingly inside `paranames_pipeline.sh`.
- Setting [`should_collapse_languages=yes`](https://github.com/bltlab/paranames/blob/main/recipes/dump.sh#L17) will cause Wikimedia language codes to be "collapsed" to the top-level Wikimedia language code, i.e. `kk-cyrl` will be converted to `kk`, `en-ca` to `en`, etc.
- Setting [`should_keep_intermediate_files=yes`](https://github.com/bltlab/paranames/blob/main/recipes/dump.sh#L18) will cause intermediate files to be kept rather than deleted. This includes the raw per-type TSV dumps (`{PER,LOC,ORG}.tsv`) from MongoDB, as well as outputs of `postprocess.py`.
- Within [`recipes/dump.sh`](https://github.com/bltlab/paranames/blob/main/recipes/dump.sh), it is also possible to define languages to be excluded and whether entity types should be disambiguated. By default, no languages are excluded and no disambiguation is done.
- After the pipeline completes, `<output_folder>` will contain one folder per language, inside of which is a TSV file containing the subset of names in that language. Combined TSVs with names in all languages are available in the `combined` folder.
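
The output layout described above can be sketched as a pair of path helpers. This is only an illustration: the function names are hypothetical, and it assumes each per-language subfolder is named after its language code, which the README implies but does not state outright.

```python
from pathlib import Path


def language_tsv_path(output_folder, language_code):
    """Path of the per-language TSV (assumes the subfolder is named after the code)."""
    return Path(output_folder) / language_code / f"paranames_{language_code}.tsv"


def combined_tsv_path(output_folder):
    """Path of the combined TSV that holds names in all languages."""
    return Path(output_folder) / "combined" / "paranames.tsv"


# "paranames_out" is a placeholder output folder, not a name used by the pipeline.
kk_path = language_tsv_path("paranames_out", "kk")
all_path = combined_tsv_path("paranames_out")
```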
Provider:
imvladikon
Summary of original information

Dataset overview

Dataset name

ParaNames: A multilingual resource for parallel names

Dataset description

Parallel names for more than 12 million named entities in more than 400 languages.

Dataset source

Introduced in Sälevä, J. and Lignos, C., 2022. ParaNames: A Massively Multilingual Entity Name Corpus. arXiv preprint arXiv:2202.14035

Citation

@article{saleva2022paranames, title={ParaNames: A Massively Multilingual Entity Name Corpus}, author={S{\"a}lev{\"a}, Jonne and Lignos, Constantine}, journal={arXiv preprint arXiv:2202.14035}, year={2022} }

Dataset format

Released as a gzipped TSV file, produced by the pipeline in this repository.
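
A minimal sketch of reading a gzipped TSV of this kind with the Python standard library. The helper name `read_tsv_gz` and the `label` column in the sample are illustrative assumptions; the only field the description names is `wikidata_id`, so check the header of the actual release before relying on any other column.

```python
import csv
import gzip
import io


def read_tsv_gz(path_or_file):
    """Yield each row of a gzipped TSV as a dict keyed by the header row."""
    with gzip.open(path_or_file, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters inside names as ordinary text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


# Tiny in-memory sample standing in for a release file (contents are made up).
sample = "wikidata_id\tlabel\nQ1\tUniverse\nQ2\tEarth\n"
buffer = io.BytesIO(gzip.compress(sample.encode("utf-8")))
rows = list(read_tsv_gz(buffer))
```

For a real release, pass the path of the downloaded `.tsv.gz` file instead of the in-memory buffer.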

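Usage notes

  • Repeated entities: in current releases, any entity associated with multiple named entity types (PER, LOC, ORG) appears multiple times in the output, once per type. This affects less than 3% of the entities in the dataset.
  • Deduplication: for a unique set of entities, deduplicate the data using the wikidata_id field.
  • Single-type entities: to use only entities associated with a single named entity type, remove any wikidata_id that appears in multiple rows.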
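
The two notes above can be sketched as follows, working on rows already loaded as dictionaries. This is a sketch under assumptions, not the authors' procedure: the column name `type` is hypothetical (only `wikidata_id` is named in the source), and the single-type filter counts distinct types per ID rather than raw rows, in case an entity has several rows for other reasons.

```python
from collections import defaultdict


def unique_entity_ids(rows, id_field="wikidata_id"):
    """Return each entity ID once, in first-seen order."""
    return list(dict.fromkeys(row[id_field] for row in rows))


def single_type_rows(rows, id_field="wikidata_id", type_field="type"):
    """Keep only rows whose entity ID is associated with exactly one type."""
    types_per_id = defaultdict(set)
    for row in rows:
        types_per_id[row[id_field]].add(row[type_field])
    return [row for row in rows if len(types_per_id[row[id_field]]) == 1]


# Made-up sample: Q2 is listed under two types, so it is ambiguous.
rows = [
    {"wikidata_id": "Q1", "type": "LOC"},
    {"wikidata_id": "Q2", "type": "LOC"},
    {"wikidata_id": "Q2", "type": "ORG"},
    {"wikidata_id": "Q3", "type": "PER"},
]
unique_ids = unique_entity_ids(rows)
unambiguous = single_type_rows(rows)
```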

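Dataset creation guide

  1. Download the latest Wikidata dump from the Wikimedia page and extract it.
  2. Process the data with the recipes/paranames_pipeline.sh script, which ingests the Wikidata JSON into MongoDB, then dumps and postprocesses it into the final TSV resource.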
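
As a rough sketch of step 2, the helper below assembles the pipeline call in the form given in the README (`recipes/paranames_pipeline.sh <path_to_extracted_json_dump> <output_folder> <n_workers>`) and picks the worker count from the machine's CPU count when none is given. The helper name and the example paths are hypothetical, and the sketch only builds the command; it does not run it.

```python
import os

PIPELINE_SCRIPT = "recipes/paranames_pipeline.sh"


def build_pipeline_command(dump_path, output_folder, n_workers=None):
    """Build the argument list for the pipeline script without running it."""
    if n_workers is None:
        # os.cpu_count() can return None, so fall back to a single worker.
        n_workers = os.cpu_count() or 1
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [PIPELINE_SCRIPT, str(dump_path), str(output_folder), str(n_workers)]


# Placeholder paths, chosen only for illustration.
command = build_pipeline_command("wikidata/dump.json", "paranames_out", n_workers=4)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` from the repository root once MongoDB and the other dependencies are installed.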

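Customization options

  • Custom MongoDB port.
  • Language code collapsing.
  • Keeping or deleting intermediate files.
  • Defining languages to exclude and whether entity types should be disambiguated.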
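
To illustrate what language code collapsing means, here is a small sketch that reduces a Wikimedia language code to its top-level code, matching the README's examples (`kk-cyrl` to `kk`, `en-ca` to `en`). It assumes the top-level code is everything before the first hyphen; the pipeline's own implementation may handle this differently, and the function name is hypothetical.

```python
def collapse_language_code(code: str) -> str:
    """Return the part of a language code before the first hyphen."""
    return code.split("-", 1)[0]


# Codes without a hyphen are returned unchanged.
collapsed = {code: collapse_language_code(code) for code in ["kk-cyrl", "en-ca", "fi"]}
```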
The content above was collected and summarized by 遇见数据集.