MultiNERD | Named entity recognition dataset | Natural language processing dataset
Dataset overview
Dataset name
- MultiNERD-NER
Dataset source
- Hugging Face
Dataset link
Dataset contents
- English subset: located in the dataset folder.
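MultiNERD is a multilingual dataset, so an English subset is typically obtained by filtering on a language field. The following is a minimal sketch of that step, assuming each record carries `tokens`, `ner_tags` and a `lang` code; these field names and the toy records are assumptions for illustration and are not taken from the repository.

```python
from typing import Dict, List

# Toy records; the field names ("tokens", "ner_tags", "lang") are assumed.
records: List[Dict] = [
    {"tokens": ["Paris", "is", "big"], "ner_tags": ["B-LOC", "O", "O"], "lang": "en"},
    {"tokens": ["Rom", "ist", "alt"], "ner_tags": ["B-LOC", "O", "O"], "lang": "de"},
    {"tokens": ["Ada", "codes"], "ner_tags": ["B-PER", "O"], "lang": "en"},
]


def filter_language(rows: List[Dict], lang: str = "en") -> List[Dict]:
    """Keep only the rows whose language code equals `lang`."""
    return [row for row in rows if row.get("lang") == lang]


english = filter_language(records)
print(len(english), "English sentences")
```

Filtering once and saving the result to disk keeps the notebooks independent of the full multilingual download.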
Pretrained model
- RoBERTa-base: pretrained model link
Dataset purpose
- Training and evaluating NER (named entity recognition) models.
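Training a subword model such as RoBERTa for NER usually requires aligning word-level tags with subword pieces. The sketch below shows one common convention: label the first piece of each word and mask the remaining pieces and special tokens with -100, the value PyTorch's cross-entropy loss ignores by default. The `word_ids` list is written by hand here and stands in for what a fast tokenizer would report; the repository's notebooks may handle this step differently.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto subword positions.

    `word_ids[i]` is the index of the word that subword i belongs to,
    or None for special tokens such as <s> and </s>.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)      # special token
        elif wid != previous:
            aligned.append(word_labels[wid])  # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)      # continuation piece
        previous = wid
    return aligned


# Hypothetical example: word 0 is split into two subword pieces.
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [1, 0, 5]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Masking continuation pieces means each word contributes exactly one term to the loss, which keeps training and word-level evaluation consistent.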
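Dataset release
- Uploaded to Kaggle for public use: Kaggle link
Related documents
- MultiNERD_NER__RISE.pdf: documents the research findings and limitations.
Environment
- Runtime: Jupyter Notebook, with a CUDA-capable GPU required.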
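Before opening the notebooks, it can help to confirm that a CUDA device is actually visible. This is a minimal sketch, assuming PyTorch is the framework in use (the page does not name one); it falls back to a plain message when `torch` is not installed.

```python
import importlib.util


def describe_device() -> str:
    """Return "cuda", "cpu", or "torch-missing" for the current environment."""
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"  # PyTorch is not installed here
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


status = describe_device()
print("Compute device:", status)
```

A result of "cpu" usually points to a CPU-only PyTorch build or a missing GPU driver.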
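Installation and usage
- Installation: clone the repository with Git, create and activate a virtual environment, and install the dependencies listed in requirements.txt.
- Usage: run the .ipynb notebooks in Jupyter Notebook or VS Code.
- finetuning.ipynb: fine-tunes the RoBERTa-base model.
- evalution.ipynb: evaluates the fine-tuned model on the test set.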
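NER models are commonly scored with entity-level precision, recall and F1, where a prediction counts only if both the span and the type match exactly. The sketch below illustrates that metric on BIO tags using only the standard library; it is an assumption that the evaluation notebook reports something comparable, and the tag sequences are made up for the example.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Extract entity spans from a BIO tag sequence (stray I- tags are ignored)."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged entity-level precision, recall and F1 (exact match)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = bio_to_spans(g), bio_to_spans(p)
        tp += len(gs & ps)
        fp += len(ps - gs)
        fn += len(gs - ps)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical sentence: the person span is correct, the location is mistyped.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, f1)
```

Exact-match scoring is stricter than token accuracy: the mistyped location above counts as both a false positive and a false negative.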

- The MultiNERD dataset was first published by Simone Tedeschi and Roberto Navigli to address multilingual and multi-domain named entity recognition.
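- The MultiNERD dataset was first applied in natural language processing research, where it showed particular strengths on cross-lingual named entity recognition tasks.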
- Multi-Domain Named Entity Recognition with Genre-Aware and Agnostic Inference. University of Copenhagen, 2022