peoples-daily-ner/peoples_daily_ner | Chinese text processing dataset | Named entity recognition dataset

hugging_face · updated 2024-01-18 · indexed 2024-06-15

Chinese text processing

Named entity recognition

Download link:

https://hf-mirror.com/datasets/peoples-daily-ner/peoples_daily_ner

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- zh
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: People's Daily NER
dataset_info:
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B-PER
          '2': I-PER
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
  config_name: peoples_daily_ner
  splits:
  - name: train
    num_bytes: 14972456
    num_examples: 20865
  - name: validation
    num_bytes: 1676741
    num_examples: 2319
  - name: test
    num_bytes: 3346975
    num_examples: 4637
  download_size: 8385672
  dataset_size: 19996172
---

# Dataset Card for People's Daily NER

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Github](https://github.com/OYE93/Chinese-NLP-Corpus/tree/master/NER/People's%20Daily)
- **Repository:** [Github](https://github.com/OYE93/Chinese-NLP-Corpus/)
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

[More Information Needed]

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

No citation available for this dataset.

### Contributions

Thanks to [@JetRunner](https://github.com/JetRunner) for adding this dataset.

Provided by:

peoples-daily-ner

Summary of original information

Dataset card for People's Daily NER

Dataset description

Dataset overview

annotations_creators: expert-generated
language_creators: found
language: zh
license: unknown
multilinguality: monolingual
size_categories: 10K<n<100K
source_datasets: original
task_categories: token-classification
task_ids: named-entity-recognition
pretty_name: People's Daily NER

Dataset structure

Data fields

id: string
tokens: sequence of string
ner_tags: sequence of class_label
- names:
  - 0: O
  - 1: B-PER
  - 2: I-PER
  - 3: B-ORG
  - 4: I-ORG
  - 5: B-LOC
  - 6: I-LOC
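
The integer ids stored in `ner_tags` can be turned back into tag strings with a plain lookup. The following is a minimal sketch, assuming the id order listed above; `LABEL_NAMES`, `LABEL_TO_ID`, `ids_to_tags`, and the sample row are illustrative names and values, not part of the dataset's own tooling.

```python
# Label names in id order, copied from the class_label list above.
LABEL_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Reverse lookup for converting tag strings back to ids.
LABEL_TO_ID = {name: idx for idx, name in enumerate(LABEL_NAMES)}


def ids_to_tags(tag_ids):
    """Map a sequence of integer ner_tags to their string labels."""
    return [LABEL_NAMES[i] for i in tag_ids]


# Hypothetical ner_tags row, made up for illustration only.
sample_ids = [5, 6, 0, 1, 2]
print(ids_to_tags(sample_ids))
```

Keeping the list in id order means the position of each name doubles as its id, so no separate mapping table has to be maintained.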

Data splits

train:
- num_bytes: 14972456
- num_examples: 20865
validation:
- num_bytes: 1676741
- num_examples: 2319
test:
- num_bytes: 3346975
- num_examples: 4637

Dataset size

download_size: 8385672
dataset_size: 19996172
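
As a quick sanity check on the figures above, the per-split byte counts can be summed and compared against `dataset_size`, and the example counts turned into proportions. This is a small sketch using only the numbers listed here; the variable names are illustrative.

```python
# Split statistics copied from the metadata above.
SPLITS = {
    "train": {"num_bytes": 14972456, "num_examples": 20865},
    "validation": {"num_bytes": 1676741, "num_examples": 2319},
    "test": {"num_bytes": 3346975, "num_examples": 4637},
}
DATASET_SIZE = 19996172  # dataset_size from the metadata

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

# The per-split byte counts should add up to dataset_size.
assert total_bytes == DATASET_SIZE

for name, stats in SPLITS.items():
    share = stats["num_examples"] / total_examples
    print(f"{name}: {stats['num_examples']} examples ({share:.1%})")
print(f"total: {total_examples} examples")
```

The three splits add up to 27,821 examples, which is consistent with the `10K<n<100K` size category, and the proportions come out at roughly 75% train, 8% validation, and 17% test.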

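AI-collected summary

Dataset introduction

Construction

The People's Daily NER dataset is built on expert-generated annotations over original text drawn from the People's Daily. Manual labeling by experts is intended to keep tag quality high for named entity recognition. Annotation follows the standard NER categories of person (PER), organization (ORG), and location (LOC), giving Chinese natural language processing a substantial resource.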

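Features

The dataset's main features are its expert annotation and broad applicability. The training split holds 20,865 examples covering person, organization, and location entities, and can support a range of natural language processing tasks. Because the dataset is monolingual (Chinese), it is particularly suited to research and applications in Chinese named entity recognition and can serve as a benchmark for entity recognition in Chinese text.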

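Usage

The People's Daily NER dataset can be used to train and evaluate named entity recognition models. Users load the train, validation, and test splits and use them for model training, tuning, and performance evaluation respectively. Each example provides a token sequence and the corresponding entity tags, from which models can be built and optimized. The data can be used with deep learning frameworks such as TensorFlow and PyTorch and provides standardized data for Chinese NER tasks.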
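
The split workflow described above might be sketched as follows. This assumes the Hugging Face `datasets` library is installed and that the repo id taken from the download link resolves on the hub or mirror in use; `load_peoples_daily_ner` and `split_roles` are hypothetical helper names, and depending on the installed `datasets` version the load call may need extra arguments that are not verified here.

```python
def load_peoples_daily_ner(repo_id="peoples-daily-ner/peoples_daily_ner"):
    """Load all splits with the Hugging Face `datasets` library.

    Assumption: `datasets` is installed and `repo_id` (taken from the
    download link above) is reachable from the current environment.
    """
    from datasets import load_dataset  # third-party; imported lazily

    return load_dataset(repo_id)


def split_roles(dataset_dict):
    """Pair each split with the role described in the text above."""
    return {
        "fit": dataset_dict["train"],            # model training
        "tune": dataset_dict["validation"],      # hyperparameter tuning
        "evaluate": dataset_dict["test"],        # final performance check
    }


# Example call (needs network access, so it is left commented out):
# roles = split_roles(load_peoples_daily_ner())
# print(roles["fit"][0]["tokens"], roles["fit"][0]["ner_tags"])
```

Keeping the test split out of both fitting and tuning is what makes the final evaluation number meaningful.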

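Background and challenges

Background overview

People's Daily NER is an expert-annotated dataset built for Chinese named entity recognition (NER). Its core research question is how to accurately identify and classify entities such as person, organization, and location names in Chinese text. It was created to give Chinese natural language processing a standardized benchmark and so advance NER techniques. The creation date and the principal researchers are not stated, but the dataset has contributed to Chinese NER, particularly in improving the performance of related algorithms and models.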

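Current challenges

Building the People's Daily NER dataset involved several challenges. First, the complexity of Chinese text makes entity boundaries hard to identify, especially with nested entities and long-distance dependencies. Second, annotation consistency and accuracy are hard to maintain: expert annotation is of higher quality, but it is costly in time and money. Third, the dataset's scale and diversity limit how well it generalizes to other domains and scenarios. Finally, the license is unspecified (listed as unknown), which may affect academic and commercial use.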
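
One small, mechanical slice of the annotation-consistency problem can be checked automatically: in a BIO scheme, an `I-X` tag should only follow `B-X` or `I-X` of the same type. The sketch below is an illustrative check, not part of the dataset's tooling; `bio_violations` and the sample sequence are made up for this example. Note also that a single BIO tag per token can only encode flat, non-overlapping spans, so nested entities cannot be represented in this label set.

```python
def bio_violations(tags):
    """Return indices where an I-X tag does not continue a B-X or I-X tag."""
    bad = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # A valid I-X must directly follow B-X or I-X of the same type.
            if prev not in (f"B-{entity_type}", f"I-{entity_type}"):
                bad.append(i)
        prev = tag
    return bad


# Hypothetical tag sequence with two inconsistent I- tags.
tags = ["B-PER", "I-PER", "O", "I-LOC", "B-ORG", "I-PER"]
print(bio_violations(tags))
```

Here positions 3 and 5 are flagged: `I-LOC` follows `O`, and `I-PER` follows `B-ORG`.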

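Common scenarios

Classic use cases

In natural language processing, the People's Daily NER dataset is used mainly for Chinese named entity recognition (NER). By tagging entities in text, namely person (PER), organization (ORG), and location (LOC) names, it gives researchers a standardized benchmark for training and evaluating NER models. Its classic use is building and optimizing Chinese NER models, especially for entity recognition in news text, where it supplies base data for applications such as information extraction and knowledge graph construction.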
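
Evaluating an NER model on this kind of data is commonly done at the entity level rather than per token: a predicted entity counts only if its type and both boundaries match the gold annotation. The following is a minimal sketch of that idea under the BIO label set used here; `extract_spans`, `span_f1`, and the sample sequences are illustrative and not taken from the dataset or any particular library.

```python
def extract_spans(tags):
    """Collect (type, start, end) spans from BIO tags; end is exclusive."""
    spans = []
    start, current = None, None
    # A trailing "O" sentinel flushes a span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Entity-level F1: a span is correct only on exact type and boundaries."""
    gold = set(extract_spans(gold_tags))
    pred = set(extract_spans(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted tags for one six-token sentence.
gold = ["B-LOC", "I-LOC", "O", "B-PER", "I-PER", "I-PER"]
pred = ["B-LOC", "I-LOC", "O", "B-PER", "I-PER", "O"]
print(extract_spans(gold))
print(span_f1(gold, pred))
```

In this example the location span matches exactly, but the predicted person span ends one token early, so only one of two entities is counted as correct even though five of six token tags agree.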

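Academic problems addressed

The People's Daily NER dataset addresses a key academic problem in Chinese named entity recognition: how to improve model performance when large-scale annotated data is scarce. By providing high-quality annotated data, it gives researchers a standardized test bed and has supported progress in Chinese NER. It has helped advance Chinese natural language processing, particularly information extraction and text understanding, and laid groundwork for later research.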

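Derived related work

Researchers have developed a variety of Chinese NER models on the People's Daily NER dataset and extended the work in several directions. For example, some studies introduce pre-trained language models such as BERT to further improve NER performance, while others explore multi-task learning and cross-lingual transfer to handle NER across different domains and scenarios. This derived work has broadened Chinese NER research and given practical applications more technical options.

The above content was collected and summarized by AI.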
