hlhdatscience/es-ner-massive

Name: hlhdatscience/es-ner-massive
Creator: hlhdatscience
Published: 2024-03-20 12:02:09
License: No description available

Hugging Face · Updated 2024-03-20 · Indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/hlhdatscience/es-ner-massive

Resource description:

---
dataset_info:
  features:
  - name: Tokens
    sequence: string
  - name: Tags
    sequence: int64
  - name: Tags_string
    sequence: string
  - name: Original_source
    dtype: string
  splits:
  - name: train
    num_bytes: 276428315
    num_examples: 471343
  - name: test
    num_bytes: 6419858
    num_examples: 11136
  - name: validation
    num_bytes: 6345480
    num_examples: 11456
  download_size: 54821843
  dataset_size: 289193653
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
task_categories:
- token-classification
language:
- es
size_categories:
- 100K<n<1M
license: apache-2.0
---

# Dataset Card for es-ner-massive

## Dataset Details

### Dataset Description

The es-ner-massive dataset is a combination of three datasets: tner/wikineural, conll2002, and polyglot_ner. It is designed for Named Entity Recognition (NER) tasks. Tags are curated to be span-based and encoded according to the following convention:

```python
encodings_dictionary = {
    "O": 0,
    "PER": 1,
    "ORG": 2,
    "LOC": 3,
    "MISC": 4
}
```

The dataset was designed to combine mid-sized Spanish NER datasets, either to perform basic NER or to support transfer learning with a solid knowledge base in the pretrained model.

- **Curated by:** [polyglot_ner](https://huggingface.co/datasets/polyglot_ner), [conll2002](https://huggingface.co/datasets/conll2002), [tner/wikineural](https://huggingface.co/datasets/tner/wikineural)
- **Language(s) (NLP):** Spanish
- **License:** [More Information Needed]

### Dataset Sources [optional]

The original sources are: [polyglot_ner](https://huggingface.co/datasets/polyglot_ner), [conll2002](https://huggingface.co/datasets/conll2002), [tner/wikineural](https://huggingface.co/datasets/tner/wikineural)

## Uses

The intended use is to fine-tune your pretrained model on the NER task.

## Dataset Structure

[More Information Needed]

## Dataset Creation

### Curation Rationale

Refer to the original datasets of the compilation.

[More Information Needed]

### Source Data

#### Data Collection and Processing

Refer to the original datasets of the compilation: [polyglot_ner](https://huggingface.co/datasets/polyglot_ner), [conll2002](https://huggingface.co/datasets/conll2002), [tner/wikineural](https://huggingface.co/datasets/tner/wikineural)

#### Who are the source data producers?

Refer to the original datasets of the compilation: [polyglot_ner](https://huggingface.co/datasets/polyglot_ner), [conll2002](https://huggingface.co/datasets/conll2002), [tner/wikineural](https://huggingface.co/datasets/tner/wikineural)

### Annotations [optional]

All original NER tags that were in a BIO schema were converted to a span schema.

#### Annotation process

Refer to the original datasets of the compilation: [polyglot_ner](https://huggingface.co/datasets/polyglot_ner), [conll2002](https://huggingface.co/datasets/conll2002), [tner/wikineural](https://huggingface.co/datasets/tner/wikineural)

#### Who are the annotators?

Refer to the original datasets of the compilation: [polyglot_ner](https://huggingface.co/datasets/polyglot_ner), [conll2002](https://huggingface.co/datasets/conll2002), [tner/wikineural](https://huggingface.co/datasets/tner/wikineural)

#### Personal and Sensitive Information

Refer to the original datasets of the compilation: [polyglot_ner](https://huggingface.co/datasets/polyglot_ner), [conll2002](https://huggingface.co/datasets/conll2002), [tner/wikineural](https://huggingface.co/datasets/tner/wikineural)

## Bias, Risks, and Limitations

Refer to the original datasets of the compilation: [polyglot_ner](https://huggingface.co/datasets/polyglot_ner), [conll2002](https://huggingface.co/datasets/conll2002), [tner/wikineural](https://huggingface.co/datasets/tner/wikineural)

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset. More information is needed for further recommendations. Refer to the original datasets of the compilation: [polyglot_ner](https://huggingface.co/datasets/polyglot_ner), [conll2002](https://huggingface.co/datasets/conll2002), [tner/wikineural](https://huggingface.co/datasets/tner/wikineural)

## Dataset Card Contact

You can email the author of this compilation at data_analitics_HLH@protonmail.com

Provided by:

hlhdatscience

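Original information summary

Dataset Card for es-ner-massive

Dataset details

Dataset description

The es-ner-massive dataset combines three datasets, tner/wikineural, conll2002, and polyglot_ner, and is intended for named entity recognition (NER) tasks. Tags are curated to be span-based and are encoded according to the following convention:

encodings_dictionary = { "O": 0, "PER": 1, "ORG": 2, "LOC": 3, "MISC": 4 }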
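The convention above can be exercised in both directions. The following is a minimal sketch: the dictionary is copied from the card, while the helper names `decode_tags` and `encode_tags` and the example tag sequence are illustrative assumptions, not part of the dataset.

```python
# Encoding convention copied from the dataset card.
encodings_dictionary = {"O": 0, "PER": 1, "ORG": 2, "LOC": 3, "MISC": 4}

# Inverse mapping: integer id -> tag name.
id2label = {idx: name for name, idx in encodings_dictionary.items()}


def decode_tags(tag_ids):
    """Turn a sequence of integer tags into their string names."""
    return [id2label[i] for i in tag_ids]


def encode_tags(tag_names):
    """Turn a sequence of tag names into their integer ids."""
    return [encodings_dictionary[name] for name in tag_names]


# Hypothetical tag sequence, not taken from the dataset.
example_ids = [1, 1, 0, 0, 3]
print(decode_tags(example_ids))  # ['PER', 'PER', 'O', 'O', 'LOC']
```

The `id2label` dictionary built here has the shape that token-classification model configurations commonly expect for label names.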

Dataset structure

Features

Tokens: sequence of strings
Tags: sequence of int64
Tags_string: sequence of strings
Original_source: string
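To make the four features concrete, here is a sketch of what one record might look like and a basic consistency check. The record values are hypothetical, and it is an assumption (not stated in the card) that `Tokens`, `Tags`, and `Tags_string` are parallel sequences with one entry per token; the exact strings stored in `Original_source` are also unknown.

```python
from typing import List, TypedDict


class NerRecord(TypedDict):
    Tokens: List[str]
    Tags: List[int]
    Tags_string: List[str]
    Original_source: str


def check_record(record: NerRecord) -> bool:
    """Return True if the three sequences have the same length."""
    n = len(record["Tokens"])
    return len(record["Tags"]) == n and len(record["Tags_string"]) == n


# Hypothetical record; the values are illustrative, not copied from the dataset.
sample: NerRecord = {
    "Tokens": ["Gabriel", "García", "Márquez", "nació", "en", "Aracataca", "."],
    "Tags": [1, 1, 1, 0, 0, 3, 0],
    "Tags_string": ["PER", "PER", "PER", "O", "O", "LOC", "O"],
    "Original_source": "conll2002",
}
print(check_record(sample))  # True
```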

Splits

train: 276428315 bytes, 471343 examples
test: 6419858 bytes, 11136 examples
validation: 6345480 bytes, 11456 examples
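The relative size of each split follows directly from the example counts listed above. A short calculation, using only those numbers:

```python
# Example counts as listed for each split.
splits = {"train": 471343, "test": 11136, "validation": 11456}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(total)  # 493935
for name, share in shares.items():
    # Prints each split's share of all examples as a percentage.
    print(f"{name}: {share:.2%}")
```

The training split holds roughly 95% of the examples, with the remainder divided almost evenly between test and validation.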

Size

Download size: 54821843 bytes
Dataset size: 289193653 bytes
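As a sanity check on these figures, the per-split byte counts should add up to the dataset size, and the ratio of download size to dataset size indicates how much smaller the stored files are. The sketch below uses only numbers from this page; attributing the difference to compressed storage is an assumption.

```python
# Byte counts as listed on this page.
split_bytes = {"train": 276428315, "test": 6419858, "validation": 6345480}
dataset_size = 289193653
download_size = 54821843

# The three splits account for the whole dataset size.
splits_match = sum(split_bytes.values()) == dataset_size
print(splits_match)  # True

# Fraction of the dataset size that has to be downloaded.
ratio = download_size / dataset_size
print(f"download / dataset = {ratio:.3f}")
print(f"dataset size: {dataset_size / 2**20:.1f} MiB")
```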

Configuration

config_name: default
data_files:
- train: data/train-*
- test: data/test-*
- validation: data/validation-*
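The default configuration maps each split to a glob pattern. The sketch below shows how such patterns select files and how the dataset could be loaded with the Hugging Face `datasets` library. The shard file name is hypothetical, `split_for_path` and `load_es_ner_massive` are illustrative helper names, and the loader is only defined here, not called, because it needs network access.

```python
from fnmatch import fnmatch

# Split -> glob pattern, as listed in the default configuration.
DATA_FILES = {
    "train": "data/train-*",
    "test": "data/test-*",
    "validation": "data/validation-*",
}


def split_for_path(path):
    """Return the split whose pattern matches `path`, or None."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_es_ner_massive():
    """Load all splits from the Hub (needs `datasets` and network access)."""
    # Imported lazily so the helper above also runs without the library.
    from datasets import load_dataset
    return load_dataset("hlhdatscience/es-ner-massive")


# Hypothetical shard name, used only to exercise the pattern matching.
print(split_for_path("data/train-00000-of-00002.parquet"))  # train
```

Calling `load_es_ner_massive()` should return a `DatasetDict` keyed by `train`, `test`, and `validation`, since the default configuration is picked up automatically.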

Task category

token-classification

Language

Spanish

Size category

100K<n<1M

License

apache-2.0

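Collected summary

Dataset introduction

Construction

In Spanish named entity recognition, consolidating data resources matters for model performance. The es-ner-massive dataset combines three existing datasets: tner/wikineural, conll2002, and polyglot_ner. During construction, the BIO tagging scheme of the original data was converted to a span-based scheme, and a standardized encoding dictionary maps entity categories to integer labels, which keeps the annotation scheme consistent. The merge is intended to pool mid-sized Spanish NER corpora and provide a solid data foundation for model training and transfer learning.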
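The BIO-to-span conversion is not spelled out in the source. One plausible reading, given that the encoding dictionary contains only bare entity types, is that the `B-` and `I-` prefixes are dropped so every token carries its entity type directly. The sketch below implements that assumption; `bio_to_span_tag` and `convert_sentence` are hypothetical names, and this is not necessarily the author's actual procedure.

```python
# Encoding dictionary as given in the dataset card.
ENCODING = {"O": 0, "PER": 1, "ORG": 2, "LOC": 3, "MISC": 4}


def bio_to_span_tag(tag):
    """Strip a leading 'B-' or 'I-' prefix; 'O' and bare types pass through."""
    if tag[:2] in ("B-", "I-"):
        return tag[2:]
    return tag


def convert_sentence(bio_tags):
    """Return (span-style tag names, integer ids) for one sentence."""
    names = [bio_to_span_tag(t) for t in bio_tags]
    ids = [ENCODING[n] for n in names]
    return names, ids


# Hypothetical BIO-tagged sentence.
bio = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(convert_sentence(bio))
# (['PER', 'PER', 'O', 'LOC', 'O'], [1, 1, 0, 3, 0])
```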

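Characteristics

The dataset's annotation scheme covers persons, organizations, locations, and miscellaneous entities, and uses span-based tags, which lets a model learn entity boundary information directly. It contains hundreds of thousands of instances split into training, validation, and test sets, and is described as well balanced. As a merge of several sources, it inherits the strengths of each original dataset and offers varied samples of language use, supporting both training and evaluation of Spanish NER models.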
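To illustrate how entity boundaries can be read off span-style tags, here is a sketch that groups consecutive identical tags into `(label, start, end)` spans. It assumes the tags are bare entity types without `B-`/`I-` prefixes; under that assumption two adjacent entities of the same type cannot be separated, which the docstring notes. The function name `extract_spans` and the tag sequence are illustrative.

```python
from itertools import groupby


def extract_spans(tags):
    """Group consecutive identical non-'O' tags into (label, start, end) spans.

    `end` is exclusive. Two adjacent entities of the same type are merged,
    because without B-/I- prefixes they cannot be told apart.
    """
    spans = []
    position = 0
    for label, run in groupby(tags):
        length = len(list(run))
        if label != "O":
            spans.append((label, position, position + length))
        position += length
    return spans


# Hypothetical tag sequence for one sentence.
tags = ["PER", "PER", "O", "LOC", "ORG", "ORG"]
print(extract_spans(tags))
# [('PER', 0, 2), ('LOC', 3, 4), ('ORG', 4, 6)]
```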

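Usage

In practice, es-ner-massive is mainly used to fine-tune and evaluate Spanish NER models. Researchers can load the standard splits directly and use the Tokens sequence with the corresponding Tags or Tags_string sequence for supervised learning. The dataset is compatible with common sequence-labeling frameworks and fits readily into fine-tuning pipelines for Transformer-based pretrained models. Training on it teaches a model to identify key entities in Spanish text, which in turn benefits downstream tasks such as information extraction and knowledge graph construction.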
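When fine-tuning a Transformer, word-level tags have to be aligned with subword tokens. The sketch below shows one common approach: keep the tag on the first subword of each word and mask the rest. It takes a `word_ids` list of the kind that fast tokenizers in the `transformers` library return from `word_ids()` when called with `is_split_into_words=True`. The tokenization shown is hypothetical, and masking continuation subwords is a design choice rather than something the dataset prescribes.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_tags):
    """Expand word-level tags to subword level.

    `word_ids` has one entry per subword: the index of the word it came from,
    or None for special tokens. Only the first subword of each word keeps the
    tag; all other positions are masked with IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            aligned.append(word_tags[word_id])
        else:
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical tokenization: [CLS] Ara ##cata ##ca es bonita [SEP]
word_ids = [None, 0, 0, 0, 1, 2, None]
word_tags = [3, 0, 0]  # LOC, O, O
print(align_labels(word_ids, word_tags))
# [-100, 3, -100, -100, 0, 0, -100]
```

An alternative is to copy the word's tag to every subword, which gives the model more supervised positions but weights long words more heavily.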

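Background and challenges

Background overview

In natural language processing, named entity recognition is a key information extraction task, and its progress depends on high-quality annotated corpora. The es-ner-massive dataset was compiled and published by hlhdatscience in 2024 to give Spanish NER research a unified, moderately sized training resource. It merges three established corpora, tner/wikineural, conll2002, and polyglot_ner, and converts the original BIO tagging scheme into a unified span-based encoding that covers four entity labels: person, organization, location, and miscellaneous. The merge enlarges the data available for Spanish NER and is meant to provide a solid knowledge base for cross-domain transfer learning and model pretraining.

Current challenges

Spanish NER faces several challenges. Fuzzy entity boundaries and nested structures are the core difficulty: models are prone to errors on compound organization names and on person names that contain geographic modifiers. Merging the data is also hard. The source datasets differ in annotation guidelines and entity category definitions, so their labels must be mapped and unified, and the corpora differ in text style and domain coverage, which can bias a model toward particular contexts. In addition, annotation inconsistencies in the original data and sparse long-tail entity instances make generalization harder and place higher demands on algorithm robustness.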
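One way to inspect the source and label imbalance described above is to count entity tags per `Original_source` value. The sketch below does this on a few hypothetical records; the source names and tags are illustrative, and the real values stored in `Original_source` may differ.

```python
from collections import Counter, defaultdict


def tag_counts_by_source(records):
    """Count entity tags (excluding 'O') per Original_source value."""
    counts = defaultdict(Counter)
    for record in records:
        source = record["Original_source"]
        counts[source].update(t for t in record["Tags_string"] if t != "O")
    return {source: dict(c) for source, c in counts.items()}


# Hypothetical records; source names and tags are illustrative only.
records = [
    {"Original_source": "conll2002", "Tags_string": ["PER", "PER", "O", "LOC"]},
    {"Original_source": "polyglot_ner", "Tags_string": ["O", "ORG", "O"]},
    {"Original_source": "conll2002", "Tags_string": ["O", "MISC"]},
]
print(tag_counts_by_source(records))
# {'conll2002': {'PER': 2, 'LOC': 1, 'MISC': 1}, 'polyglot_ner': {'ORG': 1}}
```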

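Common scenarios

Classic use case

In Spanish natural language processing, named entity recognition is a foundational task whose performance depends heavily on the scale and quality of annotated data. By merging wikineural, conll2002, and polyglot_ner, es-ner-massive forms a sizable benchmark for Spanish NER. Its most typical use is as fine-tuning data for pretrained language models: researchers rely on its unified span-based tags to train models that identify person names, organization names, place names, and miscellaneous entities, and to evaluate and improve a model's generalization and robustness on Spanish sequence labeling.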
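For evaluation, NER models are often scored at the entity level rather than per token. The sketch below computes exact-match precision, recall, and F1 over `(label, start, end)` spans for a single sentence. The function name `span_f1` and the example spans are hypothetical, and the source does not say which metric is used with this dataset.

```python
def span_f1(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over (label, start, end) spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted spans for one sentence.
gold = [("PER", 0, 2), ("LOC", 5, 6)]
pred = [("PER", 0, 2), ("ORG", 5, 6)]
print(span_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

To score a whole corpus this way, each span would also need a sentence index so that spans from different sentences do not collide in the sets.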

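Practical applications

In practice, es-ner-massive supports the development of intelligent systems for Spanish-speaking users. Models trained on it can be integrated into news and media analysis platforms to extract the key people and institutions mentioned in reports, used in financial risk control to identify company entities and locations in Spanish financial statements or contracts, or used in customer service systems to recognize specific entities in user queries and respond precisely. Such applications raise the level of automation in information processing and support digital transformation in Spanish-speaking regions.

Derived and related work

The release of the dataset has given rise to a body of work on Spanish named entity recognition. Many studies build on it to explore adapting and optimizing pretrained architectures such as BERT and RoBERTa for Spanish. Some focus on domain adaptation, using the dataset for knowledge transfer to improve entity recognition in vertical domains such as law and medicine. Others build on its unified annotation format to develop new span recognition models or evaluation protocols, further enriching the multilingual information extraction ecosystem.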

The content above was collected and summarized by 遇见数据集.
