nwu-ctext/nchlt

Name: nwu-ctext/nchlt
Creator: nwu-ctext
Published: 2024-01-18 11:10:13
License: no description given on this page (the card metadata lists cc-by-2.5)

Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/nwu-ctext/nchlt

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- af
- nr
- nso
- ss
- tn
- ts
- ve
- xh
- zu
license:
- cc-by-2.5
multilinguality:
- multilingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: NCHLT
dataset_info:
- config_name: af
  features:
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': OUT
          '1': B-PERS
          '2': I-PERS
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 3955069
    num_examples: 8961
  download_size: 25748344
  dataset_size: 3955069
- config_name: nr
  features:
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': OUT
          '1': B-PERS
          '2': I-PERS
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 3188781
    num_examples: 9334
  download_size: 20040327
  dataset_size: 3188781
- config_name: xh
  features:
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': OUT
          '1': B-PERS
          '2': I-PERS
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 2365821
    num_examples: 6283
  download_size: 14513302
  dataset_size: 2365821
- config_name: zu
  features:
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': OUT
          '1': B-PERS
          '2': I-PERS
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 3951366
    num_examples: 10955
  download_size: 25097584
  dataset_size: 3951366
- config_name: nso-sepedi
  features:
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': OUT
          '1': B-PERS
          '2': I-PERS
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 3322296
    num_examples: 7116
  download_size: 22077376
  dataset_size: 3322296
- config_name: nso-sesotho
  features:
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': OUT
          '1': B-PERS
          '2': I-PERS
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 4427898
    num_examples: 9471
  download_size: 30421109
  dataset_size: 4427898
- config_name: tn
  features:
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': OUT
          '1': B-PERS
          '2': I-PERS
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 3812339
    num_examples: 7943
  download_size: 25905236
  dataset_size: 3812339
- config_name: ss
  features:
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': OUT
          '1': B-PERS
          '2': I-PERS
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 3431063
    num_examples: 10797
  download_size: 21882224
  dataset_size: 3431063
- config_name: ve
  features:
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': OUT
          '1': B-PERS
          '2': I-PERS
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 3941041
    num_examples: 8477
  download_size: 26382457
  dataset_size: 3941041
- config_name: ts
  features:
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': OUT
          '1': B-PERS
          '2': I-PERS
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
          '7': B-MISC
          '8': I-MISC
  splits:
  - name: train
    num_bytes: 3941041
    num_examples: 8477
  download_size: 26382457
  dataset_size: 3941041
---

# Dataset Card for NCHLT

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [link](https://repo.sadilar.org/handle/20.500.12185/7/discover?filtertype_0=database&filtertype_1=title&filter_relational_operator_1=contains&filter_relational_operator_0=equals&filter_1=&filter_0=Monolingual+Text+Corpora%3A+Annotated&filtertype=project&filter_relational_operator=equals&filter=NCHLT+Text+II)
- **Repository:** []()
- **Paper:** []()
- **Leaderboard:** []()
- **Point of Contact:** []()

### Dataset Summary

The development of linguistic resources for use in natural language processing is of utmost importance for the continued growth of research and development in the field, especially for resource-scarce languages. In this paper we describe the process and challenges of simultaneously developing multiple linguistic resources for ten of the official languages of South Africa. The project focussed on establishing a set of foundational resources that can foster further development of both resources and technologies for the NLP industry in South Africa. The development efforts during the project included creating monolingual unannotated corpora, of which a subset of the corpora for each language was annotated on token, orthographic, morphological and morphosyntactic layers. The annotated subsets include both development and test sets and were used in the creation of five core-technologies, viz. a tokeniser, sentenciser, lemmatiser, part of speech tagger and morphological decomposer for each language. We report on the quality of these tools for each language and provide some more context of the importance of the resources within the South African context.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

[More Information Needed]

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Martin.Puttkammer@nwu.ac.za

### Licensing Information

[More Information Needed]

### Citation Information

```
@inproceedings{eiselen2014developing,
  title={Developing Text Resources for Ten South African Languages.},
  author={Eiselen, Roald and Puttkammer, Martin J},
  booktitle={LREC},
  pages={3698--3703},
  year={2014}
}
```

### Contributions

Thanks to [@Narsil](https://github.com/Narsil) for adding this dataset.

Provider:

nwu-ctext

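Summary of the original card

Dataset Card for NCHLT

Dataset Description

Dataset Summary

The NCHLT dataset is a language resource developed for ten of South Africa's official languages, focused on creating foundational resources to support the growth of the NLP industry in South Africa. The project created unannotated monolingual corpora and annotated a subset for each language on the token, orthographic, morphological and morphosyntactic layers. The annotated subsets include development and test sets and were used to build five core technologies: a tokeniser, sentenciser, lemmatiser, part-of-speech tagger and morphological decomposer.

Supported Tasks and Leaderboards

[More Information Needed]

Languages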

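The dataset covers the following languages (language codes as listed in the card metadata): af, nr, nso, ss, tn, ts, ve, xh, zu.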

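Dataset Structure

Data Instances

[More Information Needed]

Data Fields

Each configuration has the following features:

tokens: a sequence of strings
ner_tags: a sequence of named-entity-recognition labels, with the following classes: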
- 0: OUT
- 1: B-PERS
- 2: I-PERS
- 3: B-ORG
- 4: I-ORG
- 5: B-LOC
- 6: I-LOC
- 7: B-MISC
- 8: I-MISC
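
The id-to-label mapping above can be applied to a row as in the following minimal sketch. The label order is copied from the card metadata; the example row is hypothetical and is not taken from the dataset.

```python
# Label names in id order, copied from the ner_tags class_label list in the card.
NER_LABELS = [
    "OUT", "B-PERS", "I-PERS", "B-ORG", "I-ORG",
    "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]


def decode_tags(tag_ids):
    """Map integer ner_tags to their label names."""
    return [NER_LABELS[i] for i in tag_ids]


# Hypothetical row with the two documented fields (illustrative values only).
example = {
    "tokens": ["Nelson", "Mandela", "visited", "Pretoria"],
    "ner_tags": [1, 2, 0, 5],
}

for token, label in zip(example["tokens"], decode_tags(example["ner_tags"])):
    print(f"{token}\t{label}")
```

Note that the outside tag is spelled `OUT` here rather than the more common `O`, which matters when reusing tooling that assumes `O`.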

Data Splits

Each configuration has a single training split:

train: training set
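
Because only a `train` split is listed, an evaluation set has to be held out by the user. The sketch below shows one way to do that with a seeded shuffle of example indices. The 10% fraction, the seed and the helper name `split_indices` are assumptions for illustration, not part of the dataset.

```python
import random


def split_indices(n_examples, eval_fraction=0.1, seed=0):
    """Return (train_indices, eval_indices) from a seeded shuffle.

    eval_fraction and seed are illustrative defaults, not values
    prescribed by the dataset card.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_eval = int(n_examples * eval_fraction)
    return indices[n_eval:], indices[:n_eval]


# 10955 is the num_examples value listed for the zu configuration.
train_idx, eval_idx = split_indices(10955)
print(len(train_idx), len(eval_idx))
```

When the data is loaded with the Hugging Face `datasets` library, `Dataset.train_test_split(test_size=..., seed=...)` offers the same kind of split without manual index handling.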

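Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators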

Martin.Puttkammer@nwu.ac.za

Licensing Information

CC-BY-2.5

Citation Information

@inproceedings{eiselen2014developing,
  title={Developing Text Resources for Ten South African Languages.},
  author={Eiselen, Roald and Puttkammer, Martin J},
  booktitle={LREC},
  pages={3698--3703},
  year={2014}
}

Contributions

Thanks to @Narsil for adding this dataset.

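Collected Summary

Dataset Introduction

Construction

For resource-scarce languages, high-quality datasets are a prerequisite for progress in natural language processing. NCHLT targets ten of South Africa's official languages and, per the card metadata, was created and annotated by experts. Monolingual corpora were collected first; a subset for each language was then annotated on the token, orthographic, morphological and morphosyntactic layers. These annotated subsets in turn served as the basis for building core technologies for each language.

Characteristics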

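NCHLT provides named-entity annotations for South African official languages in ten configurations, including Afrikaans (af), isiZulu (zu) and isiXhosa (xh). Each configuration has its own train split, ranging from 6,283 to 10,955 examples. Every example is a token sequence paired with a sequence of NER tags in a BIO-style scheme (B- and I- prefixes, with OUT for non-entity tokens) covering persons, organisations, locations and miscellaneous entities. This structure makes the dataset well suited to training and evaluating named entity recognition models, particularly for resource-scarce languages.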
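
To make the BIO-style scheme concrete, the following sketch groups labelled tokens into entity spans. It is a generic BIO decoder written for illustration, not code from the dataset authors; treating a stray `I-` tag as the start of a new span is an assumption, and the example sentence is hypothetical.

```python
def extract_entities(tokens, labels):
    """Group B-/I- labelled tokens into (entity_type, text) spans.

    A B- tag opens a span, a matching I- tag extends it, and OUT closes it.
    A stray I- tag (no open span of the same type) is treated as a span
    start; that is an assumption about how to handle malformed sequences.
    """
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Emit the open span, if any.
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        prefix, _, ent_type = label.partition("-")
        if prefix == "B" or (prefix == "I" and ent_type != current_type):
            flush()
            current_type, current_tokens = ent_type, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:  # OUT
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


# Hypothetical sentence, not taken from the dataset.
tokens = ["Nelson", "Mandela", "visited", "Pretoria"]
labels = ["B-PERS", "I-PERS", "OUT", "B-LOC"]
print(extract_entities(tokens, labels))
```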

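Usage

The dataset primarily serves named entity recognition, a sequence-labelling task, as training and evaluation data. A specific language configuration, for example isiZulu (`zu`) or isiXhosa (`xh`), can be loaded from the Hugging Face Hub. Each example exposes two fields, `tokens` and `ner_tags`, which can be fed to neural models that accept pre-tokenised input. Because it is multilingual, the dataset can be used to build single-language models, and also as experimental data for cross-lingual transfer learning or cross-language comparison, supporting work on information extraction in South Africa's multilingual setting.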
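
Loading one configuration might look like the sketch below. The repository id and config names come from this page; the wrapper `load_nchlt` is a hypothetical helper. It assumes the Hugging Face `datasets` package is installed and that the installed version can still resolve this repository, which may not hold for every release.

```python
# Config names as listed in the card's dataset_info metadata.
CONFIGS = [
    "af", "nr", "xh", "zu", "nso-sepedi",
    "nso-sesotho", "tn", "ss", "ve", "ts",
]


def load_nchlt(config, split="train"):
    """Load one language configuration of nwu-ctext/nchlt.

    Hypothetical convenience wrapper: it rejects unknown config names
    before any network access is attempted.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the helper can be defined without the library installed.
    from datasets import load_dataset

    return load_dataset("nwu-ctext/nchlt", config, split=split)


# Usage (needs network access and the `datasets` package):
#   zu = load_nchlt("zu")
#   print(zu[0]["tokens"], zu[0]["ner_tags"])
```

Only `train` is listed as a split for each configuration, so it is the default here.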

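Background and Challenges

Background Overview

In natural language processing, developing language resources is essential to research progress, especially for resource-scarce languages. The NCHLT resources were created by a research team at North-West University in South Africa and described in a 2014 LREC paper, with the aim of building foundational language resources for ten of South Africa's official languages. The dataset published here focuses on providing annotated corpora for South Africa's multilingual setting in support of core NLP tasks such as named entity recognition. With expert annotation on the token, orthographic, morphological and morphosyntactic layers, the project helps fill a gap in resources for African languages and provides a basis for later language-technology development.

Current Challenges

The challenges fall into two groups. On the task side, named entity recognition has to cope with the diversity of entity expressions across South Africa's languages, for example differing naming conventions for persons, organisations and locations, which places high demands on a model's cross-lingual generalisation. On the construction side, collecting and annotating corpora for resource-scarce languages is difficult: expert annotators must deal with extensive language variation and a shortage of standardised resources while keeping annotation quality consistent and comparable across the ten languages. This requires substantial effort and illustrates the inherent difficulty of building data for low-resource languages.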

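Common Scenarios

Typical Use Cases

Research on resource-scarce languages is often limited by a lack of data. As a named-entity-annotated corpus for South African official languages, NCHLT is typically used to build basic language technology for these languages. The configurations published here carry NER tags, so the most direct use is training and evaluating named entity recognition models; the wider NCHLT project also produced tokenisers and part-of-speech taggers. Because all configurations share one tag set, the dataset also supports comparing models across languages.

Derived and Related Work

Work derived from NCHLT centres on building language-technology stacks for low-resource languages. For example, Eiselen and Puttkammer used the project data to develop core technologies including tokenisation, part-of-speech tagging and lemmatisation. Later studies have used the dataset's multilingual nature to explore cross-lingual transfer for sequence labelling, or have combined it with other African language resources for comparative analysis.

Recent Research on the Dataset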