community-datasets/sepedi_ner

Name: community-datasets/sepedi_ner
Creator: community-datasets
Published: 2024-06-26 06:16:45
License: Creative Commons Attribution 2.5 South Africa License

Source: Hugging Face. Updated 2024-06-26; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/community-datasets/sepedi_ner

Resource overview:

The Sepedi NER Corpus is a Named Entity Recognition (NER) dataset for the Sepedi language, developed by the Centre for Text Technology (CTexT) at North-West University, South Africa. The data is based on documents from the South African government domain, crawled from gov.za websites, and was created to help introduce resources for Sepedi. Annotation follows the CoNLL shared task standards and was carried out during the NCHLT text resource development project.

Data instances are sentences separated by empty lines, with each token and its tag separated by a tab. The data fields are id, tokens, and ner_tags. The NER tags follow the CoNLL shared task format: OUT, B-PERS, I-PERS, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, and I-MISC. The dataset is not divided into separate splits. The source text was produced by the writers of gov.za websites. The dataset is licensed under the Creative Commons Attribution 2.5 South Africa License.
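
The raw layout described here (one token and tag per line, tab-separated, with blank lines between sentences) can be read with a few lines of Python. The following is a minimal sketch, not the corpus authors' tooling: the function name parse_conll and the sample tokens w1 to w4 are hypothetical placeholders, and it assumes the tag is the last tab-separated column on each line.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def parse_conll(text: str) -> List[Sentence]:
    """Group tab-separated token/tag lines into sentences split on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: the tag is the last tab-separated field on the line.
        token, tag = line.rsplit("\t", 1)
        current.append((token, tag))
    if current:
        # Keep a final sentence that is not followed by a blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample; w1..w4 are placeholders, not text from the corpus.
sample = "w1\tB-LOC\nw2\tOUT\n\nw3\tB-PERS\nw4\tI-PERS\n"
sentences = parse_conll(sample)
```

Here sentences holds two lists of (token, tag) pairs, one per sentence in the sample.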

Provider:

community-datasets

Summary of original information

Dataset overview

Basic information

Dataset name: Sepedi NER Corpus
Language: Sepedi
License: Creative Commons Attribution 2.5 South Africa License
Dataset size: 1K<n<10K
Task category: Named Entity Recognition
Dataset source: original data

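Dataset structure

Features

id: string, the unique identifier of the example
tokens: sequence of strings, the tokens of the example text
ner_tags: sequence, the named entity tag of each token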
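
One way to picture a single record with these three fields is a small data class. This is an illustrative sketch only: the class name NerExample is hypothetical, and it assumes ner_tags holds integer class ids (as the numbered tag list suggests) with exactly one tag per token.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NerExample:
    """Hypothetical container mirroring the id / tokens / ner_tags fields."""

    id: str
    tokens: List[str]
    ner_tags: List[int]  # assumption: integer class ids, one per token

    def __post_init__(self) -> None:
        # Each token must have exactly one tag.
        if len(self.tokens) != len(self.ner_tags):
            raise ValueError(
                f"example {self.id}: {len(self.tokens)} tokens "
                f"but {len(self.ner_tags)} tags"
            )


# Placeholder tokens, not taken from the corpus.
example = NerExample(id="0", tokens=["w1", "w2"], ner_tags=[5, 0])
```

The length check catches misaligned records early, before they reach a tokenizer or model.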

Named entity tags

0: OUT
1: B-PERS
2: I-PERS
3: B-ORG
4: I-ORG
5: B-LOC
6: I-LOC
7: B-MISC
8: I-MISC
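
The id-to-label mapping above can be turned into entity spans by grouping each B- tag with the I- tags of the same type that follow it. The sketch below assumes that grouping convention; the helper names decode_tags and extract_spans are hypothetical, and an I- tag with no matching B- tag before it is simply ignored.

```python
from typing import List, Optional, Tuple

# Label order copied from the tag list above (index = class id).
LABELS = ["OUT", "B-PERS", "I-PERS", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode_tags(tag_ids: List[int]) -> List[str]:
    """Map integer class ids to their string labels."""
    return [LABELS[i] for i in tag_ids]


def extract_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end_exclusive) spans from B-/I- tags."""
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    etype: Optional[str] = None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and etype == tag[2:]:
            continue  # still inside the current entity
        if etype is not None:
            spans.append((etype, start, i))  # close the open entity
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]  # open a new entity
        # A stray I- tag without a matching B- is ignored in this sketch.
    if etype is not None:
        spans.append((etype, start, len(tags)))
    return spans


tags = decode_tags([1, 2, 0, 5, 0, 3, 4, 4])
spans = extract_spans(tags)
```

For the id sequence used here, spans contains one PERS span, one LOC span, and one ORG span, each as a half-open token index range.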

Data splits

Training set: 7,117 examples, 3,378,134 bytes

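Dataset creation

Source data

The data is based on documents from the South African government domain, crawled from gov.za websites.

Annotation process

The data was generated by experts.

License details

The dataset is released under the Creative Commons Attribution 2.5 South Africa License.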

Citation information

@inproceedings{sepedi_ner_corpus,
  author = {D.J. Prinsloo and Roald Eiselen},
  title = {NCHLT Sepedi Named Entity Annotated Corpus},
  booktitle = {Eiselen, R. 2016. Government domain named entity recognition for South African languages. Proceedings of the 10th Language Resource and Evaluation Conference, Portorož, Slovenia.},
  year = {2016},
  url = {https://repo.sadilar.org/handle/20.500.12185/328},
}
