nwu-ctext/siswati_ner_corpus
Source: Hugging Face. Updated 2024-01-18. Indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/nwu-ctext/siswati_ner_corpus
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- ss
license:
- other
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: Siswati NER Corpus
license_details: Creative Commons Attribution 2.5 South Africa License
dataset_info:
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': OUT
          '1': B-PERS
          '2': I-PERS
          '3': B-ORG
          '4': I-ORG
          '5': B-LOC
          '6': I-LOC
          '7': B-MISC
          '8': I-MISC
  config_name: siswati_ner_corpus
  splits:
  - name: train
    num_bytes: 3517151
    num_examples: 10798
  download_size: 21882224
  dataset_size: 3517151
---
# Dataset Card for Siswati NER Corpus
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Siswati NER Corpus Homepage](https://repo.sadilar.org/handle/20.500.12185/346)
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:** [Martin Puttkammer](mailto:Martin.Puttkammer@nwu.ac.za)
### Dataset Summary
The Siswati NER Corpus is a Siswati dataset developed by [The Centre for Text Technology (CTexT), North-West University, South Africa](http://humanities.nwu.ac.za/ctext). The data is based on documents from the South African government domain and was crawled from gov.za websites. It was created to support the named entity recognition (NER) task for the Siswati language. The dataset uses CoNLL shared task annotation standards.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
The language supported is Siswati.
## Dataset Structure
### Data Instances
In the raw data, sentences are separated by an empty line, and each token is paired with its tag by a tab. An example data point:
```
{'id': '0',
'ner_tags': [0, 0, 0, 0, 0],
'tokens': ['Tinsita', 'tebantfu', ':', 'tinsita', 'tetakhamiti']
}
```
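The relationship between the tab-separated raw format and the record shown above can be sketched in plain Python. This is a minimal illustration, not the loader used to build the dataset: it assumes one `token<TAB>tag` pair per line, and the function name `parse_conll` and the sequential `id` values are hypothetical.

```python
from typing import Dict, List

# Label order copied from the `ner_tags` feature in the card metadata.
LABELS = ["OUT", "B-PERS", "I-PERS", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}


def parse_conll(text: str) -> List[Dict]:
    """Turn tab-separated token/tag lines into one record per sentence.

    Assumes exactly two tab-separated columns per non-empty line.
    """
    records, tokens, tags = [], [], []
    # The extra "" at the end flushes the last sentence if the text
    # does not end with an empty line.
    for line in text.splitlines() + [""]:
        if not line.strip():
            if tokens:
                records.append(
                    {"id": str(len(records)), "tokens": tokens, "ner_tags": tags}
                )
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(LABEL_TO_ID[tag])
    return records


# The sentence from the example above, written in the assumed raw layout.
raw = "Tinsita\tOUT\ntebantfu\tOUT\n:\tOUT\ntinsita\tOUT\ntetakhamiti\tOUT\n"
records = parse_conll(raw)
print(records[0])
```

Storing tags as integer ids rather than strings matches the `class_label` feature declared in the metadata.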
### Data Fields
- `id`: id of the sample
- `tokens`: the tokens of the example text
- `ner_tags`: the NER tags of each token
The NER tags correspond to this list:
```
"OUT", "B-PERS", "I-PERS", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"
```
The NER tags follow the format of the CoNLL shared task: a B- prefix marks the first token of a phrase and an I- prefix marks any non-initial token. There are four types of phrases: person names (PERS), organizations (ORG), locations (LOC) and miscellaneous names (MISC). OUT is used for tokens that are not part of any named entity.
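To make the B/I convention concrete, the sketch below decodes integer tag ids into entity spans. It is an illustration under stated assumptions rather than part of the dataset tooling: the function name `extract_entities` is hypothetical, the tag sequence and placeholder tokens are invented for the example, and a stray I- tag with no preceding B- tag is leniently treated as the start of a span.

```python
from typing import List, Tuple

# Label order copied from the tag list above.
LABELS = ["OUT", "B-PERS", "I-PERS", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def extract_entities(ner_tags: List[int]) -> List[Tuple[str, int, int]]:
    """Return (entity_type, start, end) spans, with `end` exclusive."""
    spans = []
    start, current = None, None  # the currently open span, if any
    for i, tag_id in enumerate(ner_tags):
        prefix, _, etype = LABELS[tag_id].partition("-")
        # An I- tag of the same type extends the open span.
        if prefix == "I" and current == etype:
            continue
        # Anything else closes the open span first.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # B- opens a new span; a stray I- is treated the same way.
        if prefix in ("B", "I"):
            start, current = i, etype
    if current is not None:
        spans.append((current, start, len(ner_tags)))
    return spans


# Invented tag sequence with placeholder tokens (not taken from the corpus):
# B-PERS I-PERS OUT B-LOC OUT B-ORG
tokens = ["w0", "w1", "w2", "w3", "w4", "w5"]
tags = [1, 2, 0, 5, 0, 3]
entities = extract_entities(tags)
for etype, start, end in entities:
    print(etype, tokens[start:end])
```

For the all-`OUT` sentence shown under Data Instances, the function returns an empty list.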
### Data Splits
The data was not split. All 10,798 examples are provided in a single `train` split.
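Because no official split exists, anyone evaluating a model has to create their own. The following is one possible sketch of a reproducible split over example indices; the 80/10/10 proportions, the seed, and the function name `split_indices` are assumptions made for illustration and are not prescribed by the dataset authors.

```python
import random
from typing import List, Tuple


def split_indices(
    n: int, dev_frac: float = 0.1, test_frac: float = 0.1, seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle range(n) with a fixed seed and cut it into train/dev/test."""
    indices = list(range(n))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = indices[:n_test]
    dev = indices[n_test:n_test + n_dev]
    train = indices[n_test + n_dev:]
    return train, dev, test


# 10798 is the number of examples reported in the card metadata.
train_idx, dev_idx, test_idx = split_indices(10798)
print(len(train_idx), len(dev_idx), len(test_idx))  # 8640 1079 1079
```

Recording the seed and fractions alongside any reported results lets others reconstruct the same partition. The Hugging Face `datasets` library also provides `Dataset.train_test_split`, which accepts `test_size` and `seed` arguments and serves a similar purpose.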
## Dataset Creation
### Curation Rationale
The data was created to help introduce resources for a new language, Siswati.
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
The data is based on documents from the South African government domain and was crawled from gov.za websites.
#### Who are the source language producers?
The data was produced by writers of South African government websites (gov.za).
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
The data was annotated during the NCHLT text resource development project.
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
The annotated data sets were developed by the Centre for Text Technology (CTexT, North-West University, South Africa).
See: [more information](http://www.nwu.ac.za/ctext)
### Licensing Information
The data is under the [Creative Commons Attribution 2.5 South Africa License](http://creativecommons.org/licenses/by/2.5/za/legalcode).
### Citation Information
```
@inproceedings{siswati_ner_corpus,
author = {B.B. Malangwane and
M.N. Kekana and
S.S. Sedibe and
B.C. Ndhlovu and
Roald Eiselen},
title = {NCHLT Siswati Named Entity Annotated Corpus},
booktitle = {Eiselen, R. 2016. Government domain named entity recognition for South African languages. Proceedings of the 10th Language Resource and Evaluation Conference, Portorož, Slovenia.},
year = {2016},
url = {https://repo.sadilar.org/handle/20.500.12185/346},
}
```
### Contributions
Thanks to [@yvonnegitau](https://github.com/yvonnegitau) for adding this dataset.
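Provider:
nwu-ctext
Summary of original information
Dataset overview
Basic dataset information
- Dataset name: Siswati NER Corpus
- Dataset creator: Centre for Text Technology (CTexT), North-West University, South Africa
- Language: Siswati
- License: Creative Commons Attribution 2.5 South Africa License
- Dataset size: 10K<n<100K
- Task category: Named-Entity Recognition
Dataset structure
Data fields
- id: unique identifier of the sample
- tokens: the text tokens of the sample
- ner_tags: the named entity recognition tag of each token
Named entity tags
- 0: OUT
- 1: B-PERS
- 2: I-PERS
- 3: B-ORG
- 4: I-ORG
- 5: B-LOC
- 6: I-LOC
- 7: B-MISC
- 8: I-MISC
Data splits
- Training set: 10798 examples, total size 3517151 bytes
Dataset creation
Data source
- Data collection: based on the South African government domain, crawled from gov.za websites
- Data annotation: expert-generated
Licensing information
- License details: Creative Commons Attribution 2.5 South Africa License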



