five

nwu-ctext/sesotho_ner_corpus

收藏
Hugging Face2024-01-18 更新2024-06-15 收录
下载链接:
https://hf-mirror.com/datasets/nwu-ctext/sesotho_ner_corpus
下载链接
链接失效反馈
官方服务:
资源简介:
--- annotations_creators: - expert-generated language_creators: - found language: - st license: - other multilinguality: - monolingual size_categories: - 1K<n<10K source_datasets: - original task_categories: - token-classification task_ids: - named-entity-recognition pretty_name: Sesotho NER Corpus license_details: Creative Commons Attribution 2.5 South Africa License dataset_info: features: - name: id dtype: string - name: tokens sequence: string - name: ner_tags sequence: class_label: names: '0': OUT '1': B-PERS '2': I-PERS '3': B-ORG '4': I-ORG '5': B-LOC '6': I-LOC '7': B-MISC '8': I-MISC config_name: sesotho_ner_corpus splits: - name: train num_bytes: 4502576 num_examples: 9472 download_size: 30421109 dataset_size: 4502576 --- # Dataset Card for Sesotho NER Corpus ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** [Sesotho Ner Corpus Homepage](https://repo.sadilar.org/handle/20.500.12185/334) - **Repository:** - **Paper:** - **Leaderboard:** - **Point of Contact:** [Martin Puttkammer](mailto:Martin.Puttkammer@nwu.ac.za) ### Dataset Summary The Sesotho Ner Corpus is a Sesotho dataset developed by [The Centre for Text Technology (CTexT), North-West University, South Africa](http://humanities.nwu.ac.za/ctext). The data is based on documents from the South African goverment domain and crawled from gov.za websites. It was created to support NER task for Sesotho language. The dataset uses CoNLL shared task annotation standards. ### Supported Tasks and Leaderboards [More Information Needed] ### Languages The language supported is Sesotho. ## Dataset Structure ### Data Instances A data point consists of sentences seperated by empty line and tab-seperated tokens and tags. ``` {'id': '0', 'ner_tags': [0, 0, 0, 0, 0], 'tokens': ['Morero', 'wa', 'weposaete', 'ya', 'Ditshebeletso'] } ``` ### Data Fields - `id`: id of the sample - `tokens`: the tokens of the example text - `ner_tags`: the NER tags of each token The NER tags correspond to this list: ``` "OUT", "B-PERS", "I-PERS", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC", ``` The NER tags have the same format as in the CoNLL shared task: a B denotes the first item of a phrase and an I any non-initial word. There are four types of phrases: person names (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). (OUT) is used for tokens not considered part of any named entity. ### Data Splits The data was not split. ## Dataset Creation ### Curation Rationale The data was created to help introduce resources to new language - Sesotho. [More Information Needed] ### Source Data #### Initial Data Collection and Normalization The data is based on South African government domain and was crawled from gov.za websites. #### Who are the source language producers? The data was produced by writers of South African government websites - gov.za [More Information Needed] ### Annotations #### Annotation process [More Information Needed] #### Who are the annotators? The data was annotated during the NCHLT text resource development project. [More Information Needed] ### Personal and Sensitive Information [More Information Needed] ## Considerations for Using the Data ### Social Impact of Dataset [More Information Needed] ### Discussion of Biases [More Information Needed] ### Other Known Limitations [More Information Needed] ## Additional Information ### Dataset Curators The annotated data sets were developed by the Centre for Text Technology (CTexT, North-West University, South Africa). See: [more information](http://www.nwu.ac.za/ctext) ### Licensing Information The data is under the [Creative Commons Attribution 2.5 South Africa License](http://creativecommons.org/licenses/by/2.5/za/legalcode) ### Citation Information ``` @inproceedings{sesotho_ner_corpus, author = {M. Setaka and Roald Eiselen}, title = {NCHLT Sesotho Named Entity Annotated Corpus}, booktitle = {Eiselen, R. 2016. Government domain named entity recognition for South African languages. Proceedings of the 10th Language Resource and Evaluation Conference, Portorož, Slovenia.}, year = {2016}, url = {https://repo.sadilar.org/handle/20.500.12185/334}, } ``` ### Contributions Thanks to [@yvonnegitau](https://github.com/yvonnegitau) for adding this dataset.
提供机构:
nwu-ctext
原始信息汇总

数据集概述

数据集基本信息

  • 数据集名称: Sesotho NER Corpus
  • 语言: Sesotho
  • 许可证: Creative Commons Attribution 2.5 South Africa License
  • 数据集大小: 1K<n<10K
  • 任务类型: 命名实体识别 (Named-Entity Recognition)

数据集结构

数据字段

  • id: 样本的唯一标识符
  • tokens: 文本的词汇序列
  • ner_tags: 每个词汇的命名实体标签

命名实体标签

  • 0: OUT (非命名实体)
  • 1: B-PERS (人名开始)
  • 2: I-PERS (人名内部)
  • 3: B-ORG (组织名开始)
  • 4: I-ORG (组织名内部)
  • 5: B-LOC (地点名开始)
  • 6: I-LOC (地点名内部)
  • 7: B-MISC (杂项名开始)
  • 8: I-MISC (杂项名内部)

数据分割

  • 训练集: 包含9472个样本,大小为4502576字节

数据集来源

  • 数据来源: 南非政府域名文档,从gov.za网站爬取
  • 数据创建者: 北西大学人文学院文本技术中心 (CTexT)

引用信息

@inproceedings{sesotho_ner_corpus, author = {M. Setaka and Roald Eiselen}, title = {NCHLT Sesotho Named Entity Annotated Corpus}, booktitle = {Eiselen, R. 2016. Government domain named entity recognition for South African languages. Proceedings of the 10th Language Resource and Evaluation Conference, Portorož, Slovenia.}, year = {2016}, url = {https://repo.sadilar.org/handle/20.500.12185/334}, }

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作