
cfilt/HiNER-original

Source: Hugging Face. Updated: 2023-03-07. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/cfilt/HiNER-original
Resource overview:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- hi
license: "cc-by-sa-4.0"
multilinguality:
- monolingual
paperswithcode_id: hiner-original-1
pretty_name: HiNER - Large Hindi Named Entity Recognition dataset
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- token-classification
task_ids:
- named-entity-recognition
---

<p align="center"><img src="https://huggingface.co/datasets/cfilt/HiNER-collapsed/raw/main/cfilt-dark-vec.png" alt="Computation for Indian Language Technology Logo" width="150" height="150"/></p>

# Dataset Card for HiNER-original

[![Twitter Follow](https://img.shields.io/twitter/follow/cfiltnlp?color=1DA1F2&logo=twitter&style=flat-square)](https://twitter.com/cfiltnlp)
[![Twitter Follow](https://img.shields.io/twitter/follow/PeopleCentredAI?color=1DA1F2&logo=twitter&style=flat-square)](https://twitter.com/PeopleCentredAI)

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** https://github.com/cfiltnlp/HiNER
- **Repository:** https://github.com/cfiltnlp/HiNER
- **Paper:** https://arxiv.org/abs/2204.13743
- **Leaderboard:** https://paperswithcode.com/sota/named-entity-recognition-on-hiner-original
- **Point of Contact:** Rudra Murthy V

### Dataset Summary

This dataset was created at CFILT Lab, IIT Bombay, for the fundamental NLP task of Named Entity Recognition in Hindi. We gathered sentences from various government information webpages and manually annotated them as part of our data collection strategy.

**Note:** The dataset contains sentences from ILCI and other sources. The ILCI dataset requires a license from the Indian Language Consortium, so we do not distribute the ILCI portion of the data. To obtain the full dataset, please send us a mail with proof of ILCI data acquisition.

### Supported Tasks and Leaderboards

Named Entity Recognition

### Languages

Hindi

## Dataset Structure

### Data Instances

`{'id': '0', 'tokens': ['प्राचीन', 'समय', 'में', 'उड़ीसा', 'को', 'कलिंग', 'के', 'नाम', 'से', 'जाना', 'जाता', 'था', '।'], 'ner_tags': [0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0]}`

### Data Fields

- `id`: The ID value of the data point.
- `tokens`: Raw tokens in the dataset.
- `ner_tags`: The NER tags for the tokens, one tag per token.

### Data Splits

|           | Train | Valid | Test  |
| --------- | ----- | ----- | ----- |
| original  | 76025 | 10861 | 21722 |
| collapsed | 76025 | 10861 | 21722 |

## About

This repository contains the Hindi Named Entity Recognition dataset (HiNER) published at the Language Resources and Evaluation Conference (LREC) in 2022. A pre-print is available on arXiv [here](https://arxiv.org/abs/2204.13743).

### Recent Updates

* Version 0.0.5: HiNER initial release

## Usage

You need the `datasets` package installed to use the :rocket: HuggingFace datasets repository. Install it via pip:

```shell
pip install datasets
```

To use the original dataset with all the tags, please use:

```python
from datasets import load_dataset
hiner = load_dataset('cfilt/HiNER-original')
```

To use the collapsed dataset with only PER, LOC, and ORG tags, please use:

```python
from datasets import load_dataset
hiner = load_dataset('cfilt/HiNER-collapsed')
```

The dataset files in CoNLL format can also be found in the Git repository under the [data](data/) folder.

## Model(s)

Our best performing models are hosted on the HuggingFace models repository:

1. [HiNER-Collapsed-XLM-R](https://huggingface.co/cfilt/HiNER-Collapse-XLM-Roberta-Large)
2. [HiNER-Original-XLM-R](https://huggingface.co/cfilt/HiNER-Original-XLM-Roberta-Large)

## Dataset Creation

### Curation Rationale

HiNER was built on data extracted from various websites run by the Government of India that provide information in Hindi. The dataset was built for the task of Named Entity Recognition, with the aim of adding new resources for Hindi, a language that has been under-served in Natural Language Processing.

### Source Data

#### Initial Data Collection and Normalization

HiNER was built on data extracted from various websites run by the Government of India that provide information in Hindi.

#### Who are the source language producers?

Various Government of India webpages

### Annotations

#### Annotation process

This dataset was manually annotated by a single annotator over a long span of time.

#### Who are the annotators?

Pallab Bhattacharjee

### Personal and Sensitive Information

We ensured that there was no sensitive information present in the dataset. All the data points are curated from publicly available information.

## Considerations for Using the Data

### Social Impact of Dataset

The purpose of this dataset is to provide a large Hindi Named Entity Recognition dataset. Since the information (data points) has been obtained from public resources, we do not think there is a negative social impact in releasing this data.

### Discussion of Biases

Any biases contained in the data released by the Indian government are bound to be present in our data.

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

Pallab Bhattacharjee

### Licensing Information

CC-BY-SA 4.0

### Citation Information

```latex
@misc{https://doi.org/10.48550/arxiv.2204.13743,
  doi = {10.48550/ARXIV.2204.13743},
  url = {https://arxiv.org/abs/2204.13743},
  author = {Murthy, Rudra and Bhattacharjee, Pallab and Sharnagat, Rahul and Khatri, Jyotsana and Kanojia, Diptesh and Bhattacharyya, Pushpak},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {HiNER: A Large Hindi Named Entity Recognition Dataset},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
Provider:
cfilt
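Summary of Original Information

Dataset Overview

  • Name: HiNER - Large Hindi Named Entity Recognition dataset
  • Language: Hindi
  • License: CC-BY-SA-4.0
  • Multilinguality: monolingual
  • Dataset size: 100K<n<1M
  • Source datasets: original
  • Task category: token classification
  • Task ID: named entity recognition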
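
The size category listed above can be cross-checked against the split counts published in the dataset card. The following is a minimal sketch of that arithmetic; the dictionary keys and the helper name `split_summary` are illustrative choices, not part of the dataset or of any library.

```python
# Split sizes as published in the dataset card's "Data Splits" table.
# The keys mirror the table's column headers; they are not guaranteed
# to match the split names used by any loader.
SPLIT_SIZES = {"train": 76025, "valid": 10861, "test": 21722}


def split_summary(sizes):
    """Return the total example count and each split's share of it."""
    total = sum(sizes.values())
    shares = {name: count / total for name, count in sizes.items()}
    return total, shares


total, shares = split_summary(SPLIT_SIZES)
print(f"total examples: {total}")
for name, share in shares.items():
    print(f"{name:>6}: {share:.1%}")
```

The three splits sum to 108,608 examples, which falls inside the stated 100K<n<1M range, and the shares work out to roughly 70/10/20 for train, validation, and test.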

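Dataset Description

  • Summary: The dataset was created for Named Entity Recognition in Hindi. The sentences were gathered from various government information webpages and manually annotated.
  • Supported tasks: Named Entity Recognition
  • Language: Hindi

Dataset Structure

  • Data instances: each instance contains an ID, a list of tokens, and a list of NER tags.
  • Data fields:
    • id: The ID value of the data point.
    • tokens: Raw tokens in the dataset.
    • ner_tags: The NER tags for the tokens.
  • Data splits:
    • Original dataset: 76,025 training, 10,861 validation, and 21,722 test examples.
    • Collapsed dataset: 76,025 training, 10,861 validation, and 21,722 test examples.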
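
To make the instance layout concrete, here is a minimal sketch that walks the example instance from the dataset card and pairs each token with its tag id. The helper name `tagged_tokens` is hypothetical, and treating tag id 0 as the "not an entity" tag is an assumption read off this single example; the card shown here does not document the id-to-label mapping.

```python
# Example instance copied from the dataset card.
example = {
    "id": "0",
    "tokens": ["प्राचीन", "समय", "में", "उड़ीसा", "को", "कलिंग", "के",
               "नाम", "से", "जाना", "जाता", "था", "।"],
    "ner_tags": [0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 0, 0],
}


def tagged_tokens(instance, outside_id=0):
    """Return (token, tag id) pairs whose tag id differs from outside_id.

    outside_id=0 is an assumption based on this one example, not a
    documented label mapping.
    """
    tokens, tags = instance["tokens"], instance["ner_tags"]
    # The two lists are parallel: one tag id per token.
    if len(tokens) != len(tags):
        raise ValueError("tokens and ner_tags must have the same length")
    return [(tok, tag) for tok, tag in zip(tokens, tags) if tag != outside_id]


entities = tagged_tokens(example)
print(entities)
```

In this instance the two tokens that carry a non-zero tag are the fourth and sixth ones, both with tag id 3. To turn ids into label names, consult the label list shipped with the dataset rather than guessing it.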

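Dataset Creation

  • Source data: extracted from various websites run by the Government of India that provide information in Hindi.
  • Annotation process: the dataset was manually annotated by a single annotator.
  • Annotator: Pallab Bhattacharjee

Considerations for Using the Data

  • Social impact: the dataset aims to provide a large Hindi Named Entity Recognition resource. Because the data points come from public resources, no negative social impact is expected.
  • Discussion of biases: any biases in the data released by the Indian government may be present in this dataset.

Additional Information

  • Dataset curator: Pallab Bhattacharjee
  • Licensing information: CC-BY-SA 4.0
  • Citation information: see arXiv:2204.13743.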
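Collected Summary
Dataset Introduction
Background and Challenges
Background Overview
HiNER-original is a dataset built for Hindi Named Entity Recognition. It was collected from Indian government websites and manually annotated at the CFILT Lab, IIT Bombay. The dataset falls in the 100K<n<1M size category and comes in two versions, original and collapsed, which use different entity tag sets. It is suitable for training and evaluating Hindi NER models.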
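
The two versions correspond to the two repository ids given in the dataset card's Usage section. As a minimal sketch of choosing between them, the helpers `hiner_repo_id` and `load_hiner` below are hypothetical names introduced for illustration; only `load_dataset` and the two repository ids come from the card.

```python
# Repository ids named in the dataset card's Usage section.
HINER_REPOS = {
    "original": "cfilt/HiNER-original",    # all tags
    "collapsed": "cfilt/HiNER-collapsed",  # only PER, LOC, and ORG tags
}


def hiner_repo_id(variant):
    """Map a variant name to its repository id (hypothetical helper)."""
    try:
        return HINER_REPOS[variant]
    except KeyError:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(HINER_REPOS)}"
        ) from None


def load_hiner(variant="original"):
    """Download the chosen variant.

    Requires the `datasets` package and network access, so the import
    is done lazily and this function is not called at module level.
    """
    from datasets import load_dataset
    return load_dataset(hiner_repo_id(variant))


print(hiner_repo_id("collapsed"))
```

Calling `load_hiner("collapsed")` would fetch the version restricted to PER, LOC, and ORG tags, while the default fetches the version with the full tag set.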