HiNER
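HiNER - Hindi Named Entity Recognition Dataset
About
This repository contains the Hindi Named Entity Recognition dataset (HiNER), published at the Language Resources and Evaluation Conference (LREC) in 2022. The arXiv preprint is available as arXiv:2204.13743.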
Latest Updates
- Version 0.0.5: initial release of HiNER
Usage
To use the HuggingFace dataset repository, you need the `datasets` package. Install it with pip:

    pip install datasets

To use the original dataset with all tags:

    from datasets import load_dataset
    hiner = load_dataset("cfilt/HiNER-original")

To use the collapsed dataset, which contains only the PER, LOC, and ORG tags:

    from datasets import load_dataset
    hiner = load_dataset("cfilt/HiNER-collapsed")

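Once a split is loaded, a common next step is to group token-level tags into entity mentions. The following is a minimal sketch, assuming the tags follow the I-O-B scheme mentioned in the paper abstract and are available as strings of the form `B-PER`, `I-PER`, and `O`. The function name `bio_to_spans` and the sample sentence are illustrative only and are not part of the HiNER code base; the actual label strings in the dataset may differ.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Assumes string tags such as "B-PER", "I-PER", and "O" (hypothetical
    label names; check the dataset's own label list before relying on them).
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity,
            # closes whatever is open. A stray I- tag is dropped.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Illustrative sentence, not taken from the dataset.
example_tokens = ["नरेंद्र", "मोदी", "दिल्ली", "में", "हैं"]
example_tags = ["B-PER", "I-PER", "B-LOC", "O", "O"]
print(bio_to_spans(example_tokens, example_tags))
# [('नरेंद्र मोदी', 'PER'), ('दिल्ली', 'LOC')]
```

If the loaded dataset stores tags as integer class IDs rather than strings, they would need to be mapped back to label names before being passed to a function like this.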
The dataset files in CoNLL format are also available in the `data` folder of this Git repository.
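For readers who prefer to work with the CoNLL files directly, here is a minimal reading sketch. It assumes the usual CoNLL layout: one token per line, the tag in the last whitespace-separated column, and a blank line between sentences. The exact column layout of the files in the `data` folder is not described here, so check a file before relying on this; the function name `read_conll` and the sample text are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-style lines into (tokens, tags) pairs, one per sentence.

    Assumption: the first column is the token and the last column is the tag.
    """
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []

    for line in lines:
        line = line.strip()
        if not line:
            # A blank line marks the end of the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])
        tags.append(columns[-1])

    # Handle a final sentence that is not followed by a blank line.
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Illustrative in-memory sample, not taken from the dataset files.
sample = "दिल्ली\tB-LOC\nमें\tO\n\nराम\tB-PER\n"
for sentence_tokens, sentence_tags in read_conll(sample.splitlines()):
    print(sentence_tokens, sentence_tags)
```

To read a real file, pass an open file handle instead of the sample, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to one of the files in the `data` folder.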
Models
Our best-performing models are hosted on the HuggingFace model repository:
| Model | HiNER - Original | HiNER - Collapsed | Description |
|---|---|---|---|
| XLM-R<sub>large</sub> | HiNER-Original-XLM-R-Large | HiNER-Collapsed-XLM-R-Large | Fine-tuned on the XLM-R<sub>large</sub> multilingual language model |
| MuRIL<sub>base</sub> | HiNER-Original-MuRIL-Base | HiNER-Collapsed-MuRIL-Base | Fine-tuned on the MuRIL<sub>base</sub> multilingual language model |
Maintainers
Diptesh Kanojia<br/> Rudra Murthy V<br/>
Citation
Murthy, R., Bhattacharjee, P., Sharnagat, R., Khatri, J., Kanojia, D. and Bhattacharyya, P., 2022. HiNER: A Large Hindi Named Entity Recognition Dataset. arXiv preprint arXiv:2204.13743.
BibTeX Citation

    @InProceedings{murthy-EtAl:2022:LREC,
      author    = {Murthy, Rudra and Bhattacharjee, Pallab and Sharnagat, Rahul and Khatri, Jyotsana and Kanojia, Diptesh and Bhattacharyya, Pushpak},
      title     = {HiNER: A large Hindi Named Entity Recognition Dataset},
      booktitle = {Proceedings of the Language Resources and Evaluation Conference},
      month     = {June},
      year      = {2022},
      address   = {Marseille, France},
      publisher = {European Language Resources Association},
      pages     = {4467--4476},
      abstract  = {Named Entity Recognition (NER) is a foundational NLP task that aims to provide class labels like Person, Location, Organisation, Time, and Number to words in free text. Named Entities can also be multi-word expressions where the additional I-O-B annotation information helps label them during the NER annotation process. While English and European languages have considerable annotated data for the NER task, Indian languages lack on that front- both in terms of quantity and following annotation standards. This paper releases a significantly sized standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags. We discuss the dataset statistics in all their essential detail and provide an in-depth analysis of the NER tag-set used with our data. The statistics of tag-set in our dataset shows a healthy per-tag distribution especially for prominent classes like Person, Location and Organisation. Since the proof of resource-effectiveness is in building models with the resource and testing the model on benchmark data and against the leader-board entries in shared tasks, we do the same with the aforesaid data. We use different language models to perform the sequence labelling task for NER and show the efficacy of our data by performing a comparative evaluation with models trained on another dataset available for the Hindi NER task. Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper. To the best of our knowledge, no available dataset meets the standards of volume (amount) and variability (diversity), as far as Hindi NER is concerned. We fill this gap through this work, which we hope will significantly help NLP for Hindi. We release this dataset with our code and models for further research at https://github.com/cfiltnlp/HiNER},
      url       = {https://aclanthology.org/2022.lrec-1.475}
    }




