AsNER

Source: arXiv | Updated: 2022-07-08 | Indexed: 2024-06-21
Download link:
https://anonymous.4open.science/r/AsNER-04B3/
Overview:
AsNER is a named entity recognition (NER) dataset for Assamese, a low-resource language, developed at the Indian Institute of Technology Guwahati. It contains approximately 99,000 tokens drawn from speeches by the Prime Minister of India and from Assamese plays. Annotated entity types include person names, locations, and addresses, among others. The dataset is intended as a resource for deep neural network-based Assamese language processing. To support annotation accuracy, it was built using a part-of-speech (POS) tagger together with manual annotation by three local Assamese speakers and one linguist. Intended applications include information retrieval, text understanding, and automatic text summarization, with the broader goal of easing the shortage of Assamese resources in natural language processing (NLP).
Provider:
Indian Institute of Technology Guwahati
Created:
2022-07-08