NERSkill.Id
Indexed from NIAID Data Ecosystem on 2026-05-01
Download link:
https://data.mendeley.com/datasets/5s8r9ndfvc
Description:
NERSkill.Id is the first annotated named entity recognition (NER) corpus focused on skill entities in the Indonesian language, and a valuable addition to the existing resources for Natural Language Processing (NLP) in Indonesian. Although relatively compact, NERSkill.Id holds considerable promise for refining language models. It can also be combined with larger existing corpora to train broader, more versatile mixed Indonesian models for diverse NLP tasks.
The dataset categorizes named entities into three classes: hard skill, soft skill, and technology. It consists of 418,868 tokens, each tagged using the BIO format. The annotation table follows a CoNLL-2003-style layout with three columns: Sentence#, word, and tag.
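As a rough illustration of the three-column BIO layout described above, the following minimal sketch groups rows by sentence and collects entity spans. The sample rows, the tag names (`HSkill`, `SSkill`, `Tech`), and the helper functions `group_sentences` and `bio_to_spans` are all hypothetical and are not taken from the dataset; check the actual files for the real column headers and label names.

```python
from itertools import groupby

# Hypothetical rows in the (sentence id, word, tag) layout. The tag names
# HSkill / SSkill / Tech are placeholders, not confirmed dataset labels.
ROWS = [
    ("1", "Menguasai", "O"),
    ("1", "Python", "B-Tech"),
    ("1", "dan", "O"),
    ("1", "analisis", "B-HSkill"),
    ("1", "data", "I-HSkill"),
    ("2", "Mampu", "O"),
    ("2", "bekerja", "B-SSkill"),
    ("2", "sama", "I-SSkill"),
]


def group_sentences(rows):
    """Group consecutive rows sharing a sentence id into (words, tags) pairs."""
    sentences = []
    for _, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        words = [word for _, word, _ in group]
        tags = [tag for _, _, tag in group]
        sentences.append((words, tags))
    return sentences


def bio_to_spans(words, tags):
    """Collect (entity_type, text) spans from a BIO-tagged sentence."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span; close any open one first.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the open span of the same type.
            current_words.append(word)
        else:
            # "O" or a stray I- tag: close the open span and skip the token.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


if __name__ == "__main__":
    for words, tags in group_sentences(ROWS):
        print(bio_to_spans(words, tags))
```

In this sketch a stray `I-` tag with no matching `B-` is treated like `O`; a stricter reader might instead raise an error or start a new span, depending on how the downstream model should handle annotation noise.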
* The accompanying paper is available at https://www.sciencedirect.com/science/article/pii/S235234092400163X (please cite).
Created:
2024-04-02



