Albanian part-of-speech corpus

Source: arXiv. Updated: 2019-12-03. Indexed: 2024-06-21.
Download link:
https://github.com/NeldaKote/Albanian-POS
Description:
The Albanian part-of-speech corpus was created at the Polytechnic University of Tirana and is the first publicly available corpus of Albanian annotated with part-of-speech and morphological tags. It contains about 118,000 tokens of naturally occurring text drawn from several kinds of sources, including novels, grammar books, web-crawled pages, and Wikipedia. A further 67,000 tokens of artificially created simple sentences are included and used only for training. The annotation follows the Universal Dependencies scheme for morphology, and the corpus is intended for training and evaluating tokenization, morphological tagging, and lemmatization models. It addresses the shortage of natural language processing resources for Albanian, in particular morphological taggers and lemmatizers.
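
Because the corpus follows the Universal Dependencies annotation scheme, its files are presumably distributed in the tab-separated CoNLL-U layout; that is an assumption here, not something this listing confirms. Under that assumption, the sketch below shows one way to read the form, lemma, part-of-speech tag, and morphological features of each token using only the Python standard library. The three-token sample sentence and its tags are illustrative and are not copied from the corpus, and the names `read_conllu` and `parse_feats` are hypothetical helpers.

```python
from collections import Counter

# Illustrative CoNLL-U sample (10 tab-separated columns per token).
# NOT taken from the corpus; the tags are plausible UD values only.
SAMPLE = (
    "# text = Unë lexoj.\n"
    "1\tUnë\tunë\tPRON\t_\tCase=Nom|Number=Sing|Person=1|PronType=Prs\t_\t_\t_\t_\n"
    "2\tlexoj\tlexoj\tVERB\t_\tMood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin\t_\t_\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)


def parse_feats(feats):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def read_conllu(text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as '# text = ...'
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip malformed rows instead of guessing
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        current.append({
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "feats": parse_feats(cols[5]),
        })
    if current:
        sentences.append(current)
    return sentences


sentences = read_conllu(SAMPLE)
upos_counts = Counter(tok["upos"] for sent in sentences for tok in sent)
```

To use it on a real file, the text could be passed in with `read_conllu(open(path, encoding="utf-8").read())`, where `path` points to whichever file the repository actually provides. The resulting `upos_counts` gives a quick view of the tag distribution before training a tagger or lemmatizer.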
Provider:
Polytechnic University of Tirana
Created:
2019-12-03