
jerteh/SrpKor4Tagging

Source: Hugging Face. Updated 2024-03-25; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/jerteh/SrpKor4Tagging
Resource overview:
---
license: cc-by-sa-4.0
task_categories:
- token-classification
language:
- sr
pretty_name: SrpKor4Tagging training dataset
size_categories:
- 100K<n<1M
---

The corpus is a mix of literary (⅓) and administrative (⅔) texts in Serbian. It is lemmatized and POS-tagged with two tagsets: the Universal POS tagset and the SrpLemKor tagset (made according to traditional, descriptive Serbian grammar).

It consists of a single jsonl file that can be loaded via:

```python
from datasets import load_dataset

dataset = load_dataset("jerteh/SrpKor4Tagging")
```

Preview:

```python
ds = dataset["train"][1389]
# Print each token with its Universal POS tag and lemma
for x, y, z in zip(ds["token"], ds["ud"], ds["lemma"]):
    print(x, y, z)

# Output:
# Okrugle ADJ okrugao
# mongolske ADJ mongolski
# fizionomije NOUN fizionomija
# behu AUX biti
# ustupile VERB ustupiti
# mesto NOUN mesto
# licima NOUN lice
# evropskijeg ADJ evropski
# tipa NOUN tip
# , PUNCT ,
# prljavim ADJ prljav
# , PUNCT ,
# obradatelim ADJ obradateo
# i CCONJ i
# iscrpenim ADJ iscrpen
# . PUNCT .
```

Citation:

```bibtex
@inproceedings{stankovic-etal-2020-machine,
    title = "Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for {S}erbian",
    author = "Stankovic, Ranka and {\v{S}}andrih, Branislava and Krstev, Cvetana and Utvi{\'c}, Milo{\v{s}} and Skoric, Mihailo",
    editor = "Calzolari, Nicoletta and B{\'e}chet, Fr{\'e}d{\'e}ric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.487",
    pages = "3954--3962",
    abstract = "The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of a gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on TreeTagger and spaCy taggers, and the annotation schema alignment between Serbian morphological dictionaries, MULTEXT-East and Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of developed taggers were compared and the impact of training set size was investigated, which resulted in around 98{\%} PoS-tagging precision per token for both new models. The sr{\_}basic annotated dataset will also be published.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
```
Provider:
jerteh
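Summary of the original information

Dataset overview

Basic information

  • License: CC-BY-SA-4.0
  • Task category: part-of-speech tagging
  • Language: Serbian
  • Dataset name: SrpKor4Tagging training dataset
  • Dataset size: 100K<n<1M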

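Dataset content

  • Data source: Serbian texts, a mix of literary (⅓) and administrative (⅔) material.
  • Annotation: POS tags from two tagsets, the Universal POS tagset and the SrpLemKor tagset (made according to traditional, descriptive Serbian grammar), plus lemmas.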
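
As a minimal sketch of how the annotation layers line up, the snippet below pairs tokens with their Universal POS tags and lemmas and counts tag frequencies. The field names `token`, `ud`, and `lemma` come from the dataset card preview; the sample record is a hand-copied excerpt of that preview, and the helpers `to_triples` and `upos_counts` are hypothetical names introduced here for illustration. The field holding the SrpLemKor tags is not named in the card, so it is left out.

```python
from collections import Counter

# Excerpt of the sentence shown in the dataset card preview (first six tokens).
# Assumption: each record stores parallel lists under "token", "ud", "lemma".
record = {
    "token": ["Okrugle", "mongolske", "fizionomije", "behu", "ustupile", "mesto"],
    "ud": ["ADJ", "ADJ", "NOUN", "AUX", "VERB", "NOUN"],
    "lemma": ["okrugao", "mongolski", "fizionomija", "biti", "ustupiti", "mesto"],
}


def to_triples(rec):
    """Pair each token with its Universal POS tag and its lemma."""
    return list(zip(rec["token"], rec["ud"], rec["lemma"]))


def upos_counts(records):
    """Count Universal POS tags over an iterable of records."""
    counts = Counter()
    for rec in records:
        counts.update(rec["ud"])
    return counts


triples = to_triples(record)
counts = upos_counts([record])
```

The same `upos_counts` helper should also accept the rows of a loaded split, since it only reads the `ud` list of each record.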

Data format

  • File type: a single jsonl file
  • Loading: in Python, run `from datasets import load_dataset`, then `dataset = load_dataset("jerteh/SrpKor4Tagging")`
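
If the `datasets` library is not available, a jsonl file can also be read with the standard library alone. The sketch below is illustrative only: the card does not give the file name, so an in-memory stream stands in for the file, and the record layout (parallel `token`, `ud`, `lemma` lists) is an assumption inferred from the card's preview. `read_jsonl` is a hypothetical helper, not part of any library.

```python
import io
import json


def read_jsonl(lines):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# In-memory stand-in for the real file; the contents are made-up sample rows
# that follow the assumed layout, using tokens from the card's preview.
sample = io.StringIO(
    '{"token": ["behu", "ustupile"], "ud": ["AUX", "VERB"], "lemma": ["biti", "ustupiti"]}\n'
    "\n"
    '{"token": ["."], "ud": ["PUNCT"], "lemma": ["."]}\n'
)

records = list(read_jsonl(sample))
```

For a file on disk, the same generator can be fed an open file handle, for example one opened with `encoding="utf-8"` so that Serbian characters are decoded correctly.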

Citation information

  • Reference (BibTeX): `@inproceedings{stankovic-etal-2020-machine, title = "Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for {S}erbian", author = "Stankovic, Ranka and {\v{S}}andrih, Branislava and Krstev, Cvetana and Utvi{\'c}, Milo{\v{s}} and Skoric, Mihailo", year = "2020", booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference", publisher = "European Language Resources Association" }`