jerteh/SrpKor4Tagging
Source: Hugging Face. Updated: 2024-03-25. Indexed: 2024-06-11.
Download link:
https://hf-mirror.com/datasets/jerteh/SrpKor4Tagging
Description:
---
license: cc-by-sa-4.0
task_categories:
- token-classification
language:
- sr
pretty_name: SrpKor4Tagging training dataset
size_categories:
- 100K<n<1M
---
The corpus is a mix of literary (⅓) and administrative (⅔) texts in Serbian.
It is lemmatized and POS-tagged with two tagsets: the Universal POS tagset and the SrpLemKor tagset (based on traditional, descriptive Serbian grammar).
It consists of a single JSONL file that can be loaded via:
```python
from datasets import load_dataset
dataset = load_dataset("jerteh/SrpKor4Tagging")
```
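Because the data ships as one JSONL file, it can also be read without the `datasets` library. The sketch below shows the general technique on two made-up lines; it assumes each line is a JSON object with the `token`, `ud` and `lemma` fields seen in the preview, and the helper name `read_jsonl` is hypothetical, not part of the dataset.

```python
import io
import json


def read_jsonl(handle):
    """Yield one parsed JSON object per non-empty line of a text handle."""
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_no}: invalid JSON") from err


# Two made-up lines standing in for the real file (assumed record shape).
sample = io.StringIO(
    '{"token": ["mesto", "."], "ud": ["NOUN", "PUNCT"], "lemma": ["mesto", "."]}\n'
    "\n"
    '{"token": ["i"], "ud": ["CCONJ"], "lemma": ["i"]}\n'
)
records = list(read_jsonl(sample))
print(len(records), records[0]["ud"])
```

For the real file, the same function would be given a handle from `open(path, encoding="utf-8")`, where `path` points to a local copy of the JSONL file.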
Preview:
```python
# Pick one sentence and print each token with its Universal POS tag and lemma.
ds = dataset["train"][1389]
for x, y, z in zip(ds["token"], ds["ud"], ds["lemma"]):
    print(x, y, z)

# Output:
# Okrugle ADJ okrugao
# mongolske ADJ mongolski
# fizionomije NOUN fizionomija
# behu AUX biti
# ustupile VERB ustupiti
# mesto NOUN mesto
# licima NOUN lice
# evropskijeg ADJ evropski
# tipa NOUN tip
# , PUNCT ,
# prljavim ADJ prljav
# , PUNCT ,
# obradatelim ADJ obradateo
# i CCONJ i
# iscrpenim ADJ iscrpen
# . PUNCT .
```
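The preview suggests that each record stores its annotations as parallel lists. As a minimal sketch under that assumption, the snippet below turns one record into per-token tuples and counts Universal POS tags. The sample record copies the first six tokens of the preview, and `to_rows` and `tag_counts` are hypothetical helper names.

```python
from collections import Counter

# First six tokens of the preview sentence; real rows are assumed to share
# this shape (parallel lists under "token", "ud" and "lemma").
record = {
    "token": ["Okrugle", "mongolske", "fizionomije", "behu", "ustupile", "mesto"],
    "ud": ["ADJ", "ADJ", "NOUN", "AUX", "VERB", "NOUN"],
    "lemma": ["okrugao", "mongolski", "fizionomija", "biti", "ustupiti", "mesto"],
}


def to_rows(rec):
    """Return one (token, tag, lemma) tuple per token of a record."""
    if not (len(rec["token"]) == len(rec["ud"]) == len(rec["lemma"])):
        raise ValueError("annotation lists differ in length")
    return list(zip(rec["token"], rec["ud"], rec["lemma"]))


def tag_counts(records):
    """Count Universal POS tags over an iterable of records."""
    counts = Counter()
    for rec in records:
        counts.update(rec["ud"])
    return counts


rows = to_rows(record)
counts = tag_counts([record])
print(rows[0])
print(dict(counts))
```

With the full dataset loaded, `tag_counts(dataset["train"])` would give the tag distribution of the training split, provided the rows have the assumed shape.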
Citation:
```bibtex
@inproceedings{stankovic-etal-2020-machine,
title = "Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for {S}erbian",
author = "Stankovic, Ranka and
{\v{S}}andrih, Branislava and
Krstev, Cvetana and
Utvi{\'c}, Milo{\v{s}} and
Skoric, Mihailo",
editor = "Calzolari, Nicoletta and
B{\'e}chet, Fr{\'e}d{\'e}ric and
Blache, Philippe and
Choukri, Khalid and
Cieri, Christopher and
Declerck, Thierry and
Goggi, Sara and
Isahara, Hitoshi and
Maegaard, Bente and
Mariani, Joseph and
Mazo, H{\'e}l{\`e}ne and
Moreno, Asuncion and
Odijk, Jan and
Piperidis, Stelios",
booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.487",
pages = "3954--3962",
abstract = "The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of a gender. The harmonization of resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on TreeTagger and spaCy taggers, and the annotation schema alignment between Serbian morphological dictionaries, MULTEXT-East and Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of developed taggers were compared and the impact of training set size was investigated, which resulted in around 98{\%} PoS-tagging precision per token for both new models. The sr{\_}basic annotated dataset will also be published.",
language = "English",
ISBN = "979-10-95546-34-4",
}
```
Provider:
jerteh
Summary of the original information
Dataset overview
Basic information
- License: CC-BY-SA-4.0
- Task category: part-of-speech tagging
- Language: Serbian
- Dataset name: SrpKor4Tagging training dataset
- Dataset size: 100K<n<1M
Dataset content
- Data source: Serbian texts, a mix of literary (⅓) and administrative (⅔) material.
- Annotation: POS tags from two tagsets, the Universal POS tagset and the SrpLemKor tagset (based on traditional, descriptive Serbian grammar), plus lemmas.
Data format
- File type: a single JSONL file
- Loading: `from datasets import load_dataset`, then `dataset = load_dataset("jerteh/SrpKor4Tagging")`
Citation information
- Reference: Stankovic, Ranka; Šandrih, Branislava; Krstev, Cvetana; Utvić, Miloš; Skoric, Mihailo (2020). "Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian". In Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association.



