classla/reldi_hr
Source: Hugging Face. Updated: 2022-10-25. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/classla/reldi_hr
Resource description:
---
language:
- hr
license:
- cc-by-sa-4.0
task_categories:
- other
task_ids:
- lemmatization
- named-entity-recognition
- part-of-speech
tags:
- structure-prediction
- normalization
- tokenization
---
This dataset is based on 3,871 Croatian tweets that were segmented into sentences and tokens and annotated with normalized forms, lemmas, MULTEXT-East tags (XPOS), UPOS tags, morphological features, and named entities.
The dataset contains 6,339 training samples (sentences), 815 validation samples, and 785 test samples.
Each sample represents a sentence and includes the following features: sentence ID ('sent\_id'),
list of tokens ('tokens'), list of normalized tokens ('norms'), list of lemmas ('lemmas'), list of UPOS tags ('upos\_tags'),
list of MULTEXT-East tags ('xpos\_tags'), list of morphological features ('feats'),
and list of named entity IOB tags ('iob\_tags'), which are encoded as class labels.
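
Because the named entity tags are stored as class labels, a consumer typically maps them back to strings and then groups the tokens into entity spans. The following is a minimal sketch of that step, not the dataset's own tooling. The label inventory `IOB_LABELS`, the sample sentence, and the helper `iob_to_spans` are all hypothetical and only illustrate the documented field layout; the real label names and their order should be read from the dataset's feature definitions.

```python
from typing import List, Tuple

# Hypothetical label inventory (assumption): the real names and their order
# come from the dataset's own feature definitions and may differ.
IOB_LABELS: List[str] = ["O", "B-per", "I-per", "B-loc", "I-loc"]

# Made-up sample that mirrors some of the documented fields; it is not taken
# from the corpus. 'norms', 'lemmas', 'upos_tags', 'xpos_tags' and 'feats'
# are omitted for brevity.
sample = {
    "sent_id": "example.1",
    "tokens": ["Ivan", "Horvat", "ide", "u", "Zagreb", "."],
    "iob_tags": [1, 2, 0, 0, 3, 0],  # indices into IOB_LABELS
}


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity text, entity type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity, closes the span.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Decode the integer class labels to strings, then extract the entities.
tag_names = [IOB_LABELS[i] for i in sample["iob_tags"]]
entities = iob_to_spans(sample["tokens"], tag_names)
print(entities)
```

In this sketch an `I-` tag that does not continue an open entity of the same type is treated like `O`; other conventions (for example, promoting it to `B-`) are equally defensible and depend on the evaluation setup.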
If you are using this dataset in your research, please cite the following paper:
```
@article{Miličević_Ljubešić_2016,
title={Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets},
volume={4},
url={https://revije.ff.uni-lj.si/slovenscina2/article/view/7007},
DOI={10.4312/slo2.0.2016.2.156-188},
number={2},
journal={Slovenščina 2.0: empirical, applied and interdisciplinary research},
author={Miličević, Maja and Ljubešić, Nikola},
year={2016},
month={Sep.},
pages={156–188} }
```
Provider:
classla
Summary of original information
Dataset overview
Basic information
- Language: Croatian (hr)
- License: CC BY-SA 4.0
- Task categories: other
- Task IDs: lemmatization, named-entity-recognition, part-of-speech
- Tags: structure-prediction, normalization, tokenization
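Dataset description
- Source: 3,871 Croatian tweets
- Processing: sentence segmentation and tokenization, followed by annotation with normalized forms, lemmas, part-of-speech tags, and named entities
- Number of samples:
  - Training: 6,339 sentences
  - Validation: 815 sentences
  - Test: 785 sentences
- Sample features:
  - Sentence ID (sent_id)
  - List of tokens (tokens)
  - List of normalized tokens (norms)
  - List of lemmas (lemmas)
  - List of UPOS tags (upos_tags)
  - List of MULTEXT-East tags (xpos_tags)
  - List of morphological features (feats)
  - List of named entity IOB tags (iob_tags)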
Citation information
- Paper title: Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets
- Authors: Maja Miličević, Nikola Ljubešić
- Journal: Slovenščina 2.0: empirical, applied and interdisciplinary research
- Volume: 4
- Issue: 2
- Publication date: September 2016
- Pages: 156-188
- DOI: 10.4312/slo2.0.2016.2.156-188
- URL: https://revije.ff.uni-lj.si/slovenscina2/article/view/7007



