classla/reldi_hr

Name: classla/reldi_hr
Creator: classla
Published: 2022-10-25 07:30:56
License: CC BY-SA 4.0

Source: Hugging Face. Updated 2022-10-25. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/classla/reldi_hr

Resource description:

---
language:
- hr
license:
- cc-by-sa-4.0
task_categories:
- other
task_ids:
- lemmatization
- named-entity-recognition
- part-of-speech
tags:
- structure-prediction
- normalization
- tokenization
---

This dataset is based on 3,871 Croatian tweets that were segmented into sentences and tokens, and annotated with normalized forms, lemmas, MULTEXT-East tags (XPOS), UPOS tags and morphological features, and named entities.

The dataset contains 6,339 training samples (sentences), 815 validation samples and 785 test samples. Each sample represents a sentence and includes the following features: sentence ID ('sent_id'), list of tokens ('tokens'), list of normalised tokens ('norms'), list of lemmas ('lemmas'), list of UPOS tags ('upos_tags'), list of MULTEXT-East tags ('xpos_tags'), list of morphological features ('feats'), and list of named entity IOB tags ('iob_tags'), which are encoded as class labels.

If you are using this dataset in your research, please cite the following paper:

```
@article{Miličević_Ljubešić_2016,
  title={Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets},
  volume={4},
  url={https://revije.ff.uni-lj.si/slovenscina2/article/view/7007},
  DOI={10.4312/slo2.0.2016.2.156-188},
  number={2},
  journal={Slovenščina 2.0: empirical, applied and interdisciplinary research},
  author={Miličević, Maja and Ljubešić, Nikola},
  year={2016},
  month={Sep.},
  pages={156–188}
}
```

Provider:

classla

Summary of original information

Dataset overview

Basic information

Language: Croatian (hr)
License: CC BY-SA 4.0
Task categories: other
Task IDs: lemmatization, named-entity-recognition, part-of-speech
Tags: structure-prediction, normalization, tokenization

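Dataset description

Source: 3,871 Croatian tweets
Processing: segmented into sentences and tokens, then annotated with normalized forms, lemmas, part-of-speech tags, and named entities
Number of samples:
- Training: 6,339 sentences
- Validation: 815 sentences
- Test: 785 sentences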
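
The split sizes above can be put in proportion with a few lines of arithmetic. The sketch below uses only the counts stated in the description. The sentences-per-tweet figure additionally assumes that the three splits together cover all 3,871 source tweets, which the description implies but does not state outright.

```python
# Split sizes as stated in the dataset description (sentences per split).
splits = {"train": 6339, "validation": 815, "test": 785}
num_tweets = 3871  # source tweets, as stated in the description

total = sum(splits.values())

# Share of each split, in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

# Assumption: all tweets are represented across the three splits.
sentences_per_tweet = round(total / num_tweets, 2)

print(total)                # 7939
print(shares)               # {'train': 79.8, 'validation': 10.3, 'test': 9.9}
print(sentences_per_tweet)  # 2.05
```

This works out to roughly an 80/10/10 split, with about two sentences per tweet on average.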
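Sample features:
- Sentence ID (sent_id)
- List of tokens (tokens)
- List of normalised tokens (norms)
- List of lemmas (lemmas)
- List of UPOS tags (upos_tags)
- List of MULTEXT-East tags (xpos_tags)
- List of morphological features (feats)
- List of named entity IOB tags (iob_tags)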
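
To make the record layout concrete, here is a minimal sketch of what one sample might look like and how its parallel lists could be consumed. The field names come from the description. Everything else is an assumption for illustration: the sentence and its annotations are invented, not taken from the corpus, the helper names `is_aligned` and `iob_spans` are hypothetical, the entity label names (`per`, `loc`) are guesses, and the IOB tags are shown as strings even though the description says they are encoded as class labels.

```python
from typing import Any, Dict, List, Tuple

# Field names follow the dataset description; all values are invented
# for illustration and do not come from the corpus.
sample: Dict[str, Any] = {
    "sent_id": "example.1",
    "tokens": ["Ivan", "ide", "u", "Zagreb", "."],
    "norms": ["Ivan", "ide", "u", "Zagreb", "."],
    "lemmas": ["Ivan", "ići", "u", "Zagreb", "."],
    "upos_tags": ["PROPN", "VERB", "ADP", "PROPN", "PUNCT"],
    "xpos_tags": ["Npmsn", "Vmr3s", "Sa", "Npmsan", "Z"],
    "feats": [
        "Case=Nom|Gender=Masc|Number=Sing",
        "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
        "Case=Acc",
        "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing",
        "_",
    ],
    # Assumed label names; the real data stores these as class-label integers.
    "iob_tags": ["B-per", "O", "O", "B-loc", "O"],
}

TOKEN_LEVEL_FIELDS = [
    "tokens", "norms", "lemmas", "upos_tags", "xpos_tags", "feats", "iob_tags",
]


def is_aligned(record: Dict[str, Any]) -> bool:
    """Return True when every token-level list has one entry per token."""
    n = len(record["tokens"])
    return all(len(record[field]) == n for field in TOKEN_LEVEL_FIELDS)


def iob_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity_type, text) pairs."""
    spans: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open entity, then start a new one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


print(is_aligned(sample))
print(iob_spans(sample["tokens"], sample["iob_tags"]))
```

With the real dataset, the integer tags would first need to be mapped back to their string names before a helper like `iob_spans` could be applied.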

Citation information

Title: Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets
Authors: Maja Miličević, Nikola Ljubešić
Journal: Slovenščina 2.0: empirical, applied and interdisciplinary research
Volume: 4
Issue: 2
Published: September 2016
Pages: 156-188
DOI: 10.4312/slo2.0.2016.2.156-188
URL: https://revije.ff.uni-lj.si/slovenscina2/article/view/7007
