
classla/reldi_hr

Source: Hugging Face. Updated: 2022-10-25. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/classla/reldi_hr
Resource description:
---
language:
- hr
license:
- cc-by-sa-4.0
task_categories:
- other
task_ids:
- lemmatization
- named-entity-recognition
- part-of-speech
tags:
- structure-prediction
- normalization
- tokenization
---

This dataset is based on 3,871 Croatian tweets that were segmented into sentences and tokens and annotated with normalized forms, lemmas, MULTEXT-East tags (XPOS), UPOS tags and morphological features, and named entities.

The dataset contains 6,339 training samples (sentences), 815 validation samples and 785 test samples. Each sample represents a sentence and includes the following features: sentence ID ('sent_id'), list of tokens ('tokens'), list of normalised tokens ('norms'), list of lemmas ('lemmas'), list of UPOS tags ('upos_tags'), list of MULTEXT-East tags ('xpos_tags'), list of morphological features ('feats'), and list of named entity IOB tags ('iob_tags'), which are encoded as class labels.

If you are using this dataset in your research, please cite the following paper:

```
@article{Miličević_Ljubešić_2016,
  title={Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets},
  volume={4},
  url={https://revije.ff.uni-lj.si/slovenscina2/article/view/7007},
  DOI={10.4312/slo2.0.2016.2.156-188},
  number={2},
  journal={Slovenščina 2.0: empirical, applied and interdisciplinary research},
  author={Miličević, Maja and Ljubešić, Nikola},
  year={2016},
  month={Sep.},
  pages={156–188}
}
```
Provider:
classla
Summary of original information

Dataset overview

Basic information

  • Language: Croatian (hr)
  • License: CC BY-SA 4.0
  • Task category: other
  • Task IDs: lemmatization, named-entity-recognition, part-of-speech
  • Tags: structure-prediction, normalization, tokenization

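Dataset description

  • Source: 3,871 Croatian tweets
  • Processing: segmented into sentences and tokens, then annotated with normalized forms, lemmas, part-of-speech tags, and named entities
  • Number of samples:
    • Training samples: 6,339 sentences
    • Validation samples: 815 sentences
    • Test samples: 785 sentences
  • Sample features:
    • Sentence ID (sent_id)
    • List of tokens (tokens)
    • List of normalized tokens (norms)
    • List of lemmas (lemmas)
    • List of UPOS tags (upos_tags)
    • List of MULTEXT-East tags (xpos_tags)
    • List of morphological features (feats)
    • List of named entity IOB tags (iob_tags)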
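
To make the sample layout concrete, the following is a minimal Python sketch of one record with the feature names listed above, plus a helper that groups IOB tags into entity spans. The sentence, its annotations, the label inventory `IOB_LABELS`, and the helper `extract_entities` are all made up for illustration; the card only states that the IOB tags are encoded as class labels, so the real label names and their integer order may differ.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical label inventory: the card does not list the real class-label
# names, so this order and naming are assumptions for illustration only.
IOB_LABELS: List[str] = ["O", "B-per", "I-per", "B-loc", "I-loc"]

# A made-up sample that mirrors the documented feature names.
# All lists are parallel: one entry per token.
sample: Dict[str, object] = {
    "sent_id": "example.1",
    "tokens": ["Ana", "zivi", "u", "Zagrebu", "."],
    "norms": ["Ana", "živi", "u", "Zagrebu", "."],
    "lemmas": ["Ana", "živjeti", "u", "Zagreb", "."],
    "upos_tags": ["PROPN", "VERB", "ADP", "PROPN", "PUNCT"],
    "xpos_tags": ["Npfsn", "Vmr3s", "Sl", "Npmsl", "Z"],
    "feats": ["Case=Nom", "Person=3", "Case=Loc", "Case=Loc", "_"],
    "iob_tags": [1, 0, 0, 3, 0],  # integer class labels indexing IOB_LABELS
}


def extract_entities(
    tokens: List[str], tag_ids: List[int], labels: List[str]
) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity text, entity type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def close_span() -> None:
        # Emit the entity collected so far, if any, and reset the state.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag_id in zip(tokens, tag_ids):
        label = labels[tag_id]
        if label.startswith("B-"):
            close_span()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not continue an open entity.
            close_span()
    close_span()
    return spans


entities = extract_entities(sample["tokens"], sample["iob_tags"], IOB_LABELS)
print(entities)
```

Keeping every annotation layer as a parallel list means a token's normalized form, lemma, and tags can all be looked up by the same index.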

Citation information

  • Paper title: Tviterasi, tviteraši or twitteraši? Producing and analysing a normalised dataset of Croatian and Serbian tweets
  • Authors: Maja Miličević, Nikola Ljubešić
  • Journal: Slovenščina 2.0: empirical, applied and interdisciplinary research
  • Volume: 4
  • Issue: 2
  • Publication date: September 2016
  • Pages: 156-188
  • DOI: 10.4312/slo2.0.2016.2.156-188
  • URL: https://revije.ff.uni-lj.si/slovenscina2/article/view/7007