classla/janes_tag
Source: Hugging Face. Updated 2022-10-25; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/classla/janes_tag
Resource description:
---
language:
- si
license:
- cc-by-sa-4.0
task_categories:
- other
task_ids:
- lemmatization
- part-of-speech
tags:
- structure-prediction
- normalization
- tokenization
---
The dataset contains 6273 training samples, 762 validation samples and 749 test samples.
Each sample represents a sentence and includes the following features: sentence ID ('sent\_id'),
list of tokens ('tokens'), list of normalised word forms ('norms'), list of lemmas ('lemmas'),
list of Multext-East tags ('xpos\_tags'), list of morphological features ('feats'),
and list of UPOS tags ('upos\_tags'); the UPOS tags are encoded as class labels.
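
The description implies that every per-token list in a sample has one entry per token. A minimal sketch of that alignment check follows, assuming the field names listed above; the `check_sample` helper and the example record are hypothetical, and the record uses placeholder values rather than real corpus entries.

```python
from typing import Dict

# Per-token fields named in the dataset description.
TOKEN_FIELDS = ["tokens", "norms", "lemmas", "xpos_tags", "feats", "upos_tags"]


def check_sample(sample: Dict[str, object]) -> int:
    """Return the token count if all per-token lists have the same length."""
    # Assumption: sent_id is stored as a string.
    if not isinstance(sample.get("sent_id"), str):
        raise ValueError("sent_id must be a string")
    lengths = {name: len(sample[name]) for name in TOKEN_FIELDS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"misaligned fields: {lengths}")
    return lengths["tokens"]


# Placeholder values only; these are NOT real corpus entries.
example = {
    "sent_id": "example.1",
    "tokens": ["tok_a", "tok_b", "."],
    "norms": ["norm_a", "norm_b", "."],
    "lemmas": ["lemma_a", "lemma_b", "."],
    "xpos_tags": ["XPOS_A", "XPOS_B", "XPOS_C"],
    "feats": ["FEATS_A", "FEATS_B", "FEATS_C"],
    "upos_tags": [10, 15, 12],  # integer class-label ids (assumed encoding)
}

print(check_sample(example))
```

With the Hugging Face `datasets` library installed, `load_dataset("classla/janes_tag")` is the usual way to fetch a dataset like this one; whether it loads without extra arguments is not confirmed here.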
Provided by:
classla
Summary of the original information
Dataset overview
Basic information
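- Language: Slovenian (tagged si in the card metadata, which is Slovenia's country code; the ISO 639-1 language code for Slovenian is sl)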
- License: CC-BY-SA-4.0
- Task category: other
- Task IDs:
  - lemmatization
  - part-of-speech
- Tags:
  - structure-prediction
  - normalization
  - tokenization
Dataset composition
- Training samples: 6273
- Validation samples: 762
- Test samples: 749
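
As a quick check on the split sizes above, the following sketch computes the total sample count and each split's share. The split key names (`train`, `validation`, `test`) are assumed for illustration; only the counts come from the card.

```python
# Sample counts as stated in the dataset card.
splits = {"train": 6273, "validation": 762, "test": 749}

total = sum(splits.values())

# Share of each split as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)   # 7784
print(shares)  # {'train': 80.6, 'validation': 9.8, 'test': 9.6}
```

The counts correspond to roughly an 80/10/10 split.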
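Sample features
Each sample represents one sentence and has the following features:
- Sentence ID (sent_id)
- List of tokens (tokens)
- List of normalised word forms (norms)
- List of lemmas (lemmas)
- List of Multext-East tags (xpos_tags)
- List of morphological features (feats)
- List of UPOS tags (upos_tags), encoded as class labels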
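
"Encoded as class labels" usually means that each tag is stored as an integer id that maps to a tag name. The sketch below illustrates that idea in plain Python with the 17 Universal Dependencies UPOS tags. The alphabetical id order and the `encode`/`decode` helpers are assumptions for illustration; the actual id order used by the dataset is not stated in the card.

```python
from typing import List

# The 17 universal POS tags; alphabetical id order is an ASSUMPTION here.
UPOS_NAMES = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]
UPOS_IDS = {name: i for i, name in enumerate(UPOS_NAMES)}


def decode(ids: List[int]) -> List[str]:
    """Map integer class-label ids to tag names."""
    return [UPOS_NAMES[i] for i in ids]


def encode(names: List[str]) -> List[int]:
    """Map tag names back to integer class-label ids."""
    return [UPOS_IDS[name] for name in names]


print(decode([10, 15, 12]))  # ['PRON', 'VERB', 'PUNCT']
```

When the data is loaded with the Hugging Face `datasets` library, the authoritative mapping would come from the feature definition of `upos_tags` rather than from a hard-coded list like this one.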



