nilc-nlp/mac_morpho
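Dataset Overview
Dataset Summary
Mac-Morpho is a corpus of Brazilian Portuguese texts annotated with part-of-speech tags. Its first version was released in 2003, and it has since been revised twice to improve the quality of the resource. The corpus is split into training, development, and test sections, which account for 76%, 4%, and 20% of the total, respectively.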
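Supported Tasks and Leaderboards
- Task category: part-of-speech tagging
- Task ID: part-of-speech tagging
Languages
- Language: Portuguese
Dataset Structure
Data Instances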
An example from the Mac-Morpho dataset looks as follows:

```json
{
  "id": "0",
  "pos_tags": [14, 19, 14, 15, 22, 7, 14, 9, 14, 9, 3, 15, 3, 3, 24],
  "tokens": ["Jersei", "atinge", "média", "de", "Cr$", "1,4", "milhão", "na", "venda", "da", "Pinhal", "em", "São", "Paulo", "."]
}
```
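Data Fields
- id: the ID of the sample
- tokens: the tokens of the example text
- pos_tags: the part-of-speech tag of each token, given as an integer ID
Each tag ID is a zero-based index into the following list: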
"PREP+PROADJ", "IN", "PREP+PRO-KS", "NPROP", "PREP+PROSUB", "KC", "PROPESS", "NUM", "PROADJ", "PREP+ART", "KS", "PRO-KS", "ADJ", "ADV-KS", "N", "PREP", "PROSUB", "PREP+PROPESS", "PDEN", "V", "PREP+ADV", "PCP", "CUR", "ADV", "PU", "ART"
Data Splits
The data is split into training, validation, and test sets. The split sizes are as follows:
| Train | Validation | Test |
|---|---|---|
| 37948 | 1997 | 9987 |
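As a quick consistency check, the split sizes in the table can be converted to percentages and compared with the 76% / 4% / 20% figures given in the dataset summary. This is only a small arithmetic sketch; the dictionary keys are illustrative names chosen here.

```python
# Split sizes taken from the table above.
split_sizes = {"train": 37948, "validation": 1997, "test": 9987}

total = sum(split_sizes.values())

# Share of each split, rounded to the nearest whole percent.
shares = {name: round(100 * size / total) for name, size in split_sizes.items()}

print(total)   # 49932
print(shares)  # {'train': 76, 'validation': 4, 'test': 20}
```

The rounded shares agree with the proportions stated in the summary.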
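Dataset Creation
Source Data
- Annotation creators: expert-generated
- Language creators: found
- License: CC-BY-4.0
- Multilinguality: monolingual
- Size category: 10K<n<100K
- Source datasets: original
Additional Information
Licensing Information
- License: CC-BY-4.0
Citation Information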
@article{fonseca2015evaluating,
  title={Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese},
  author={Fonseca, Erick R and Rosa, Jo{\~a}o Lu{\'\i}s G and Alu{\'\i}sio, Sandra Maria},
  journal={Journal of the Brazilian Computer Society},
  volume={21},
  number={1},
  pages={2},
  year={2015},
  publisher={Springer}
}



