PORTULAN/parlamento-pt
Source: Hugging Face | Updated: 2023-05-12 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/PORTULAN/parlamento-pt
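A minimal sketch of how the corpus might be inspected from Python, assuming the Hugging Face `datasets` package is installed and the repository can be loaded directly by `load_dataset`. The helper name `peek` and the split name `"train"` are assumptions for illustration, not details confirmed by this listing; adjust them to whatever the repository actually exposes.

```python
from itertools import islice

REPO_ID = "PORTULAN/parlamento-pt"  # repository ID taken from the listing title


def peek(n=3, split="train"):
    """Return the first n records of one split without downloading the whole corpus.

    `split="train"` is an assumed split name; check the dataset page if it fails.
    """
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    # streaming=True iterates over the remote files instead of materializing them.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(stream, n))

# To route requests through the mirror in the link above, the usual approach is to
# set the HF_ENDPOINT environment variable before importing `datasets`
# (assumption: the installed huggingface_hub version honors it).
# Example use: for record in peek(): print(record)
```

Streaming is used here because the listed size is in the 1M to 10M range, so looking at a handful of records first is cheaper than a full download.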
Description:
ParlamentoPT is a Portuguese-language dataset built from publicly available debate transcripts of the Portuguese Parliament. It was collected to train the Albertina-PT* language model and to support research on foundation models for Portuguese. The data is original source text with no manual annotation. Its task categories are text generation and fill-mask, and its task IDs are language modeling and masked language modeling. The dataset was developed jointly by the University of Lisbon and the University of Porto.
Provider:
PORTULAN
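Summary of original information
Dataset overview
Basic information
- Name: ParlamentoPT
- Language: Portuguese (pt)
- License: other
- Multilinguality: monolingual
- Size: 1M<n<10M
- Source data: original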
Tasks and applications
- Task categories:
  - text generation
  - fill-mask
- Task IDs:
  - language modeling
  - masked language modeling
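To make the fill-mask task concrete, the sketch below shows how a masked-language-modeling training pair can be derived from raw text. It is illustrative only: the 15% masking rate, the `[MASK]` placeholder, whitespace tokenization, and the sample sentence are assumptions made for this example, not details taken from the dataset or from the Albertina training setup.

```python
import random

MASK = "[MASK]"  # placeholder; real tokenizers define their own mask token


def mask_tokens(tokens, rate=0.15, seed=0):
    """Return (masked, labels) for one token sequence.

    Each token is replaced by MASK with probability `rate`. `labels` holds the
    original token at masked positions and None elsewhere, so a model is only
    scored on the positions it has to fill in.
    """
    rng = random.Random(seed)  # seeded for reproducible masking
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < rate:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Made-up sentence for illustration, not a record from the dataset.
sentence = "O debate foi encerrado pelo presidente da assembleia".split()
masked, labels = mask_tokens(sentence, rate=0.3, seed=1)
```

Plain language modeling needs no such preprocessing, since the model predicts each next token from the preceding ones. Production pipelines usually mask subword tokens rather than whitespace-separated words.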
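Dataset usage
- Use: training the Albertina-PT* language model
- Partner institutions: University of Lisbon and University of Porto
Data source
- Source: the Portuguese Parliament portal, in accordance with its open data policy
Citation
- Reference: arXiv:2305.06721
- Authors: João Rodrigues, Luís Gomes, João Silva, António Branco, Rodrigo Santos, Henrique Lopes Cardoso, Tomás Osório
- Title: Advancing Neural Encoding of Portuguese with Transformer Albertina PT-*
- Year: 2023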



