recogna-nlp/recognasumm
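RecognaSumm Dataset
Introduction
RecognaSumm is a novel and comprehensive dataset designed specifically for automatic text summarization in Portuguese. It stands out for the diversity of its sources: the news articles were collected from multiple outlets, including agencies and online news portals. The dataset was built through web scraping and careful curation, yielding a rich and representative collection of documents that covers a wide range of topics and journalistic styles. RecognaSumm was created to fill an important gap in Portuguese summarization research by providing a basis for training and evaluating automatic summarization models.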
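News Categories
| Category | Number of News Articles |
|---|---|
| Brazil | 14,131 |
| Economy | 12,613 |
| Entertainment | 5,337 |
| Health | 24,921 |
| Politics | 29,909 |
| Science and Technology | 15,135 |
| Sports | 2,915 |
| Travel and Gastronomy | 2,893 |
| World | 27,418 |
| Total | 135,272 |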
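The category counts above imply a noticeably uneven class distribution, which may matter when sampling or splitting by topic. The following is a minimal sketch that recomputes the total and each category's share from the table. The dictionary and function names are illustrative, not part of the dataset. The `load_recognasumm` helper assumes that the repository identifier at the top of this page is a Hugging Face Hub dataset id and that the `datasets` library is installed; it is not called here, and no column names are assumed.

```python
# Counts copied from the News Categories table above.
CATEGORY_COUNTS = {
    "Brazil": 14_131,
    "Economy": 12_613,
    "Entertainment": 5_337,
    "Health": 24_921,
    "Politics": 29_909,
    "Science and Technology": 15_135,
    "Sports": 2_915,
    "Travel and Gastronomy": 2_893,
    "World": 27_418,
}
REPORTED_TOTAL = 135_272  # "Total" row of the table


def category_shares(counts):
    """Return each category's fraction of the overall article count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def load_recognasumm():
    """Hypothetical loader: assumes the repo id is a Hugging Face Hub
    dataset id and that the `datasets` package is available."""
    from datasets import load_dataset  # imported lazily; needs network access
    return load_dataset("recogna-nlp/recognasumm")


shares = category_shares(CATEGORY_COUNTS)

# Print categories from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24}{CATEGORY_COUNTS[name]:>8,}{share:>8.1%}")
```

The per-category counts sum to the reported total of 135,272, so the table is internally consistent.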
Citation
RecognaSumm: A Novel Brazilian Summarization Dataset (PROPOR 2024)
@inproceedings{paiola-etal-2024-recognasumm,
    title = "{R}ecogna{S}umm: A Novel {B}razilian Summarization Dataset",
    author = "Paiola, Pedro Henrique and Garcia, Gabriel Lino and Jodas, Danilo Samuel and Correia, Jo{\~a}o Vitor Mariano and Sugi, Luis Afonso and Papa, Jo{\~a}o Paulo",
    editor = "Gamallo, Pablo and Claro, Daniela and Teixeira, Ant{\'o}nio and Real, Livy and Garcia, Marcos and Oliveira, Hugo Gon{\c{c}}alo and Amaro, Raquel",
    booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1",
    month = mar,
    year = "2024",
    address = "Santiago de Compostela, Galicia/Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.propor-1.63",
    pages = "575--579",
}



