
HunSum-2

Source: arXiv. Updated: 2024-04-12. Indexed: 2024-06-21.
Download link:
https://github.com/botondbarta/HunSum
Description:
HunSum-2 is an open-source Hungarian-language dataset created by the Institute for Computer Science and Control in Hungary for training extractive and abstractive text summarization models. It was filtered, cleaned, and deduplicated from the Common Crawl corpus and contains 1.82 million documents. Preprocessing removed links, image captions, and social media embeds to improve data quality. The dataset is intended mainly for automatic text summarization in Hungarian, with documents drawn from multiple domains.
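
The cleaning steps mentioned in the description (link removal and deduplication) can be illustrated with a minimal sketch. This is not the HunSum-2 authors' pipeline: the function names `strip_links` and `deduplicate`, the URL pattern, and the exact-match hashing strategy are all assumptions chosen for illustration, and the sample sentences are invented.

```python
import hashlib
import re

# Assumed pattern: matches http(s) URLs and bare "www." links.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def strip_links(text: str) -> str:
    """Remove URLs and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()


def deduplicate(documents):
    """Keep the first occurrence of each document.

    Documents are compared case-insensitively after link removal,
    using a SHA-256 digest so that only hashes are held in memory.
    """
    seen = set()
    unique = []
    for doc in documents:
        cleaned = strip_links(doc)
        digest = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()
        if cleaned and digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique


# Invented sample documents, for illustration only.
docs = [
    "Budapest időjárása ma napos. https://example.com/cikk",
    "Budapest időjárása ma napos.",
    "A parlament új törvényt fogadott el.",
]
cleaned_docs = deduplicate(docs)
```

Exact-match hashing only catches documents that are identical after cleaning; a corpus of this size would likely also need near-duplicate detection, which this sketch does not attempt.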
Provider:
Institute for Computer Science and Control, Hungary
Created:
2024-04-05