CZEch NEws Classification dataset (CZE-NEC)
Source: arXiv | Updated: 2023-07-20 | Indexed: 2024-06-21
Download link:
https://github.com/hynky1999/Czech-News-Classification-dataset
Description:
The CZE-NEC dataset was created by the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University. It contains 1.62 million news articles published between 2000 and 2022 by multiple Czech news sources. The articles were collected from CommonCrawl archives and then filtered and processed to ensure data quality. CZE-NEC defines four classification tasks: news source classification, news category classification, inferred author gender classification, and publication date prediction. The dataset is intended as a comprehensive evaluation benchmark for Czech natural language processing models, particularly for long texts and multi-task settings.
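
To illustrate the four-task setup described above, the following minimal Python sketch shows one way a single article could carry one label per task, and how accuracy might be reported separately for each task. The field names (`source`, `category`, `author_gender`, `publication_date`), the placeholder label values, and the assumption that some labels may be missing are all hypothetical; they are not taken from the dataset's actual schema, so consult the repository linked above for the real format.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass
class ArticleLabels:
    """Hypothetical per-article labels; names are illustrative, not the real schema."""
    source: Optional[str] = None
    category: Optional[str] = None
    author_gender: Optional[str] = None
    publication_date: Optional[str] = None


# One task per label field, in declaration order.
TASKS = [f.name for f in fields(ArticleLabels)]


def per_task_accuracy(
    gold: List[ArticleLabels], pred: List[ArticleLabels]
) -> Dict[str, Optional[float]]:
    """Compute accuracy per task, skipping articles whose gold label is missing."""
    result: Dict[str, Optional[float]] = {}
    for task in TASKS:
        pairs = [
            (getattr(g, task), getattr(p, task))
            for g, p in zip(gold, pred)
            if getattr(g, task) is not None
        ]
        # None signals that no article had a gold label for this task.
        result[task] = sum(a == b for a, b in pairs) / len(pairs) if pairs else None
    return result


# Placeholder values only; these are not real dataset labels.
gold = [
    ArticleLabels("source_a", "category_x", "f", "2020"),
    ArticleLabels("source_b", None, "m", "2021"),
]
pred = [
    ArticleLabels("source_a", "category_y", "f", "2020"),
    ArticleLabels("source_a", "category_x", "m", "2020"),
]
scores = per_task_accuracy(gold, pred)
```

Reporting one score per task, rather than a single pooled number, keeps a model's weakness on one task from being hidden by strength on another.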
Provided by:
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University
Created:
2023-07-20



