CIIRC-NLP/czech_news_simple-cs

Name: CIIRC-NLP/czech_news_simple-cs
Creator: CIIRC-NLP
Published: 2024-09-03 11:59:10
License: odc-by

Source: Hugging Face. Updated: 2024-09-03. Indexed: 2025-04-12.

Download link:

https://hf-mirror.com/datasets/CIIRC-NLP/czech_news_simple-cs

Description:

---
license: odc-by
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
- config_name: few-shot-split
  data_files:
  - split: train
    path: few-shot-split/train-*
  - split: test
    path: few-shot-split/test-*
dataset_info:
- config_name: default
  features:
  - name: url
    dtype: string
  - name: headline
    dtype: string
  - name: brief
    dtype: string
  - name: keywords
    sequence: string
  - name: category
    dtype: int64
  - name: content
    dtype: string
  - name: category_unclean
    dtype: string
  splits:
  - name: test
    num_bytes: 3898211
    num_examples: 1000
  download_size: 2670339
  dataset_size: 3898211
- config_name: few-shot-split
  features:
  - name: url
    dtype: string
  - name: authors
    sequence: string
  - name: headline
    dtype: string
  - name: brief
    dtype: string
  - name: keywords
    sequence: string
  - name: category
    dtype: int64
  - name: content
    dtype: string
  - name: comments_num
    dtype: float64
  - name: server
    dtype: int64
  - name: category_unclean
    dtype: string
  - name: authors_gender
    sequence: int64
  - name: authors_cum_gender
    dtype: int64
  - name: day_of_week
    dtype: int64
  - name: date
    dtype: timestamp[us]
  - name: __index_level_0__
    dtype: int64
  - name: __index_level_1__
    dtype: int64
  splits:
  - name: train
    num_bytes: 92198
    num_examples: 20
  - name: test
    num_bytes: 3887952
    num_examples: 980
  download_size: 2719237
  dataset_size: 3980150
---

# Simplified Czech News dataset

This is a simplified and subsampled test subset from the original [czech_news_dataset_v2](https://huggingface.co/datasets/hynky/czech_news_dataset_v2).
Only 5 basic news categories are considered:

1. Zahraniční (Foreign)
2. Domácí (Local)
3. Sport (Sport)
4. Kultura (Culture)
5. Ekonomika (Economy)

The test set includes 200 examples per category, 1000 examples in total. Apart from the category label, each example also contains the article's headline, brief summary, full textual content, optional keywords, original category specification, and URL.
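The configs and fields described above can be explored with a short script. The following is a minimal sketch, not part of the original card: it assumes the Hugging Face `datasets` library is installed and that the repository is reachable. The helper names (`load_split`, `category_counts`, `is_balanced`) are hypothetical, and the `CATEGORY_NAMES` mapping assumes the integer `category` values follow the card's 1 to 5 numbering, which the card does not confirm.

```python
from collections import Counter

DATASET_ID = "CIIRC-NLP/czech_news_simple-cs"

# ASSUMPTION: integer labels follow the numbering of the list in the card.
# Check against the `category_unclean` column before relying on this mapping.
CATEGORY_NAMES = {
    1: "Zahraniční (Foreign)",
    2: "Domácí (Local)",
    3: "Sport (Sport)",
    4: "Kultura (Culture)",
    5: "Ekonomika (Economy)",
}


def load_split(config="default", split="test"):
    """Load one split of one config (requires network access)."""
    # Imported lazily so the helpers below stay usable without `datasets`.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, config, split=split)


def category_counts(rows):
    """Count examples per integer `category` label."""
    return Counter(row["category"] for row in rows)


def is_balanced(rows):
    """Return True if every observed category has the same number of rows."""
    return len(set(category_counts(rows).values())) <= 1


# Example usage (commented out because it downloads data):
# test_set = load_split("default", "test")
# shots = load_split("few-shot-split", "train")
# for label, n in sorted(category_counts(test_set).items()):
#     print(label, CATEGORY_NAMES.get(label, "unknown"), n)
```

If the card's description holds, `is_balanced` should return `True` for the `default` test split (200 examples per category). The `few-shot-split` config exposes extra columns such as `authors` and `date`, so code shared between configs should rely only on the common fields.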
This dataset was created for use within the [Czech-Bench](https://gitlab.com/jirkoada/czech-bench) evaluation framework.

## Citation

```bibtex
@misc{kydlíček2023datasetstrongbaselinesclassification,
  title={A Dataset and Strong Baselines for Classification of Czech News Texts},
  author={Hynek Kydlíček and Jindřich Libovický},
  year={2023},
  eprint={2307.10666},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2307.10666},
}
```

Provider:

CIIRC-NLP
