CIIRC-NLP/agree-cs

Name: CIIRC-NLP/agree-cs
Creator: CIIRC-NLP
Published: 2024-09-03 11:54:13
License: No description available

Hugging Face | Updated 2024-09-03 | Indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/CIIRC-NLP/agree-cs

Resource description:

---
language:
- cs
license: cc
pretty_name: Czech grammar agreement dataset
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
- config_name: few-shot-split
  data_files:
  - split: train
    path: few-shot-split/train-*
  - split: test
    path: few-shot-split/test-*
dataset_info:
- config_name: default
  features:
  - name: answer_idx
    dtype: int64
  - name: choices
    sequence: string
  - name: sentence
    dtype: string
  splits:
  - name: test
    num_bytes: 91941
    num_examples: 627
  download_size: 52557
  dataset_size: 91941
- config_name: few-shot-split
  features:
  - name: sentence
    dtype: string
  - name: choices
    sequence: string
  - name: answer_idx
    dtype: int64
  splits:
  - name: train
    num_bytes: 2886
    num_examples: 20
  - name: test
    num_bytes: 89055
    num_examples: 607
  download_size: 55565
  dataset_size: 91941
---

# Czech grammar agreement dataset (AGREE)

This is an adapted and filtered test subset of the original [Czech grammar agreement dataset](https://nlp.fi.muni.cz/~xbaisa/agree/), designed to evaluate Czech language competence on the subject-verb agreement problem. Please respect the licensing and usage restrictions of the original dataset.

The examples were transformed to fit a missing word selection task. Sentences containing more than one marked verb were discarded. In the remaining sentences, the marked verb was replaced entirely with the "____" token. All five possible verb variants form the list of available choices, and the index of the correct choice is stored as the label.

Problematic examples were identified by progressively selecting examples that were answered incorrectly by Claude 3 Haiku, Claude 3 Sonnet, and GPT-4 Turbo. These 115 examples were then checked manually; 46 of them were identified as ambiguous and removed from the dataset. This led to a final count of 627 evaluation samples.

This dataset was created for use within the [Czech-Bench](https://gitlab.com/jirkoada/czech-bench) evaluation framework.
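The transformation described above can be sketched roughly as follows. This is a minimal illustration and not the authors' preprocessing code: the helper names `make_example` and `accuracy`, the whitespace tokenization, the list-of-indices representation of marked verbs, and the three Czech sentences are all assumptions made up for the example. Only the `sentence`, `choices`, and `answer_idx` field names and the "____" token come from the dataset description.

```python
from typing import Callable, Optional, Sequence

BLANK = "____"  # placeholder token named in the dataset description


def make_example(
    tokens: Sequence[str], marked: Sequence[int], variants: Sequence[str]
) -> Optional[dict]:
    """Turn a sentence with marked verb positions into a selection record.

    Hypothetical helper: `marked` holds the token indices of marked verbs,
    and `variants` lists the candidate verb forms.
    """
    # Sentences with more than one (or no) marked verb are discarded.
    if len(marked) != 1:
        return None
    pos = marked[0]
    masked = list(tokens)
    # The label is the index of the original verb among the variants.
    answer_idx = list(variants).index(masked[pos])
    masked[pos] = BLANK
    return {
        "sentence": " ".join(masked),
        "choices": list(variants),
        "answer_idx": answer_idx,
    }


def accuracy(
    records: Sequence[dict], predict: Callable[[str, Sequence[str]], int]
) -> float:
    """Fraction of records where `predict` returns the labelled index."""
    hits = sum(predict(r["sentence"], r["choices"]) == r["answer_idx"] for r in records)
    return hits / len(records)


# Made-up sentences for illustration; they are not taken from the dataset.
HRAT = ["hrál", "hrála", "hrálo", "hráli", "hrály"]
NAPSAT = ["napsal", "napsala", "napsalo", "napsali", "napsaly"]
BEZET = ["běžel", "běžela", "běželo", "běželi", "běžely"]

records = [
    make_example("Děti si hrály na hřišti".split(), [2], HRAT),
    make_example("Studenti napsali zkoušku".split(), [1], NAPSAT),
]
# Two marked verbs, so this sentence is dropped.
discarded = make_example("Psi běželi a štěkali".split(), [1, 3], BEZET)


def always_last(sentence: str, choices: Sequence[str]) -> int:
    """Trivial baseline that always picks the final choice."""
    return len(choices) - 1


print(records[0])
print(accuracy(records, always_last))
```

In practice, `predict` would wrap a language model call that reads the masked sentence and the choices and returns the chosen index. The real records should be loadable with the Hugging Face `datasets` library using the config names `default` and `few-shot-split` listed in the metadata.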
## Citation

```bibtex
@PhdThesis{Baisa2016thesis,
  AUTHOR = "Baisa, Vít",
  TITLE = "Byte Level Language Models [online]",
  YEAR = "2016 [cit. 2024-08-28]",
  TYPE = "Disertační práce",
  SCHOOL = "Masarykova univerzita, Fakulta informatiky, Brno",
  NOTE = "SUPERVISOR : Karel Pala",
  URL = "https://is.muni.cz/th/en6ay/",
}
```

Provided by:

CIIRC-NLP
