mideind/gec-test-set
Hugging Face, updated 2024-04-30, indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/mideind/gec-test-set
Resource description:
---
license: cc-by-4.0
language:
- is
---
Test data for Icelandic spell and grammar checking, created as part of the Icelandic Language Technology Programme.
The test data is divided into three different formats, type 1, 2 and 3. For every original file corrected, three files are included in the test data when possible: _original, _corrected and _metadata. The original and metadata files are always .txt files, but the format of the corrected file differs between types.
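The file triplets described above can be collected with a small helper. This is a minimal sketch: it assumes the role suffix is the last underscore-separated part of the file name before the extension, and the `IC3_001` names are hypothetical examples, not files confirmed to exist in the test set.

```python
from collections import defaultdict
from pathlib import PurePosixPath

ROLES = ("original", "corrected", "metadata")


def group_test_files(filenames):
    """Group file names by shared prefix, keyed by their role suffix."""
    groups = defaultdict(dict)
    for name in filenames:
        stem = PurePosixPath(name).stem          # drop the extension
        prefix, _, role = stem.rpartition("_")   # split off the trailing role
        if prefix and role in ROLES:
            groups[prefix][role] = name
    return dict(groups)


# Hypothetical file names, used only to illustrate the grouping.
example = group_test_files([
    "IC3_001_original.txt",
    "IC3_001_corrected.jsonl",
    "IC3_001_metadata.txt",
    "IGC_022_corrected.txt",   # IGC originals are not published
    "IGC_022_metadata.txt",
])
print(example["IGC_022"])
```

Texts with no published original (the IGC texts) simply end up with no "original" key, which matches the "when possible" caveat.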
The corrected texts are from the News2 subcorpus of the Icelandic Gigaword Corpus (IGC) (http://hdl.handle.net/20.500.12537/238) and the Icelandic Common Crawl Corpus (IC3) (https://huggingface.co/datasets/mideind/icelandic-common-crawl-corpus-IC3). The News2 corpus is published under an IGC-Corpus License, which does not allow third-party publishing, so the original IGC texts are not published. Instead, the titles of the original corpus files are provided for every IGC text in the test data. Original texts from IC3 are published as part of this data.
The 'get_corrected_igc.py' script is provided for users to obtain IGC texts corrected according to type 2 test data. Its first argument should be a test set file, e.g. 'IGC_022', and its second argument should be the path to the user's copy of IGC-News2-22.10.TEI. The script prints out the IGC text corrected according to the test data.
Type 1 has error spans marked and nothing beyond that. The corrected file is in .jsonl format with elements "id", "text", "entities", "relations" and "Comments". "id" is an automatically generated ID for the text and "text" shows the original text when it is possible. "entities" lists the marked spans, with an ID, label, start_offset and end_offset. "relations" and "Comments" are empty.
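A type 1 record could be read along these lines. This is a sketch, not the project's own tooling: the field names come from the description above, but the sample record is invented, the empty "relations" and "Comments" are assumed to be empty lists, and the end offset is assumed to be exclusive (as stated for type 3).

```python
import json


def read_error_spans(jsonl_line):
    """Return the marked spans of one type 1 record as plain dicts."""
    record = json.loads(jsonl_line)
    text = record.get("text") or ""   # the original text may be absent
    spans = []
    for entity in record["entities"]:
        start, end = entity["start_offset"], entity["end_offset"]
        spans.append({
            "id": entity["id"],
            "label": entity["label"],
            "start": start,
            "end": end,
            # Assumption: end_offset is exclusive, so slicing works directly.
            "surface": text[start:end] if text else None,
        })
    return spans


# Invented sample record, for illustration only.
sample_line = json.dumps({
    "id": 1,
    "text": "Hann for heim.",
    "entities": [{"id": 7, "label": "Villa", "start_offset": 5, "end_offset": 8}],
    "relations": [],
    "Comments": [],
})
spans = read_error_spans(sample_line)
print(spans)
```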
In type 2, errors are corrected and the corrected file is in .txt format. For texts taken from IGC-News2, the corrected file is a diff-file showing only corrections to the original file.
In type 3, errors are marked, corrected, given a severity score and the correction is explained. The corrected file is in .ann format, which shows all of these elements. Three lines in the corrected .ann file represent one annotated error:
- Line 1 consists of an annotation ID, followed by a TAB character and then a severity score. The severity score is on a scale of 1 to 5, 1 being a minor error and 5 being a severe error. Following this is the annotation span, which shows the start-offset and the end-offset. The start-offset is the index of the first character of the annotated span in the original .txt file, while the end-offset is the index of the first character after the annotated span. The character in the end-offset position is therefore not included in the annotation span. Following this is a TAB character and the text which was annotated in the original .txt file.
- Line 2 shows an event ID for the annotation followed by a TAB character, then the severity score repeated, followed by a colon and the annotation ID from line 1.
- Line 3 starts with a reference to the annotation ID number, i.e. #1 for the T1 annotation ID. Following this is a TAB character and then AnnotatorNotes referencing the respective event ID of the annotation. A TAB character follows and then the corrected text for the annotation, a '\' character and then an argument or explanation for the correction in question. This explanation is provided by the proofreader and usually consists of one or two sentences.
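The three-line layout described above can be parsed roughly as follows. This is a sketch under assumptions: the severity score is taken to be a bare integer separated from the offsets by spaces, the three lines of each error are taken to be consecutive, and the sample annotation is invented rather than copied from the test set.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedError:
    annotation_id: str
    event_id: str
    severity: int
    start: int        # index of the first annotated character
    end: int          # index of the first character after the span
    original: str
    corrected: str
    explanation: str


def parse_ann(content):
    """Parse .ann content where every three non-empty lines form one error."""
    lines = [ln for ln in content.splitlines() if ln.strip()]
    errors = []
    for i in range(0, len(lines) - 2, 3):
        # Line 1: annotation ID, TAB, "severity start end", TAB, annotated text
        ann_id, span_info, original = lines[i].split("\t", 2)
        severity, start, end = span_info.split()
        # Line 2: event ID, TAB, "severity:annotation ID"
        event_id, _event_ref = lines[i + 1].split("\t", 1)
        # Line 3: "#n", TAB, "AnnotatorNotes event ID", TAB, corrected\explanation
        _note_id, _note_ref, note = lines[i + 2].split("\t", 2)
        corrected, _, explanation = note.partition("\\")
        errors.append(AnnotatedError(
            ann_id, event_id, int(severity), int(start), int(end),
            original, corrected.strip(), explanation.strip(),
        ))
    return errors


# Invented sample annotation, for illustration only.
sample = (
    "T1\t2 5 8\tfor\n"
    "E1\t2:T1\n"
    "#1\tAnnotatorNotes E1\tfór\\Past tense of 'fara' takes an accent.\n"
)
errors = parse_ann(sample)
print(errors[0])
```

Because the end-offset is exclusive, `original_text[error.start:error.end]` should reproduce the annotated text when checked against the matching _original file.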
Each type serves a different purpose in evaluating a spell and grammar checking system. Type 1, with span marking, can be used to evaluate how well a model detects spans which may contain errors. This covers error detection accuracy, i.e. grammatical error detection, and the general error-finding capabilities of large language models. Type 2, with errors marked and corrected, can be used to evaluate grammatical error detection and correction, i.e. both error detection and error correction accuracy. This data can be used to calculate automatic evaluation scores, such as GLEU. Type 3 also enables the computation of error detection and error correction accuracy. In addition, the explanation given for each correction and the severity score can be used when training and evaluating future large language models.
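As an illustration of span-level error detection scoring, precision, recall and F1 can be computed over (start, end) pairs. This is a minimal sketch that counts a predicted span as correct only if it matches a gold span exactly. That matching criterion is an assumption, not something the test set prescribes, and the spans below are invented.

```python
def span_detection_scores(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over (start, end) spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented spans: one exact hit, one near miss, one gold span not found.
gold = [(5, 8), (20, 27), (40, 44)]
predicted = [(5, 8), (20, 26)]
scores = span_detection_scores(gold, predicted)
print(scores)
```

A more lenient variant could give credit for overlapping spans. Exact match is simply the strictest and easiest criterion to reproduce.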
Metadata files contain the following information:
- Text genre
- Citation
- Type of test data
- Error category (discussed below)
- Has the original text been modified to include an error of a particular category?
- Name of the proofreader
- The text’s author, if available
Error categories
Texts in the test set are collected based on a few pre-defined error categories, with a focus on error types which depend on context. Each text in the set contains at least one error which falls under one of these categories, and the category is listed in the metadata file. Although each text focuses on one error category, all other errors are marked and/or corrected as well. Corrected files of type 1 show each annotation with the label 'Villa' (Eng. 'error'), which is merely a general label denoting an annotation without any further information. The error category, or categories, noted in the metadata file is the main error category which pertains to the text in question. This is the test set's only error categorization, since the test set focuses on whole texts and the context within them.
The error categories are the following:
- Idiomatic expressions.
- Frequent errors, e.g. word space errors, punctuation, wrong spelling, incorrect case, capitalization or wrong prepositions.
- Context in the text, e.g. consistent choice of words or correct personal pronouns used throughout the text.
- Errors relating to cohesion or coherence.
- Semantic analysis, i.e. errors that depend on meaning.
Size
Type 1:
- IC3: 160 texts; the original files consist of 74,308 words.
- IGC-News2: 189 texts; the original files consist of 122,989 words.
Type 2:
- IC3: 98 texts; the original files consist of 47,872 words.
- IGC-News2: 257 texts; the original files consist of 101,325 words.
Type 3:
- IC3: 131 files; the original files consist of 31,455 words.
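The per-type and overall totals follow from the figures above. The snippet below is only a convenience sketch: the `SIZES` table and `totals` helper are illustrative names, and the numbers are copied from the list.

```python
# (number of texts, number of words) per source, copied from the list above.
SIZES = {
    "Type 1": {"IC3": (160, 74_308), "IGC-News2": (189, 122_989)},
    "Type 2": {"IC3": (98, 47_872), "IGC-News2": (257, 101_325)},
    "Type 3": {"IC3": (131, 31_455)},
}


def totals(sizes):
    """Sum texts and words per type, and across all types."""
    per_type = {
        name: (
            sum(texts for texts, _ in sources.values()),
            sum(words for _, words in sources.values()),
        )
        for name, sources in sizes.items()
    }
    overall = (
        sum(texts for texts, _ in per_type.values()),
        sum(words for _, words in per_type.values()),
    )
    return per_type, overall


per_type, overall = totals(SIZES)
for name, (texts, words) in per_type.items():
    print(f"{name}: {texts} texts, {words:,} words")
print(f"All types: {overall[0]} texts, {overall[1]:,} words")
```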
Provider:
mideind
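Summary of original information
Dataset overview
Dataset source and purpose
- Source: The dataset was created as part of the Icelandic Language Technology Programme for testing Icelandic spell and grammar checking.
- Purpose: Evaluating the performance of spell and grammar checking systems.
Dataset contents
- Data formats: The dataset contains three types (Type 1, 2, 3), each serving a different evaluation purpose.
  - Type 1: Error spans are marked; the corrected file is in .jsonl format.
  - Type 2: Errors are marked and corrected; the corrected file is in .txt format.
  - Type 3: Errors are marked, corrected, given a severity score and explained; the corrected file is in .ann format.
- Data sources: The News2 subcorpus of the Icelandic Gigaword Corpus and the Icelandic Common Crawl Corpus (IC3).
Dataset structure
- File types: For each original file corrected, up to three files are included: _original.txt, _corrected, _metadata.txt.
- Corrected file formats:
  - Type 1: .jsonl format with the elements "id", "text", "entities", "relations" and "Comments".
  - Type 2: .txt format, showing the corrections to the original file.
  - Type 3: .ann format, recording the correction, severity score and explanation for each error.
Metadata
- Contents: text genre, citation, type of test data, error category, whether the text was modified to include an error of a particular category, name of the proofreader, author (if available).
Error categories
- Definition: Texts in the dataset are collected based on pre-defined error categories; each text contains at least one error belonging to one of these categories.
- Categories: idiomatic expressions, frequent errors, context in the text, errors relating to cohesion or coherence, semantic analysis.
Dataset size
- Type 1:
  - IC3: 160 texts, 74,308 words.
  - IGC-News2: 189 texts, 122,989 words.
- Type 2:
  - IC3: 98 texts, 47,872 words.
  - IGC-News2: 257 texts, 101,325 words.
- Type 3:
  - IC3: 131 files, 31,455 words.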



