honzatoegel/lola-gramma-de-en

Name: honzatoegel/lola-gramma-de-en
Creator: honzatoegel
Published: 2023-09-01 04:39:13
License: apache-2.0

Hugging Face, updated 2023-09-01, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/honzatoegel/lola-gramma-de-en


Description:

---
license: apache-2.0
language:
- de
- en
tags:
- Languages
- Gramma
size_categories:
- n<1K
---

# Dataset Card

This grammar correction dataset is still a work in progress! Do not use it for any serious LLM task; see Issues below.

## Dataset summary

This dataset is used to fine-tune LLMs for German grammar correction for English speakers.

### Input

The input is a German sentence that may contain grammatical errors.

### Output

The output is the corrected sentence with minimal adjustments, followed by a list of all grammar corrections with explanations.

### Dataset creation

The incorrect input sentences were created manually; the corrections were pregenerated by GPT and then manually corrected. The focus was on explainable grammar rules and high data quality.

### Issues

The main issue is the small number of data points: the LLMs trained on it do not generalize well. The aim is to define various categories of grammatical errors and then add more examples through data augmentation.

#### Proposed grammar error categories (TODO)

- Punctuation - e.g. missing comma, comma in the wrong position
- Wrong word order
- Missing clause words (missing subject, object, verb, ...)
- Additional clause words which should not be used
- Misspellings & typos
- Conjugation of verbs - wrong person, wrong tense
- Declension of nouns + articles - wrong article, wrong case
- Wrong prepositions/adjectives for the given clause meaning

Each category should have at least 15-20 data points for training and 5 for evaluation.
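
The per-category target stated at the end of the card (15-20 training data points and 5 evaluation data points per error category) could be checked with a small stratified split. The following is only a minimal sketch, not the author's tooling: the `category` field, the category labels, and the function name `split_by_category` are assumptions made for illustration, since the card does not describe the dataset's actual column names.

```python
import random
from collections import defaultdict

EVAL_PER_CATEGORY = 5        # from the card: 5 data points per category for evaluation
MIN_TRAIN_PER_CATEGORY = 15  # from the card: at least 15-20 per category for training


def split_by_category(examples, seed=0):
    """Hold out EVAL_PER_CATEGORY examples per category; the rest go to training.

    `examples` is a list of dicts with a hypothetical "category" key.
    Returns (train, eval_set, underfilled), where `underfilled` lists the
    categories whose training share is below MIN_TRAIN_PER_CATEGORY.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_category = defaultdict(list)
    for example in examples:
        by_category[example["category"]].append(example)

    train, eval_set, underfilled = [], [], []
    for category in sorted(by_category):
        items = by_category[category][:]
        rng.shuffle(items)
        eval_set.extend(items[:EVAL_PER_CATEGORY])
        rest = items[EVAL_PER_CATEGORY:]
        train.extend(rest)
        if len(rest) < MIN_TRAIN_PER_CATEGORY:
            underfilled.append(category)
    return train, eval_set, underfilled
```

Splitting per category rather than over the whole pool means every error type is represented in evaluation, and the `underfilled` list points at the categories that most need data augmentation.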

Provider:

honzatoegel
