koalareads/kr-vocab-synth-20260327-v1

Name: koalareads/kr-vocab-synth-20260327-v1
Creator: koalareads
Published: 2026-04-04 09:56:47
License: MIT

Source: Hugging Face. Updated 2026-04-04. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/koalareads/kr-vocab-synth-20260327-v1

Description:

---
language:
- de
- en
license: mit
task_categories:
- conversational
- text-generation
tags:
- vocabulary
- language-learning
- german
- synthetic
- education
pretty_name: KoalaReads German Vocabulary Trainer Conversations
size_categories:
- 1K<n<10K
---

# KoalaReads German Vocabulary Trainer - Synthetic Conversations

## Dataset Description

This dataset contains synthetic conversations for training a vocabulary trainer model for German language learning. The conversations cover CEFR (Common European Framework of Reference) levels A1-C2 and include several exercise types designed to reinforce vocabulary acquisition through structured pedagogical interactions.

### Related Datasets

This dataset is generated from the vocabulary in [`koalareads/german-vocabulary-cefr-20260327`](https://huggingface.co/datasets/koalareads/german-vocabulary-cefr-20260327).

## Dataset Statistics

- **Total Training Records**: 2,000
- **Total Vocabulary Words**: 3,092
- **Total Messages**: 8,051
- **Average Messages per Record**: 4.03
- **Lexical Error Rate**: 22.75%
- **Generation Date**: 2026-04-04T11:56:44.360079

### CEFR Level Distribution

| Level | Records | Percentage |
|-------|---------|------------|
| A1 | 165 | 8.2% |
| A2 | 442 | 22.1% |
| B1 | 474 | 23.7% |
| B2 | 645 | 32.2% |
| C1 | 160 | 8.0% |
| C2 | 114 | 5.7% |

### Exercise Type Distribution

| Exercise Type | Count |
|---------------|-------|
| `T1_direct_translation` | 508 |
| `T5_sentence_production` | 503 |
| `T3_situational_recall` | 485 |
| `T4_reverse_translation` | 474 |
| `MN_noun_module` | 12 |
| `T2_sentence_completion_block` | 10 |
| `MV_verb_module` | 4 |
| `review_block` | 3 |
| `MA_semantic_cluster` | 1 |

## Visualizations

### Exercise Types

![Exercise Types](hf_dataset/exercise_types.png)

### CEFR Distribution

![CEFR Distribution](hf_dataset/cefr_distribution.png)

### POS Distribution

![POS Distribution](hf_dataset/pos_distribution.png)

## Sample Data

Here are a few randomly selected examples from the dataset:

### Example 1

**Exercise Type**: `T1_direct_translation`
**CEFR Level**: A2
**Target Word**: lang

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a vocabulary trainer. Focus on exactly ONE target word per task.\nKeep instructions short. After the learner answers, give immediate feedback:\n- say if it's correct (lexically)\n- give the correct answer (and the full correct sentence when relevant)\n- briefly reinforce the meaning/usage\nGrammar mistakes alone do not count as lexical errors.\nNever mention timing, pauses, or scheduling."
    },
    {
      "role": "assistant",
      "content": "How do you say \"long\" in German?"
    },
    {
      "role": "user",
      "content": "lang"
    },
    {
      "role": "assistant",
      "content": "✅ Correct: \"lang\".\nMeaning: \"long\".\nExample: \""
    }
  ],
  "metadata": {
    "schema_version": "kr_vocab_synth_v1",
    "cefr": "A2",
    "trainer": "A1-A2",
    "mode": "audio",
    "stage": 1,
    "exercise_type": "T1_direct_translation",
    "target_word": "lang",
    "pos": "adj",
    "lexical_error_count": 0,
    "grammar_error_present": false,
    "meaning_demonstrated": true,
    "noise_stripped": true
  }
}
```

### Example 2

**Exercise Type**: `T1_direct_translation`
**CEFR Level**: B2
**Target Word**: elegant

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a vocabulary trainer. Focus on exactly ONE target word per task.\nKeep instructions short. After the learner answers, give immediate feedback:\n- say if it's correct (lexically)\n- give the correct answer (and the full correct sentence when relevant)\n- briefly reinforce the meaning/usage\nGrammar mistakes alone do not count as lexical errors.\nNever mention timing, pauses, or scheduling."
    },
    {
      "role": "assistant",
      "content": "How do you say \"elegant\" in German?"
    },
    {
      "role": "user",
      "content": "I don't know."
    },
    {
      "role": "assistant",
      "content": "❌ Not quite. The correct word is \"elegant\".\nMeaning: \"elegant\".\nTry again: How do you say \"elegant\" in German?"
    }
  ],
  "metadata": {
    "schema_version": "kr_vocab_synth_v1",
    "cefr": "B2",
    "trainer": "B1-C2",
    "mode": "audio",
    "stage": 1,
    "exercise_type": "T1_direct_translation",
    "target_word": "elegant",
    "pos": "adj",
    "lexical_error_count": 1,
    "grammar_error_present": false,
    "meaning_demonstrated": false,
    "noise_stripped": true
  }
}
```

### Example 3

**Exercise Type**: `T1_direct_translation`
**CEFR Level**: B2
**Target Word**: Zahlungsweise

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a vocabulary trainer. Focus on exactly ONE target word per task.\nKeep instructions short. After the learner answers, give immediate feedback:\n- say if it's correct (lexically)\n- give the correct answer (and the full correct sentence when relevant)\n- briefly reinforce the meaning/usage\nGrammar mistakes alone do not count as lexical errors.\nNever mention timing, pauses, or scheduling."
    },
    {
      "role": "assistant",
      "content": "How do you say \"payment method\" in German?"
    },
    {
      "role": "user",
      "content": "Zahlungsweise"
    },
    {
      "role": "assistant",
      "content": "✅ Correct: \"Zahlungsweise\".\nMeaning: \"payment method\".\nExample: i"
    }
  ],
  "metadata": {
    "schema_version": "kr_vocab_synth_v1",
    "cefr": "B2",
    "trainer": "B1-C2",
    "mode": "audio",
    "stage": 1,
    "exercise_type": "T1_direct_translation",
    "target_word": "Zahlungsweise",
    "pos": "noun",
    "lexical_error_count": 0,
    "grammar_error_present": false,
    "meaning_demonstrated": true,
    "noise_stripped": true
  }
}
```

## Dataset Structure

### Data Fields

Each record contains:

- `messages`: Array of conversation messages in OpenAI chat format
  - `role`: One of `system`, `assistant`, or `user`
  - `content`: Message text content
- `metadata`: Training metadata including:
  - `schema_version`: Dataset schema version
  - `cefr`: CEFR level (A1-C2)
  - `trainer`: Trainer type (A1-A2 or B1-C2)
  - `mode`: Training mode (typically 'audio')
  - `stage`: Progressive stage number
  - `exercise_type`: Type of vocabulary exercise
  - `target_word`: German word being trained
  - `pos`: Part of speech
  - `lexical_error_count`: Number of lexical errors in learner responses
  - `grammar_error_present`: Boolean indicating grammar errors
  - `meaning_demonstrated`: Boolean indicating if meaning was demonstrated
  - `noise_stripped`: Boolean flag seen in the sample records (presumably whether noise was stripped from the conversation)

### Exercise Types

The dataset includes the following exercise types:

1. **T1_direct_translation**: Translate an English word or phrase into German
2. **T2_sentence_completion_block**: Fill in blanks with correct words (consolidation after 3 words)
3. **T3_situational_recall**: Choose the appropriate word for a given situation
4. **T4_reverse_translation**: Translate a German word into English
5. **T5_sentence_production**: Create a sentence using the target word
6. **MN_noun_module**: Noun-specific translation with article
7. **MV_verb_module**: Verb-specific translation with conjugation
8. **MA_semantic_cluster**: Choose between semantically similar words
9. **review_block**: Mixed review of 5-7 previously learned words

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset (all available splits)
dataset = load_dataset('koalareads/kr-vocab-synth-20260327-v1')

# Load a specific split
train_data = load_dataset('koalareads/kr-vocab-synth-20260327-v1', split='train')
```

### Example Use Case

This dataset is designed for fine-tuning small language models to act as vocabulary trainers. The conversational format allows the model to learn:

- How to give immediate, structured feedback
- When to correct lexical vs. grammar errors
- How to reinforce vocabulary with examples
- How to adapt difficulty based on CEFR levels

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{koalareads_vocab_synth,
  title = {KoalaReads German Vocabulary Trainer - Synthetic Conversations},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/koalareads/kr-vocab-synth-20260327-v1}}
}
```

## License

MIT License - Free to use for educational and research purposes.
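
## Working with the Metadata (Sketch)

The `metadata` fields described under Data Fields lend themselves to simple filtering and counting before fine-tuning. The following is a minimal sketch, not part of the original card: the helper names `summarize` and `filter_by_cefr` are hypothetical, the three in-memory records are trimmed copies of the sample records above, and the error share computed here (records with `lexical_error_count > 0`) is only one plausible reading of the "Lexical Error Rate" statistic, which the card does not define.

```python
from collections import Counter

# Trimmed, in-memory copies of the three sample records from the card.
# Only the metadata fields used below are kept; messages are omitted.
records = [
    {"messages": [], "metadata": {"cefr": "A2", "exercise_type": "T1_direct_translation",
                                  "target_word": "lang", "lexical_error_count": 0}},
    {"messages": [], "metadata": {"cefr": "B2", "exercise_type": "T1_direct_translation",
                                  "target_word": "elegant", "lexical_error_count": 1}},
    {"messages": [], "metadata": {"cefr": "B2", "exercise_type": "T1_direct_translation",
                                  "target_word": "Zahlungsweise", "lexical_error_count": 0}},
]


def summarize(rows):
    """Count records per CEFR level and exercise type, and compute the
    share of records with at least one lexical error (an assumed definition)."""
    rows = list(rows)
    cefr_counts = Counter(r["metadata"]["cefr"] for r in rows)
    exercise_counts = Counter(r["metadata"]["exercise_type"] for r in rows)
    with_errors = sum(1 for r in rows if r["metadata"]["lexical_error_count"] > 0)
    error_share = with_errors / len(rows) if rows else 0.0
    return cefr_counts, exercise_counts, error_share


def filter_by_cefr(rows, levels):
    """Keep only the records whose CEFR level is in `levels`."""
    wanted = set(levels)
    return [r for r in rows if r["metadata"]["cefr"] in wanted]


cefr_counts, exercise_counts, error_share = summarize(records)
b2_records = filter_by_cefr(records, ["B2"])
print(dict(cefr_counts), dict(exercise_counts), round(error_share, 4))
print([r["metadata"]["target_word"] for r in b2_records])
```

The same helpers should work on rows iterated from a split loaded with `load_dataset`, assuming each row exposes `metadata` as a nested dictionary as in the samples.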

Provided by:

koalareads
