koalareads/kr-vocab-synth-20260327-v1
Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/koalareads/kr-vocab-synth-20260327-v1
Description:
---
language:
- de
- en
license: mit
task_categories:
- conversational
- text-generation
tags:
- vocabulary
- language-learning
- german
- synthetic
- education
pretty_name: KoalaReads German Vocabulary Trainer Conversations
size_categories:
- 1K<n<10K
---
# KoalaReads German Vocabulary Trainer - Synthetic Conversations
## Dataset Description
This dataset contains synthetic conversations for fine-tuning a model that acts as a vocabulary trainer for learners of German.
The conversations cover CEFR (Common European Framework of Reference for Languages) levels A1-C2 and include several exercise types
designed to reinforce vocabulary acquisition through structured pedagogical interactions.
### Related Datasets
This dataset was generated from the vocabulary in [`koalareads/german-vocabulary-cefr-20260327`](https://huggingface.co/datasets/koalareads/german-vocabulary-cefr-20260327).
## Dataset Statistics
- **Total Training Records**: 2,000
- **Total Vocabulary Words**: 3,092
- **Total Messages**: 8,051
- **Average Messages per Record**: 4.03
- **Lexical Error Rate**: 22.75%
- **Generation Date**: 2026-04-04T11:56:44.360079
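
The summary figures above can be recomputed from the records themselves. The following is a minimal sketch, not the script used to produce this card: `summarize` is a hypothetical helper, and it assumes that the lexical error rate is the share of records whose `lexical_error_count` is greater than zero, which the card does not state.

```python
from typing import Any


def summarize(records: list[dict[str, Any]]) -> dict[str, float]:
    """Recompute card-style summary statistics from a list of records."""
    total_records = len(records)
    total_messages = sum(len(r["messages"]) for r in records)
    # Assumption: a record counts toward the error rate when it has at
    # least one lexical error; the card does not define the rate.
    with_errors = sum(
        1 for r in records if r["metadata"]["lexical_error_count"] > 0
    )
    return {
        "records": total_records,
        "messages": total_messages,
        "avg_messages": round(total_messages / total_records, 2),
        "lexical_error_rate_pct": round(100 * with_errors / total_records, 2),
    }


# Two toy records shaped like the samples in this card.
toy = [
    {"messages": [{}] * 4, "metadata": {"lexical_error_count": 0}},
    {"messages": [{}] * 4, "metadata": {"lexical_error_count": 1}},
]
print(summarize(toy))
```

With the published totals, 8,051 messages over 2,000 records gives 4.0255, which rounds to the 4.03 reported above.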
### CEFR Level Distribution
| Level | Records | Percentage |
|-------|---------|------------|
| A1 | 165 | 8.2% |
| A2 | 442 | 22.1% |
| B1 | 474 | 23.7% |
| B2 | 645 | 32.2% |
| C1 | 160 | 8.0% |
| C2 | 114 | 5.7% |
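
The percentages in this table follow directly from the record counts. As a quick check, the sketch below recomputes each level's share and a running cumulative share from the counts copied out of the table; the variable names are illustrative only.

```python
# Record counts per CEFR level, copied from the table above.
cefr_counts = {"A1": 165, "A2": 442, "B1": 474, "B2": 645, "C1": 160, "C2": 114}

total = sum(cefr_counts.values())

# Share of each level as a percentage of all records.
cefr_share = {level: 100 * n / total for level, n in cefr_counts.items()}

# Cumulative share up to and including each level.
running = 0.0
cumulative = {}
for level, share in cefr_share.items():
    running += share
    cumulative[level] = running

for level in cefr_counts:
    print(f"{level}: {cefr_share[level]:5.2f}%  (cumulative {cumulative[level]:6.2f}%)")
```

The counts sum to 2,000, matching the record total. A1 and B2 come out at exactly 8.25% and 32.25%, which the table shows as 8.2% and 32.2%. The intermediate levels B1 and B2 together account for roughly 56% of the records.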
### Exercise Type Distribution
| Exercise Type | Count |
|---------------|-------|
| `T1_direct_translation` | 508 |
| `T5_sentence_production` | 503 |
| `T3_situational_recall` | 485 |
| `T4_reverse_translation` | 474 |
| `MN_noun_module` | 12 |
| `T2_sentence_completion_block` | 10 |
| `MV_verb_module` | 4 |
| `review_block` | 3 |
| `MA_semantic_cluster` | 1 |
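
The distribution is heavily skewed: four exercise types make up almost all records. The sketch below separates well-represented types from sparse ones using the counts from the table. The 1% cut-off (`RARE_THRESHOLD`) is an arbitrary assumption chosen for illustration, not a value from the dataset authors.

```python
# Record counts per exercise type, copied from the table above.
exercise_counts = {
    "T1_direct_translation": 508,
    "T5_sentence_production": 503,
    "T3_situational_recall": 485,
    "T4_reverse_translation": 474,
    "MN_noun_module": 12,
    "T2_sentence_completion_block": 10,
    "MV_verb_module": 4,
    "review_block": 3,
    "MA_semantic_cluster": 1,
}

total = sum(exercise_counts.values())
RARE_THRESHOLD = 0.01  # hypothetical cut-off: under 1% of all records

core = {name: n for name, n in exercise_counts.items() if n / total >= RARE_THRESHOLD}
rare = {name: n for name, n in exercise_counts.items() if n / total < RARE_THRESHOLD}

print(f"core types: {len(core)} covering {sum(core.values())} records")
print(f"rare types: {len(rare)} covering {sum(rare.values())} records")
```

Under this threshold the five sparse types share only 30 of the 2,000 records, so per-type evaluation on them would rest on very few examples.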
## Visualizations
Charts in the original card: Exercise Types, CEFR Distribution, POS Distribution.
## Sample Data
Here are a few randomly selected examples from the dataset:
### Example 1
**Exercise Type**: `T1_direct_translation`
**CEFR Level**: A2
**Target Word**: lang
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a vocabulary trainer. Focus on exactly ONE target word per task.\nKeep instructions short. After the learner answers, give immediate feedback:\n- say if it's correct (lexically)\n- give the correct answer (and the full correct sentence when relevant)\n- briefly reinforce the meaning/usage\nGrammar mistakes alone do not count as lexical errors.\nNever mention timing, pauses, or scheduling."
    },
    {
      "role": "assistant",
      "content": "How do you say \"long\" in German?"
    },
    {
      "role": "user",
      "content": "lang"
    },
    {
      "role": "assistant",
      "content": "✅ Correct: \"lang\".\nMeaning: \"long\".\nExample: \""
    }
  ],
  "metadata": {
    "schema_version": "kr_vocab_synth_v1",
    "cefr": "A2",
    "trainer": "A1-A2",
    "mode": "audio",
    "stage": 1,
    "exercise_type": "T1_direct_translation",
    "target_word": "lang",
    "pos": "adj",
    "lexical_error_count": 0,
    "grammar_error_present": false,
    "meaning_demonstrated": true,
    "noise_stripped": true
  }
}
```
### Example 2
**Exercise Type**: `T1_direct_translation`
**CEFR Level**: B2
**Target Word**: elegant
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a vocabulary trainer. Focus on exactly ONE target word per task.\nKeep instructions short. After the learner answers, give immediate feedback:\n- say if it's correct (lexically)\n- give the correct answer (and the full correct sentence when relevant)\n- briefly reinforce the meaning/usage\nGrammar mistakes alone do not count as lexical errors.\nNever mention timing, pauses, or scheduling."
    },
    {
      "role": "assistant",
      "content": "How do you say \"elegant\" in German?"
    },
    {
      "role": "user",
      "content": "I don't know."
    },
    {
      "role": "assistant",
      "content": "❌ Not quite. The correct word is \"elegant\".\nMeaning: \"elegant\".\nTry again: How do you say \"elegant\" in German?"
    }
  ],
  "metadata": {
    "schema_version": "kr_vocab_synth_v1",
    "cefr": "B2",
    "trainer": "B1-C2",
    "mode": "audio",
    "stage": 1,
    "exercise_type": "T1_direct_translation",
    "target_word": "elegant",
    "pos": "adj",
    "lexical_error_count": 1,
    "grammar_error_present": false,
    "meaning_demonstrated": false,
    "noise_stripped": true
  }
}
```
### Example 3
**Exercise Type**: `T1_direct_translation`
**CEFR Level**: B2
**Target Word**: Zahlungsweise
```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a vocabulary trainer. Focus on exactly ONE target word per task.\nKeep instructions short. After the learner answers, give immediate feedback:\n- say if it's correct (lexically)\n- give the correct answer (and the full correct sentence when relevant)\n- briefly reinforce the meaning/usage\nGrammar mistakes alone do not count as lexical errors.\nNever mention timing, pauses, or scheduling."
    },
    {
      "role": "assistant",
      "content": "How do you say \"payment method\" in German?"
    },
    {
      "role": "user",
      "content": "Zahlungsweise"
    },
    {
      "role": "assistant",
      "content": "✅ Correct: \"Zahlungsweise\".\nMeaning: \"payment method\".\nExample: i"
    }
  ],
  "metadata": {
    "schema_version": "kr_vocab_synth_v1",
    "cefr": "B2",
    "trainer": "B1-C2",
    "mode": "audio",
    "stage": 1,
    "exercise_type": "T1_direct_translation",
    "target_word": "Zahlungsweise",
    "pos": "noun",
    "lexical_error_count": 0,
    "grammar_error_present": false,
    "meaning_demonstrated": true,
    "noise_stripped": true
  }
}
```
## Dataset Structure
### Data Fields
Each record contains:
- `messages`: Array of conversation messages in OpenAI chat format
  - `role`: One of `system`, `assistant`, or `user`
  - `content`: Message text content
- `metadata`: Training metadata including:
  - `schema_version`: Dataset schema version
  - `cefr`: CEFR level (A1-C2)
  - `trainer`: Trainer type (A1-A2 or B1-C2)
  - `mode`: Training mode (typically 'audio')
  - `stage`: Progressive stage number
  - `exercise_type`: Type of vocabulary exercise
  - `target_word`: German word being trained
  - `pos`: Part of speech
  - `lexical_error_count`: Number of lexical errors in learner responses
  - `grammar_error_present`: Boolean indicating grammar errors
  - `meaning_demonstrated`: Boolean indicating if meaning was demonstrated
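
The field list above can be turned into a lightweight structural check. The sketch below is one possible validator, written for this card rather than taken from the dataset's tooling: `validate_record` is a hypothetical helper, and it only checks for the roles, metadata keys, and CEFR levels that the card itself names.

```python
from typing import Any

ALLOWED_ROLES = {"system", "assistant", "user"}
# Metadata keys named in the field description above.
REQUIRED_METADATA = {
    "schema_version", "cefr", "trainer", "mode", "stage", "exercise_type",
    "target_word", "pos", "lexical_error_count", "grammar_error_present",
    "meaning_demonstrated",
}
CEFR_LEVELS = {"A1", "A2", "B1", "B2", "C1", "C2"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for i, msg in enumerate(record.get("messages", [])):
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content is not a string")
    metadata = record.get("metadata", {})
    missing = REQUIRED_METADATA - set(metadata)
    if missing:
        problems.append(f"missing metadata keys: {sorted(missing)}")
    if metadata.get("cefr") not in CEFR_LEVELS:
        problems.append(f"unexpected CEFR level {metadata.get('cefr')!r}")
    return problems


# Shortened version of Example 1 from this card (system prompt omitted).
sample = {
    "messages": [
        {"role": "assistant", "content": 'How do you say "long" in German?'},
        {"role": "user", "content": "lang"},
    ],
    "metadata": {
        "schema_version": "kr_vocab_synth_v1", "cefr": "A2", "trainer": "A1-A2",
        "mode": "audio", "stage": 1, "exercise_type": "T1_direct_translation",
        "target_word": "lang", "pos": "adj", "lexical_error_count": 0,
        "grammar_error_present": False, "meaning_demonstrated": True,
    },
}
print(validate_record(sample))
```

Extra keys such as `noise_stripped`, which appears in the sample records, are accepted because the check only looks for missing keys.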
### Exercise Types
The dataset includes the following exercise types:
1. **T1_direct_translation**: Translate English phrase to German
2. **T2_sentence_completion_block**: Fill in blanks with correct words (consolidation after 3 words)
3. **T3_situational_recall**: Choose appropriate word for given situation
4. **T4_reverse_translation**: Translate German word to English
5. **T5_sentence_production**: Create sentence using target word
6. **MN_noun_module**: Noun-specific translation with article
7. **MV_verb_module**: Verb-specific translation with conjugation
8. **MA_semantic_cluster**: Choose between semantically similar words
9. **review_block**: Mixed review of 5-7 previously learned words
## Usage
### Loading the Dataset
```python
from datasets import load_dataset
# Load all available splits (repository ID taken from this card's download link)
dataset = load_dataset('koalareads/kr-vocab-synth-20260327-v1')
# Load only the training split
train_data = load_dataset('koalareads/kr-vocab-synth-20260327-v1', split='train')
```
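
Once loaded, records can be narrowed down by their metadata, for example to build a beginner-only subset. The sketch below uses plain Python lists so that it runs without downloading anything; `has_level` is a hypothetical helper, and it assumes each record exposes `metadata` as a nested dict, as in the sample records.

```python
from typing import Any


def has_level(record: dict[str, Any], levels: tuple[str, ...] = ("A1", "A2")) -> bool:
    """Return True when the record's CEFR level is one of `levels`."""
    return record["metadata"]["cefr"] in levels


# Toy records carrying only the fields this example reads.
records = [
    {"metadata": {"cefr": "A2", "target_word": "lang"}},
    {"metadata": {"cefr": "B2", "target_word": "elegant"}},
    {"metadata": {"cefr": "B2", "target_word": "Zahlungsweise"}},
]

beginner = [r for r in records if has_level(r)]
upper = [r for r in records if has_level(r, levels=("B1", "B2", "C1", "C2"))]
print(len(beginner), len(upper))

# With the `datasets` library, the same predicate could be passed to
# Dataset.filter, e.g. train_data.filter(has_level), assuming `metadata`
# is exposed as a nested dict on each example.
```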
### Example Use Case
This dataset is designed for fine-tuning small language models to act as vocabulary trainers.
The conversational format allows the model to learn:
- How to give immediate, structured feedback
- When to correct lexical vs. grammar errors
- How to reinforce vocabulary with examples
- How to adapt difficulty based on CEFR levels
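
One common way to prepare chat data like this for supervised fine-tuning is to treat the final assistant turn (the feedback) as the target and everything before it as context. The sketch below illustrates that split under this assumption; `to_training_pair` is a hypothetical helper, and the card does not say how the authors actually train on the data.

```python
from typing import Any


def to_training_pair(record: dict[str, Any]) -> tuple[list[dict[str, str]], str]:
    """Split a record into (context messages, final assistant reply)."""
    messages = record["messages"]
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("expected the conversation to end with an assistant message")
    return messages[:-1], messages[-1]["content"]


# Shortened version of Example 2 from this card (system prompt abbreviated).
record = {
    "messages": [
        {"role": "system", "content": "You are a vocabulary trainer."},
        {"role": "assistant", "content": 'How do you say "elegant" in German?'},
        {"role": "user", "content": "I don't know."},
        {"role": "assistant", "content": '❌ Not quite. The correct word is "elegant".'},
    ]
}

context, target = to_training_pair(record)
print(len(context), "context messages")
print("target:", target)
```

Keeping the system prompt and the learner's answer in the context means the model is trained to produce feedback conditioned on both.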
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{koalareads_vocab_synth,
title = {KoalaReads German Vocabulary Trainer - Synthetic Conversations},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/datasets/koalareads/kr-vocab-synth-20260327-v1}}
}
```
## License
MIT License - Free to use for educational and research purposes.
Provider:
koalareads



