alexliap/tinystories-gr
Hugging Face · Updated 2026-03-14 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/alexliap/tinystories-gr
Description:
---
language:
- el
- en
license: cdla-sharing-1.0
task_categories:
- translation
- text-generation
task_ids:
- language-modeling
pretty_name: TinyStories-GR
size_categories:
- 1M<n<10M
source_datasets:
- roneneldan/TinyStories
tags:
- greek
- children-stories
- translation
- synthetic
---
# TinyStories-GR
A full Modern Greek translation of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset (~2.1 million short English children's stories), with AI-generated quality scores for each translation.
## Dataset Description
TinyStories-GR was generated by running the entire TinyStories corpus through a two-stage AI pipeline:
1. **Translation** — each English story was translated to Modern Greek by Google Gemini (`gemini-3.1-flash-lite-preview`)
2. **Evaluation** — each translation was independently scored (1–5) by OpenAI GPT-4o-mini
The pipeline is fully open-source and available at:
**[https://github.com/alexliap/tinystories-gr](https://github.com/alexliap/tinystories-gr)**
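The two stages above can be sketched as a single per-story function. This is a minimal illustration, not the repository's actual code: the name `process_story`, the `translate` and `evaluate` callables, the retry limit, and the choice to leave the score null when grading fails are all assumptions. Only the output field names are taken from this card's schema.

```python
from datetime import datetime, timezone
from typing import Callable, Optional, Tuple


def process_story(
    row_id: int,
    original_text: str,
    translate: Callable[[str], str],
    evaluate: Callable[[str, str], Tuple[int, str]],
    max_attempts: int = 3,  # assumed retry limit, not stated in the card
) -> dict:
    """Translate one story, then score it; retry the translation on failure."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            greek = translate(original_text)
            break
        except Exception as exc:  # a real pipeline would catch narrower errors
            last_error = exc
    else:
        raise RuntimeError(
            f"translation failed after {max_attempts} attempts"
        ) from last_error

    # Assumption: a failed grading call leaves the nullable columns empty.
    try:
        score, reasoning = evaluate(original_text, greek)
    except Exception:
        score, reasoning = None, None

    return {
        "row_id": row_id,
        "original_text": original_text,
        "greek_translation": greek,
        "evaluation_score": score,
        "evaluation_reasoning": reasoning,
        "processing_timestamp": datetime.now(timezone.utc),
        "processing_attempts": attempt,
    }


# Stub callables stand in for the Gemini and GPT-4o-mini API calls.
calls = {"n": 0}


def flaky_translate(text: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated transient failure")
    return "Γεια σου, κόσμε."


def stub_evaluate(source: str, translation: str) -> Tuple[int, str]:
    return 5, "Accurate and fluent."


record = process_story(0, "Hello, world.", flaky_translate, stub_evaluate)
print(record["processing_attempts"], record["evaluation_score"])
```

Passing the model calls in as plain callables keeps the sketch independent of any particular client library, so either stage can be swapped out or stubbed in tests.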
### Dataset Summary
| Property | Value |
|----------|-------|
| Source dataset | roneneldan/TinyStories |
| Source language | English |
| Target language | Modern Greek (el) |
| Stories translated | ~2.1 million |
| Translation model | Google Gemini `gemini-3.1-flash-lite-preview` |
| Evaluation model | OpenAI `gpt-4o-mini` |
| Evaluation scale | 1–5 (5 = excellent) |
## Schema
| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| `row_id` | int64 | No | Unique row identifier (matches source row index) |
| `original_text` | string | No | English source story |
| `greek_translation` | string | No | Modern Greek translation |
| `evaluation_score` | int8 | Yes | Translation quality score (1–5); null if ungraded |
| `evaluation_reasoning` | string | Yes | Evaluator's explanation; null if ungraded |
| `processing_timestamp` | datetime | No | UTC time the row was completed |
| `processing_attempts` | int32 | No | Number of API attempts before success |
| `source_file` | string | No | Source parquet shard filename |
| `source_row_index` | int64 | No | Row index within the source shard |
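Because `evaluation_score` is nullable, any quality filter has to decide what to do with ungraded rows. The helper below is a minimal sketch of one way to handle that. The function name `filter_by_score`, the `keep_ungraded` flag, the default threshold, and the sample rows are illustrative assumptions; the column names come from the schema above.

```python
from typing import Any, Iterable, Iterator, Mapping


def filter_by_score(
    rows: Iterable[Mapping[str, Any]],
    min_score: int = 4,
    keep_ungraded: bool = False,
) -> Iterator[Mapping[str, Any]]:
    """Yield rows whose evaluation_score is at least min_score.

    Rows with a null score (ungraded) are yielded only if keep_ungraded is True.
    """
    for row in rows:
        score = row.get("evaluation_score")
        if score is None:
            if keep_ungraded:
                yield row
        elif score >= min_score:
            yield row


# Hypothetical sample rows using the column names from the schema.
sample_rows = [
    {"row_id": 0, "greek_translation": "Μια φορά κι έναν καιρό...", "evaluation_score": 5},
    {"row_id": 1, "greek_translation": "Ο σκύλος έτρεξε.", "evaluation_score": 3},
    {"row_id": 2, "greek_translation": "Η γάτα κοιμήθηκε.", "evaluation_score": None},
]

kept_ids = [row["row_id"] for row in filter_by_score(sample_rows, min_score=4)]
print(kept_ids)
```

With the Hugging Face `datasets` library, the same helper should also accept a loaded split, for example `filter_by_score(load_dataset("alexliap/tinystories-gr", split="train"))`. The split name `train` is an assumption here.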
## Translation Prompt
Each story was translated with the instruction:
```
Translate the following English children's story to Modern Greek. Return only the translation, no commentary.
```
## Evaluation Prompt
Each translated story was evaluated with a structured prompt asking the model to:
- Compare the Greek translation to the English original
- Score the translation 1–5 (accuracy, fluency, preservation of meaning)
- Provide a brief reasoning for the score
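This card does not give the evaluator's exact output format. Purely as an illustration, the sketch below assumes the model replies with a JSON object containing `score` and `reasoning` keys, and shows how such a reply could be validated before being written to `evaluation_score` and `evaluation_reasoning`. The function name `parse_evaluation` and the JSON shape are hypothetical.

```python
import json
from typing import Optional, Tuple


def parse_evaluation(raw: str) -> Tuple[Optional[int], Optional[str]]:
    """Return (score, reasoning), or (None, None) if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, None
    if not isinstance(data, dict):
        return None, None

    score = data.get("score")
    reasoning = data.get("reasoning")

    # Reject booleans (a subclass of int), non-integers, and out-of-range values.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return None, None
    if not isinstance(reasoning, str):
        return None, None
    return score, reasoning.strip()


print(parse_evaluation('{"score": 4, "reasoning": " Minor tense error. "}'))
print(parse_evaluation('{"score": 7, "reasoning": "Out of range."}'))
```

Returning `(None, None)` for any malformed reply lines up with the schema, where both evaluation columns are null for ungraded rows.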
## Intended Uses
- Training and fine-tuning Greek language models
- Low-resource language research (Greek is moderately under-resourced for children's story corpora)
- Studying the translation capabilities of LLMs on simple narrative text
## Limitations
- Translations are AI-generated and not human-reviewed
- Evaluation scores are also AI-generated (GPT-4o-mini) and may not perfectly reflect human judgment
## License
This dataset is a derivative of [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories), which is released under [CDLA-Sharing 1.0](https://cdla.dev/sharing-1-0/). As a share-alike license, CDLA-Sharing 1.0 requires that any derivative dataset be shared under the same terms. Accordingly, TinyStories-GR is released under **CDLA-Sharing 1.0**.
## Citation
If you use this dataset, please cite the original TinyStories paper and acknowledge this dataset:
```bibtex
@article{eldan2023tinystories,
title={TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
author={Eldan, Ronen and Li, Yuanzhi},
journal={arXiv preprint arXiv:2305.07759},
year={2023}
}
```
And reference the generation pipeline:
> TinyStories-GR dataset generated using https://github.com/alexliap/tinystories-gr
Provider:
alexliap



