
alexliap/tinystories-gr

Source: Hugging Face. Updated 2026-03-14. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/alexliap/tinystories-gr
Resource description:
---
language:
- el
- en
license: cdla-sharing-1.0
task_categories:
- translation
- text-generation
task_ids:
- language-modeling
pretty_name: TinyStories-GR
size_categories:
- 1M<n<10M
source_datasets:
- roneneldan/TinyStories
tags:
- greek
- children-stories
- translation
- synthetic
---

# TinyStories-GR

A full Modern Greek translation of the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset (~2.1 million short English children's stories), with AI-generated quality scores for each translation.

## Dataset Description

TinyStories-GR was generated by running the entire TinyStories corpus through a two-stage AI pipeline:

1. **Translation**: each English story was translated to Modern Greek by Google Gemini (`gemini-3.1-flash-lite-preview`)
2. **Evaluation**: each translation was independently scored (1–5) by OpenAI GPT-4o-mini

The pipeline is fully open-source and available at: **[https://github.com/alexliap/tinystories-gr](https://github.com/alexliap/tinystories-gr)**

### Dataset Summary

| Property | Value |
|----------|-------|
| Source dataset | roneneldan/TinyStories |
| Source language | English |
| Target language | Modern Greek (el) |
| Stories translated | ~2.1 million |
| Translation model | Google Gemini `gemini-3.1-flash-lite-preview` |
| Evaluation model | OpenAI `gpt-4o-mini` |
| Evaluation scale | 1–5 (5 = excellent) |

## Schema

| Column | Type | Nullable | Description |
|--------|------|----------|-------------|
| `row_id` | int64 | No | Unique row identifier (matches source row index) |
| `original_text` | string | No | English source story |
| `greek_translation` | string | No | Modern Greek translation |
| `evaluation_score` | int8 | Yes | Translation quality score (1–5); null if ungraded |
| `evaluation_reasoning` | string | Yes | Evaluator's explanation; null if ungraded |
| `processing_timestamp` | datetime | No | UTC time the row was completed |
| `processing_attempts` | int32 | No | Number of API attempts before success |
| `source_file` | string | No | Source parquet shard filename |
| `source_row_index` | int64 | No | Row index within the source shard |

## Translation Prompt

Each story was translated with the instruction:

```
Translate the following English children's story to Modern Greek. Return only the translation, no commentary.
```

## Evaluation Prompt

Each translated story was evaluated with a structured prompt asking the model to:

- Compare the Greek translation to the English original
- Score the translation 1–5 (accuracy, fluency, preservation of meaning)
- Provide a brief reasoning for the score

## Intended Uses

- Training and fine-tuning Greek language models
- Low-resource language research (Greek is moderately under-resourced for children's story corpora)
- Studying the translation capabilities of LLMs on simple narrative text

## Limitations

- Translations are AI-generated and not human-reviewed
- Evaluation scores are also AI-generated (GPT-4o-mini) and may not perfectly reflect human judgment

## License

This dataset is a derivative of [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories), which is released under [CDLA-Sharing 1.0](https://cdla.dev/sharing-1-0/). As a share-alike license, CDLA-Sharing 1.0 requires that any derivative dataset be shared under the same terms. Accordingly, TinyStories-GR is released under **CDLA-Sharing 1.0**.

## Citation

If you use this dataset, please cite the original TinyStories paper and acknowledge this dataset:

```bibtex
@article{eldan2023tinystories,
  title={TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
  author={Eldan, Ronen and Li, Yuanzhi},
  journal={arXiv preprint arXiv:2305.07759},
  year={2023}
}
```

And reference the generation pipeline:

> TinyStories-GR dataset generated using https://github.com/alexliap/tinystories-gr
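
## Example: Filtering by Evaluation Score

Because `evaluation_score` is nullable and AI-generated, a common first step is probably to keep only rows above some quality threshold before training. The following is a minimal sketch of that idea, not part of the published pipeline. The helper names (`filter_by_score`, `to_parallel_pairs`, `stream_rows`), the threshold of 4, and the `train` split name are assumptions made for illustration; only the column names come from the schema above.

```python
from typing import Any, Iterable, Iterator, Mapping, Tuple

Row = Mapping[str, Any]


def filter_by_score(
    rows: Iterable[Row], min_score: int = 4, keep_ungraded: bool = False
) -> Iterator[Row]:
    """Yield rows whose evaluation_score is at least min_score.

    Rows with a null score are dropped unless keep_ungraded is True.
    The default threshold of 4 is an arbitrary choice for this sketch.
    """
    for row in rows:
        score = row.get("evaluation_score")
        if score is None:
            if keep_ungraded:
                yield row
            continue
        if score >= min_score:
            yield row


def to_parallel_pairs(rows: Iterable[Row]) -> Iterator[Tuple[str, str]]:
    """Yield (English, Greek) text pairs from rows."""
    for row in rows:
        yield row["original_text"], row["greek_translation"]


def stream_rows(split: str = "train") -> Iterable[Row]:
    """Stream rows from the Hub without downloading the full dataset.

    Assumes the `datasets` library is installed and that the dataset
    exposes a split with this name; the split name is an assumption.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    return load_dataset("alexliap/tinystories-gr", split=split, streaming=True)
```

A possible use, assuming network access, would be `pairs = to_parallel_pairs(filter_by_score(stream_rows()))`, which lazily yields English and Greek pairs for rows scored 4 or 5. Since the scores come from GPT-4o-mini rather than human reviewers, any threshold should be treated as a heuristic.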
Provider:
alexliap