
neuralnets/multilingual-tinystories

Source: Hugging Face. Updated 2026-03-15; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/neuralnets/multilingual-tinystories
Description:
---
language:
- as
- doi
- gom
- gu
- kn
- mai
- ml
- mni
- ne
- or
- pa
- sa
- sat
- sd
- ta
- te
- ur
multilinguality:
- multilingual
size_categories:
- 10K<n<100K
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
pretty_name: Multilingual TinyStories
tags:
- stories
- children
- indian-languages
- indic
- low-resource
license: cc-by-4.0
---

# Multilingual TinyStories Dataset

A collection of children's stories in multiple Indian languages, generated for language model training.

## Dataset Details

### Currently Available Languages

This dataset contains stories in 17 Indic languages:

- **Assamese (`as`)**: 4,875 stories, 3,088,287 tokens
- **Dogri (`doi`)**: 4,924 stories, 2,556,071 tokens
- **GOM (`gom`)**: 4,879 stories, 2,437,488 tokens
- **Gujarati (`gu`)**: 12,856 stories, 9,858,511 tokens
- **Kannada (`kn`)**: 11,644 stories, 9,890,334 tokens
- **Maithili (`mai`)**: 4,872 stories, 2,363,974 tokens
- **Malayalam (`ml`)**: 11,216 stories, 9,742,815 tokens
- **Manipuri (`mni`)**: 4,870 stories, 71,024 tokens
- **Nepali (`ne`)**: 4,863 stories, 2,309,707 tokens
- **Odia (`or`)**: 13,004 stories, 9,506,384 tokens
- **Punjabi (`pa`)**: 13,144 stories, 9,669,977 tokens
- **Sanskrit (`sa`)**: 4,873 stories, 2,605,271 tokens
- **Santali (`sat`)**: 4,883 stories, 6,555,546 tokens
- **Sindhi (`sd`)**: 4,881 stories, 2,029,536 tokens
- **Tamil (`ta`)**: 12,860 stories, 9,840,128 tokens
- **Telugu (`te`)**: 10,924 stories, 9,865,743 tokens
- **Urdu (`ur`)**: 3,374 stories, 1,519,067 tokens

**Total stories**: 132,942
**Total tokens**: 93,909,863

> **Note**: Bengali, Marathi, and Hindi are excluded as they already have extensive resources available via the Regional TinyStories by Vizuara.

### Dataset Structure

The dataset is organized by language splits. Each split contains stories in that specific language.
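Because each split is addressed by its language code, it can help to validate a code before requesting a split. The following is a minimal sketch, assuming the split names match the language codes listed above; `LANGUAGE_SPLITS` and `validate_split` are hypothetical names introduced for illustration and are not part of the dataset or the `datasets` library. The story counts are copied from the list above.

```python
# Language codes and story counts copied from the dataset card's list.
# Assumption: each code doubles as the split name for this dataset.
LANGUAGE_SPLITS = {
    "as": ("Assamese", 4875),
    "doi": ("Dogri", 4924),
    "gom": ("GOM", 4879),
    "gu": ("Gujarati", 12856),
    "kn": ("Kannada", 11644),
    "mai": ("Maithili", 4872),
    "ml": ("Malayalam", 11216),
    "mni": ("Manipuri", 4870),
    "ne": ("Nepali", 4863),
    "or": ("Odia", 13004),
    "pa": ("Punjabi", 13144),
    "sa": ("Sanskrit", 4873),
    "sat": ("Santali", 4883),
    "sd": ("Sindhi", 4881),
    "ta": ("Tamil", 12860),
    "te": ("Telugu", 10924),
    "ur": ("Urdu", 3374),
}


def validate_split(code: str) -> str:
    """Return the code unchanged if it names a listed split, else raise."""
    if code not in LANGUAGE_SPLITS:
        known = ", ".join(sorted(LANGUAGE_SPLITS))
        raise ValueError(f"Unknown split {code!r}; expected one of: {known}")
    return code


# Cross-check the per-language counts against the stated total.
total_stories = sum(count for _, count in LANGUAGE_SPLITS.values())
print(len(LANGUAGE_SPLITS), total_stories)
```

A validated code can then be passed as the `split` argument when loading; a code such as `hi` (Hindi, which the card excludes) raises a `ValueError` instead of failing later during download.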
```python
from datasets import load_dataset

# Load all languages
dataset = load_dataset("neuralnets/multilingual-tinystories")

# Load specific language
dataset = load_dataset("neuralnets/multilingual-tinystories", split="gu")  # Gujarati
```

### Data Fields

* `text`: The story text in the respective language (native script)
* `index`: Unique identifier for each story in format `{lang_code}_{number}` (e.g., `gu_00001`, `kn_00523`)

### Usage Example

```python
from datasets import load_dataset

# Load Gujarati stories
gujarati_stories = load_dataset("neuralnets/multilingual-tinystories", split="gu")

# Print first story
print(gujarati_stories[0]["text"])
print(f"Index: {gujarati_stories[0]['index']}")  # e.g. "Index: gu_00000"

# Load all languages
all_stories = load_dataset("neuralnets/multilingual-tinystories")
print(all_stories.keys())  # dict_keys(['gu', 'kn', 'ml', ...])

# Filter by language using index
gujarati_only = [story for story in gujarati_stories if story['index'].startswith('gu_')]
```

### Current Statistics

| Code | Language | Stories | Tokens | Status |
| --- | --- | --- | --- | --- |
| `as` | Assamese | 4,875 | 3,088,287 | ✅ Available |
| `doi` | Dogri | 4,924 | 2,556,071 | ✅ Available |
| `gom` | GOM | 4,879 | 2,437,488 | ✅ Available |
| `gu` | Gujarati | 12,856 | 9,858,511 | ✅ Available |
| `kn` | Kannada | 11,644 | 9,890,334 | ✅ Available |
| `mai` | Maithili | 4,872 | 2,363,974 | ✅ Available |
| `ml` | Malayalam | 11,216 | 9,742,815 | ✅ Available |
| `mni` | Manipuri | 4,870 | 71,024 | ✅ Available |
| `ne` | Nepali | 4,863 | 2,309,707 | ✅ Available |
| `or` | Odia | 13,004 | 9,506,384 | ✅ Available |
| `pa` | Punjabi | 13,144 | 9,669,977 | ✅ Available |
| `sa` | Sanskrit | 4,873 | 2,605,271 | ✅ Available |
| `sat` | Santali | 4,883 | 6,555,546 | ✅ Available |
| `sd` | Sindhi | 4,881 | 2,029,536 | ✅ Available |
| `ta` | Tamil | 12,860 | 9,840,128 | ✅ Available |
| `te` | Telugu | 10,924 | 9,865,743 | ✅ Available |
| `ur` | Urdu | 3,374 | 1,519,067 | ✅ Available |

## Dataset Creation

This dataset was created using language models to generate simple children's stories in various Indian languages, suitable for training small language models.

### Curation Process

1. **Generation**: Stories generated using Sarvam AI models
2. **Cleaning**: Removed emojis, English words, and formatting artifacts
3. **Native Scripts**: All stories are in their native scripts (Gujarati, Kannada, Malayalam, Devanagari, etc.)
4. **Quality**: Each story is a complete, coherent narrative suitable for children

### Index Format

Each story has a unique index in the format `{language_code}_{number:05d}`:

* `gu_00000` - First Gujarati story
* `kn_01234` - 1235th Kannada story
* `ml_00099` - 100th Malayalam story

This format allows easy identification and filtering by language.

## Use Cases

* Training small language models for Indian languages
* Multilingual language model research
* Cross-lingual transfer learning
* Educational applications
* Low-resource language modeling

## Limitations

* Stories are generated, not human-written
* May contain cultural or linguistic inaccuracies
* Not reviewed by native speakers
* Limited to simple children's story vocabulary

## Licensing

Please check individual language regulations and usage rights for your specific use case.

## Updates

Follow [@neuralnets](https://huggingface.co/neuralnets) for updates on this dataset and future projects.

## Citation

```bibtex
@dataset{multilingual_tinystories_2026,
  title={Multilingual TinyStories: Indic Language Stories Dataset},
  author={NeuralNets},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/neuralnets/multilingual-tinystories}}
}
```

## Contact

For questions, issues, or contributions, please open an issue on the dataset repository.
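As a supplement to the Index Format section, the index strings can be split back into a language code and a zero-based story number. This is a minimal sketch, assuming every index follows the documented pattern of a lowercase code, an underscore, and a zero-padded number of at least five digits; `parse_index` and `format_index` are hypothetical helper names, not part of the dataset or any library.

```python
import re
from typing import Tuple

# Assumed pattern: 2-3 lowercase letters, underscore, zero-padded number.
INDEX_PATTERN = re.compile(r"^([a-z]{2,3})_(\d{5,})$")


def parse_index(index: str) -> Tuple[str, int]:
    """Split an index such as 'kn_01234' into ('kn', 1234)."""
    match = INDEX_PATTERN.match(index)
    if match is None:
        raise ValueError(f"Index {index!r} does not match the documented format")
    return match.group(1), int(match.group(2))


def format_index(lang_code: str, number: int) -> str:
    """Build an index string from a language code and a zero-based number."""
    return f"{lang_code}_{number:05d}"


lang, number = parse_index("kn_01234")
# Numbers are zero-based, so story 1234 is the 1235th Kannada story.
print(lang, number, format_index("gu", 0))
```

Parsing the index rather than relying on `startswith` also rejects malformed identifiers, which can be useful when records from several splits are concatenated into one list.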
Provider:
neuralnets