
Mindigenous/MINDI-1.5-training-data

Hugging Face · Updated 2026-04-15 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/Mindigenous/MINDI-1.5-training-data
Description:
---
license: apache-2.0
language:
- en
tags:
- code-generation
- nextjs
- react
- typescript
- vision
- multimodal
- mindi
- mindigenous
size_categories:
- 1M<n<10M
---

# MINDI 1.5 Training Data

Training dataset for **MINDI 1.5 Vision-Coder** by MINDIGENOUS.AI

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total examples | 1,449,428 |
| Total tokens | 859,694,776 |
| Avg tokens/example | 593 |
| Avg quality score | 6.49 |
| Sources | 9 |

## Splits

| Split | Examples | Percentage |
|-------|----------|------------|
| Train | 1,304,486 | 90.0% |
| Validation | 72,471 | 5.0% |
| Test | 72,471 | 5.0% |

## Sources

| Source | Examples | Kept % |
|--------|----------|--------|
| starcoderdata | 569,350 | 94.9% |
| websight | 250,987 | 99.99% |
| evol_code | 155,998 | 99.7% |
| codefeedback | 149,865 | 99.9% |
| magicoder | 149,987 | 99.99% |
| synthetic_nextjs | 90,000 | 100% (protected) |
| codealpaca | 59,241 | 98.8% |
| search_examples | 15,000 | 100% (protected) |
| sandbox_examples | 9,000 | 100% (protected) |

## Type Distribution

| Type | Examples |
|------|----------|
| code_generation | 1,183,441 |
| vision_code | 250,987 |
| search | 15,000 |

## Language Distribution

| Language | Examples |
|----------|----------|
| unknown | 490,305 |
| typescript | 375,859 |
| javascript | 298,497 |
| python | 211,842 |
| html | 36,371 |
| java | 32,458 |
| rust | 3,709 |
| go | 387 |

## Format

Each example is a JSON object with:

- `conversations`: list of `{"role": ..., "content": ...}` turns
- `source`: dataset origin
- `type`: code_generation / vision_code / search
- `language`: programming language
- `quality_score`: heuristic quality (0-10+)
- `token_count`: number of tokens

## Quality Filtering

- Protected sources (sandbox, search, synthetic_nextjs) bypass aggressive filters
- MINDI special token bonuses boost agentic examples
- Dedup via SHA-256 content hashing
- Rejection reasons: too_many_tokens (30,637), boilerplate (1,373), duplicate (59)

## Built By

Faaz - MINDIGENOUS.AI, Mumbai, India, April 2026
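
The record layout from the Format section and the SHA-256 deduplication mentioned under Quality Filtering can be illustrated with a minimal Python sketch. The field names and the three `type` values come from the dataset card; everything else is an assumption made for illustration. In particular, the card does not say which part of a record is hashed, so the sketch hashes only the `conversations` turns, and the function names `validate_example`, `content_hash`, and `deduplicate` are hypothetical rather than part of any MINDI tooling.

```python
import hashlib
import json

# Field names and type values as listed in the dataset card.
REQUIRED_FIELDS = (
    "conversations", "source", "type",
    "language", "quality_score", "token_count",
)
ALLOWED_TYPES = {"code_generation", "vision_code", "search"}


def validate_example(example: dict) -> list:
    """Return a list of problems in one record (empty list means none found)."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in example]
    if problems:
        return problems
    if example["type"] not in ALLOWED_TYPES:
        problems.append(f"unexpected type: {example['type']}")
    for i, turn in enumerate(example["conversations"]):
        # Each turn should carry at least "role" and "content".
        if not isinstance(turn, dict) or not {"role", "content"} <= set(turn):
            problems.append(f"turn {i} is not a role/content object")
    return problems


def content_hash(example: dict) -> str:
    """SHA-256 hex digest of the conversation turns.

    Assumption: only `conversations` is hashed. The actual pipeline may
    hash a different slice of the record.
    """
    canonical = json.dumps(example["conversations"],
                           sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(examples):
    """Keep the first record for each distinct content hash, in order."""
    seen = set()
    unique = []
    for example in examples:
        digest = content_hash(example)
        if digest not in seen:
            seen.add(digest)
            unique.append(example)
    return unique
```

Under this assumption, two records with identical turns but different `source` values count as duplicates, which fits the idea of content hashing across the nine sources. Hashing the full record instead would treat them as distinct.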
Provider:
Mindigenous