Mindigenous/MINDI-1.5-training-data
收藏Hugging Face2026-04-15 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/Mindigenous/MINDI-1.5-training-data
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
language:
- en
tags:
- code-generation
- nextjs
- react
- typescript
- vision
- multimodal
- mindi
- mindigenous
size_categories:
- 1M<n<10M
---
# MINDI 1.5 Training Data
Training dataset for **MINDI 1.5 Vision-Coder** by MINDIGENOUS.AI
## Dataset Statistics
| Metric | Value |
|--------|-------|
| Total examples | 1,449,428 |
| Total tokens | 859,694,776 |
| Avg tokens/example | 593 |
| Avg quality score | 6.49 |
| Sources | 9 |
## Splits
| Split | Examples | Percentage |
|-------|----------|------------|
| Train | 1,304,486 | 90.0% |
| Validation | 72,471 | 5.0% |
| Test | 72,471 | 5.0% |
## Sources
| Source | Examples | Kept % |
|--------|----------|--------|
| starcoderdata | 569,350 | 94.9% |
| websight | 250,987 | 99.99% |
| evol_code | 155,998 | 99.7% |
| codefeedback | 149,865 | 99.9% |
| magicoder | 149,987 | 99.99% |
| synthetic_nextjs | 90,000 | 100% (protected) |
| codealpaca | 59,241 | 98.8% |
| search_examples | 15,000 | 100% (protected) |
| sandbox_examples | 9,000 | 100% (protected) |
## Type Distribution
| Type | Examples |
|------|----------|
| code_generation | 1,183,441 |
| vision_code | 250,987 |
| search | 15,000 |
## Language Distribution
| Language | Examples |
|----------|----------|
| unknown | 490,305 |
| typescript | 375,859 |
| javascript | 298,497 |
| python | 211,842 |
| html | 36,371 |
| java | 32,458 |
| rust | 3,709 |
| go | 387 |
## Format
Each example is a JSON object with:
- `conversations`: list of `{"role": ..., "content": ...}` turns
- `source`: dataset origin
- `type`: code_generation / vision_code / search
- `language`: programming language
- `quality_score`: heuristic quality (0-10+)
- `token_count`: number of tokens
## Quality Filtering
- Protected sources (sandbox, search, synthetic_nextjs) bypass aggressive filters
- MINDI special token bonuses boost agentic examples
- Dedup via SHA-256 content hashing
- Rejection reasons: too_many_tokens (30,637), boilerplate (1,373), duplicate (59)
## Built By
Faaz - MINDIGENOUS.AI
Mumbai, India — April 2026
提供机构:
Mindigenous



