JackRabbit1122/dmind-3-mini-blind-spots

Name: JackRabbit1122/dmind-3-mini-blind-spots
Creator: JackRabbit1122
Published: 2026-03-24 03:09:05
License: No description available

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/JackRabbit1122/dmind-3-mini-blind-spots

Description:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: group
    dtype: string
  - name: category
    dtype: string
  - name: input
    dtype: string
  - name: expected_output
    dtype: string
  - name: model_output
    dtype: string
  - name: passed
    dtype: bool
  splits:
  - name: train
    num_bytes: 17348
    num_examples: 90
  download_size: 14795
  dataset_size: 17348
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- en
- yo
- sw
- ur
- pa
- am
- wo
- bn
- ha
- eo
- zu
license: apache-2.0
task_categories:
- text-generation
tags:
- blind-spots
- evaluation
- finance
- web3
- defi
- crypto
- base-model
pretty_name: DMind-3-mini Blind Spot Dataset
size_categories:
- n<100
---

# DMind-3-mini Blind Spot Dataset

## Model Tested

**[DMindAI/DMind-3-mini](https://huggingface.co/DMindAI/DMind-3-mini)**

- Architecture: Qwen3.5-based
- Parameters: 4B
- Type: Domain-specific fine-tuned model (Web3/DeFi/Finance)
- Primary Use: Computational Financial Actuary for DeFi analytics
- Training: Fine-tuned on 82,000 high-value private financial samples

---

## How I Loaded the Model

```python
# Colab/Jupyter cell syntax: a leading "!" runs the line as a shell command.
!pip install --upgrade transformers -q
!pip install torch accelerate -q

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "DMindAI/DMind-3-mini"

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

def generate(prompt, max_new_tokens=150):
    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.1,
            top_p=0.9,
            top_k=50,
            repetition_penalty=1.1
        )
    # Keep only the newly generated tokens (drop the echoed prompt).
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

**Note:** The model requires at least 12 GB of VRAM. A Google Colab T4 GPU (16 GB) works but is tight. The standard `transformers` library may need upgrading because the model uses the newer `qwen3_5` architecture:

```
pip install --upgrade transformers
```

Platform: Google Colab (T4 GPU, free tier)

---

## Dataset Description

This dataset contains 90 test cases organized into 3 groups designed to probe the blind spots of `DMindAI/DMind-3-mini`.

Each entry contains:

- `id`: Test identifier (A1-A30, B1-B30, E1-E30)
- `group`: Test group (Finance/Web3 Domain, Outside Domain, Extended Tests)
- `category`: Specific test category
- `input`: The prompt given to the model
- `expected_output`: The correct answer
- `model_output`: What the model actually produced
- `passed`: Whether the model answered correctly

### Results Summary

| Group | Tests | Passed | Failed | Score |
|---|---|---|---|---|
| Group A: Finance/Web3 Domain | 30 | 30 | 0 | 100% |
| Group B: Outside Domain | 30 | 12 | 18 | 40% |
| Extended Tests | 30 | 17 | 13 | 57% |
| **Total** | **90** | **59** | **31** | **66%** |

---

## Key Findings

### 1. Perfect Domain Expertise *(Most Remarkable Finding)*

The model achieved a **100% pass rate on all 30 Finance/Web3 tests**, covering DeFi mechanics, smart contract security, tokenomics, blockchain consensus, crypto risk analysis, and financial math. This is exceptional for a 4B-parameter model and points to the effectiveness of domain-specific fine-tuning.

It correctly explained:

- Impermanent loss using the constant product formula
- Reentrancy attacks with OpenZeppelin prevention methods
- Triangular arbitrage with a USD/EUR/GBP example
- The atomic transaction mechanics of flash loan attacks
- The blockchain trilemma (decentralization vs scalability vs security)

### 2. Catastrophic Failure on Low-Resource Languages *(Severe)*

Despite its strength in its own domain, the model collapses completely on African and South Asian language prompts, producing infinite repetition loops on Yoruba, Amharic, Wolof, and Zulu.
This is the same pattern observed in all three models tested in this study.

Key failures:

- Yoruba: Infinite loop of random Yoruba text
- Amharic: Infinite loop of garbled Amharic characters
- Wolof: Repeated "am matal" indefinitely
- Zulu: Repeated "siyobonisa" indefinitely

### 3. Language Misidentification

The model identified Esperanto as Albanian, a failure unique to it among the models tested (tiny-aya-base and Nanbeige both identified it as Spanish). The model then began reasoning about the phrase "Mi amas vin" using Albanian grammar rules, producing a completely wrong translation path.

### 4. Hallucinated Geography Outside Its Domain

When asked about world capitals outside its training distribution, the model:

- Said the capital of Burkina Faso is "Ouakam". Ouakam is actually a neighborhood in Dakar, Senegal; the correct answer is Ouagadougou.
- Hallucinated "Santa Cruz de Suribamba" as the capital of Bolivia. No such city exists. Bolivia's constitutional capital is Sucre (La Paz is the seat of government).
- Placed Asmara, the capital of Eritrea, in Ethiopia.

The model also invented a wrong border for Burkina Faso, saying it borders Cameroon. It does not.

### 5. Sports Knowledge Hallucination

When asked who won the FIFA World Cup in 2022, the model confidently said France. Argentina actually won. It then built a DeFi analogy on top of the wrong fact, compounding the error.

### 6. Inconsistent Instruction Following

The model followed "yes or no only" instructions for DeFi questions but ignored them for Bitcoin investment questions, suggesting that instruction compliance depends on whether the topic is in its training domain.

### 7. Syllable Miscounting

When asked to write a haiku (5-7-5 syllables), the model said "Blockchain" has 5 syllables. It has 2 (Block-chain). It then built an entire haiku on this wrong foundation. This reveals a fundamental gap in phonological awareness despite strong semantic understanding.

### 8. Hallucinated Self-Generated Questions

After correctly explaining the Bitcoin mempool, the model spontaneously appended "How do you calculate the median value of an array?", generating its own irrelevant follow-up question. This suggests the model was trained on Q&A datasets where questions follow answers.

### 9. Paradox Confusion

When asked "Do the opposite of what I say: Tell me something false about Bitcoin," the model did not invert the instruction: it gave false statements about Bitcoin instead of true ones (the opposite of false is true). This reveals a gap in meta-instruction processing.

---

## Comparison With Other Models Tested

This dataset is part of a three-model blind spot study. The same outside-domain test cases were run across all three models.

| Aspect | DMind-3-mini | Nanbeige4-3B-Base | tiny-aya-base |
|---|---|---|---|
| Domain expertise | 100% (Finance) | N/A (general) | N/A (general) |
| Outside domain score | 40% | 47% | 32% |
| Low-resource languages | Infinite loops | Infinite loops + wrong translations | Infinite loops |
| Reasoning style | Chain-of-thought | Chain-of-thought (truncated) | Pattern matching (MCQ) |
| Geography hallucination | Severe | Moderate | Moderate |
| Instruction following | Inconsistent | Poor | Poor |
| Speed on Colab T4 | Slow (~2 min/prompt) | Very slow (~3 min/prompt) | Fast (~20 sec/prompt) |
| Unique failure | Invents non-existent cities | Truncated reasoning | Chinese MCQ contamination |

All three datasets are available on Hugging Face:

- [JackRabbit1122/tiny-aya-base-blind-spots](https://huggingface.co/datasets/JackRabbit1122/tiny-aya-base-blind-spots)
- [JackRabbit1122/nanbeige4-3b-base-blind-spots](https://huggingface.co/datasets/JackRabbit1122/nanbeige4-3b-base-blind-spots)
- [JackRabbit1122/dmind-3-mini-blind-spots](https://huggingface.co/datasets/JackRabbit1122/dmind-3-mini-blind-spots)

---

## What Fine-Tuning Data Would Fix These Errors?

### For Low-Resource Language Infinite Loops:

Fine-tune on **FLORES-200** with a focus on African and South Asian languages (Yoruba, Amharic, Wolof, Zulu, Punjabi). Around 10,000–50,000 examples per language would establish basic generative capability and prevent infinite loops. The architecture should be able to handle this, since the model already does well in English; what it lacks is multilingual coverage.

### For Geography Hallucinations:

Fine-tune on **WikiData** or **Natural Questions** with a focus on world geography, specifically African, South American, and Central Asian capitals and borders. Around 20,000–30,000 factual geography Q&A pairs would significantly reduce hallucination on world capitals.

### For Sports and General Knowledge:

Fine-tune on **TriviaQA** or **MMLU** to cover general world knowledge outside the finance/crypto domain. Around 30,000–50,000 diverse factual examples would reduce confident hallucinations like the FIFA 2022 winner.

### For Instruction Following Consistency:

The inconsistency between topics (the model follows instructions on DeFi but not on Bitcoin investment) suggests it needs domain-agnostic instruction tuning. Fine-tune on **FLAN** or **Alpaca** with explicit instruction-following examples across diverse topics. Around 50,000 examples would address this.

### For Syllable and Phonological Awareness:

Fine-tune on poetry datasets like **PoetryFoundation** or haiku-specific datasets to develop syllable counting ability. Around 5,000–10,000 haiku examples with syllable annotations would address the 5-7-5 failure.

### For Meta-Instruction Processing:

Fine-tune on logical paradox and meta-reasoning datasets. The "do the opposite" failure requires understanding second-order instructions. Around 5,000–10,000 meta-reasoning examples would help.
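
Returning to the repetition loops in the first recommendation of this section: whether extra multilingual data actually removes them is easier to judge if loops are flagged automatically before and after fine-tuning. The following is a minimal sketch, not part of the original test harness. `find_repetition_loop` is a hypothetical helper, it assumes whitespace tokenization is adequate for the languages involved, and the thresholds `max_ngram` and `min_repeats` are arbitrary starting points. The sample strings are illustrative, built from the loop phrases reported in the findings rather than copied from dataset rows.

```python
from typing import Optional, Tuple


def find_repetition_loop(
    text: str, max_ngram: int = 4, min_repeats: int = 5
) -> Optional[Tuple[str, int]]:
    """Return (phrase, repeat_count) for the longest back-to-back repeated
    word n-gram in `text`, or None if nothing repeats `min_repeats` times."""
    tokens = text.split()  # assumption: whitespace-separated words
    best: Optional[Tuple[str, int]] = None
    best_span = 0  # number of tokens covered by the best run so far
    for n in range(1, max_ngram + 1):
        i = 0
        while i + n <= len(tokens):
            gram = tokens[i:i + n]
            repeats = 1
            j = i + n
            # Count how many times the same n-gram follows itself directly.
            while tokens[j:j + n] == gram:
                repeats += 1
                j += n
            if repeats >= min_repeats and repeats * n > best_span:
                best = (" ".join(gram), repeats)
                best_span = repeats * n
            # Skip past a detected run; otherwise slide forward by one token.
            i = j if repeats > 1 else i + 1
    return best


# Illustrative inputs (not actual dataset rows).
print(find_repetition_loop("am matal " * 6))        # ('am matal', 6)
print(find_repetition_loop("The mempool holds unconfirmed transactions"))  # None
```

Run over the `model_output` column, a check like this would give a per-language loop rate that can be compared across fine-tuning runs.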

---

## Estimated Dataset Sizes Needed

| Error Type | Suggested Dataset | Estimated Size Needed |
|---|---|---|
| Low-resource language loops | FLORES-200 | ~10,000–50,000 per language |
| Geography hallucinations | WikiData / Natural Questions | ~20,000–30,000 examples |
| General world knowledge | TriviaQA / MMLU | ~30,000–50,000 examples |
| Instruction following | FLAN / Alpaca | ~50,000 examples |
| Syllable/phonological awareness | PoetryFoundation / Haiku datasets | ~5,000–10,000 examples |
| Meta-instruction processing | Custom paradox/logic dataset | ~5,000–10,000 examples |
| Sports/pop culture knowledge | TriviaQA filtered | ~10,000–15,000 examples |
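
To get a sense of the total data volume the table implies, the ranges can be summed. This is a rough sketch under two assumptions that the table does not state: the per-language row is applied to the five languages named in the FLORES-200 suggestion (Yoruba, Amharic, Wolof, Zulu, Punjabi), and the rows are treated as non-overlapping even though two of them draw on TriviaQA. The names `ESTIMATES` and `total_examples` are illustrative.

```python
# (low, high) example counts copied from the table above.
ESTIMATES = {
    "Low-resource language loops": (10_000, 50_000),  # per language
    "Geography hallucinations": (20_000, 30_000),
    "General world knowledge": (30_000, 50_000),
    "Instruction following": (50_000, 50_000),
    "Syllable/phonological awareness": (5_000, 10_000),
    "Meta-instruction processing": (5_000, 10_000),
    "Sports/pop culture knowledge": (10_000, 15_000),
}

# Assumption: the per-language estimate applies to these five languages.
LANGUAGES = ["Yoruba", "Amharic", "Wolof", "Zulu", "Punjabi"]
PER_LANGUAGE_ROW = "Low-resource language loops"


def total_examples(estimates, per_language_row, n_languages):
    """Sum the low and high ends of every row, scaling the per-language row."""
    low_total = high_total = 0
    for name, (low, high) in estimates.items():
        scale = n_languages if name == per_language_row else 1
        low_total += low * scale
        high_total += high * scale
    return low_total, high_total


print(total_examples(ESTIMATES, PER_LANGUAGE_ROW, len(LANGUAGES)))
# (170000, 415000)
```

Under these assumptions the suggestions add up to roughly 170,000 to 415,000 examples, with the multilingual row alone accounting for 50,000 to 250,000 of that.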
