JackRabbit1122/dmind-3-mini-blind-spots
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/JackRabbit1122/dmind-3-mini-blind-spots
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: group
    dtype: string
  - name: category
    dtype: string
  - name: input
    dtype: string
  - name: expected_output
    dtype: string
  - name: model_output
    dtype: string
  - name: passed
    dtype: bool
  splits:
  - name: train
    num_bytes: 17348
    num_examples: 90
  download_size: 14795
  dataset_size: 17348
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- en
- yo
- sw
- ur
- pa
- am
- wo
- bn
- ha
- eo
- zu
license: apache-2.0
task_categories:
- text-generation
tags:
- blind-spots
- evaluation
- finance
- web3
- defi
- crypto
- base-model
pretty_name: DMind-3-mini Blind Spot Dataset
size_categories:
- n<100
---
# DMind-3-mini Blind Spot Dataset
## Model Tested
**[DMindAI/DMind-3-mini](https://huggingface.co/DMindAI/DMind-3-mini)**
- Architecture: Qwen3.5-based
- Parameters: 4B
- Type: Domain-specific fine-tuned model (Web3/DeFi/Finance)
- Primary Use: Computational Financial Actuary for DeFi analytics
- Training: Fine-tuned on 82,000 high-value private financial samples
---
## How I Loaded the Model
```python
# Run the two installs in a notebook cell first (Colab/Jupyter syntax):
#   !pip install --upgrade transformers -q
#   !pip install torch accelerate -q
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "DMindAI/DMind-3-mini"

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)

print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def generate(prompt, max_new_tokens=150):
    """Return only the model's continuation of `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,  # sampling at a low temperature
            temperature=0.1,
            top_p=0.9,
            top_k=50,
            repetition_penalty=1.1,
        )
    # Drop the prompt tokens so only the newly generated text is decoded.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```
**Note:** The model needs at least 12 GB of VRAM. A Google Colab T4 GPU
(16 GB) works, but memory is tight. Because the model uses the newer
`qwen3_5` architecture, an older `transformers` installation may need upgrading:
```
pip install --upgrade transformers
```
Platform: Google Colab (T4 GPU, free tier)
---
## Dataset Description
This dataset contains 90 test cases organized into 3 groups designed to
probe the blind spots of `DMindAI/DMind-3-mini`. Each entry contains:
- `id`: Test identifier (A1-A30, B1-B30, E1-E30)
- `group`: Test group (Finance/Web3 Domain, Outside Domain, Extended Tests)
- `category`: Specific test category
- `input`: The prompt given to the model
- `expected_output`: The correct answer
- `model_output`: What the model actually produced
- `passed`: Whether the model answered correctly
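As a minimal sketch of how these fields can be used, the snippet below collects the failed test ids per category. It works on plain dictionaries that follow the schema above. The function name `failures_by_category` and the sample rows are hypothetical placeholders, not real dataset entries. With network access, the actual rows could be obtained with `datasets.load_dataset("JackRabbit1122/dmind-3-mini-blind-spots", split="train")`.

```python
from collections import defaultdict

def failures_by_category(rows):
    """Map each category to the ids of the test cases that did not pass."""
    failed = defaultdict(list)
    for row in rows:
        if not row["passed"]:
            failed[row["category"]].append(row["id"])
    return dict(failed)

# Hypothetical rows that only mimic the schema; they are NOT real dataset entries.
sample_rows = [
    {"id": "A1", "group": "Finance/Web3 Domain", "category": "placeholder-category-1",
     "input": "...", "expected_output": "...", "model_output": "...", "passed": True},
    {"id": "B1", "group": "Outside Domain", "category": "placeholder-category-2",
     "input": "...", "expected_output": "...", "model_output": "...", "passed": False},
    {"id": "B2", "group": "Outside Domain", "category": "placeholder-category-2",
     "input": "...", "expected_output": "...", "model_output": "...", "passed": False},
]

print(failures_by_category(sample_rows))
```

The same function accepts the rows of a loaded `datasets.Dataset`, since iterating over one yields dictionaries keyed by column name.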
### Results Summary
| Group | Tests | Passed | Failed | Score |
|---|---|---|---|---|
| Group A: Finance/Web3 Domain | 30 | 30 | 0 | 100% |
| Group B: Outside Domain | 30 | 12 | 18 | 40% |
| Extended Tests | 30 | 17 | 13 | 57% |
| **Total** | **90** | **59** | **31** | **66%** |
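
The percentages in the table follow directly from the pass counts. A short check, using only the numbers reported above (the helper name `score` is illustrative):

```python
# (passed, total) pairs copied from the results table.
results = {
    "Group A: Finance/Web3 Domain": (30, 30),
    "Group B: Outside Domain": (12, 30),
    "Extended Tests": (17, 30),
}

def score(passed, total):
    """Pass rate as a percentage."""
    return 100 * passed / total

total_passed = sum(passed for passed, _ in results.values())
total_tests = sum(total for _, total in results.values())

for name, (passed, total) in results.items():
    print(f"{name}: {score(passed, total):.1f}%")
print(f"Total: {total_passed}/{total_tests} = {score(total_passed, total_tests):.1f}%")
```

The unrounded values are 56.7% for the Extended Tests and 65.6% overall, which the table reports as 57% and 66%.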
---
## Key Findings
### 1. Perfect Domain Expertise *(Most Remarkable Finding)*
The model achieved a **100% pass rate on all 30 Finance/Web3 tests**,
covering DeFi mechanics, smart contract security, tokenomics, blockchain
consensus, crypto risk analysis, and financial math. This is a strong result
for a 4B-parameter model and suggests that domain-specific fine-tuning
was effective. It correctly explained:
- Impermanent loss using the constant product formula
- Reentrancy attacks with OpenZeppelin prevention methods
- Triangular arbitrage with USD/EUR/GBP example
- Flash loan attack atomic transaction mechanics
- Blockchain trilemma (decentralization vs scalability vs security)
### 2. Catastrophic Failure on Low-Resource Languages *(Severe)*
Despite its strength in its own domain, the model collapses on African and
South Asian language prompts, producing repetition loops on Yoruba, Amharic,
Wolof, and Zulu that continue until the generation limit is reached. The same
pattern appeared in all three models tested in this study. Key failures:
- Yoruba: Loop of random Yoruba text
- Amharic: Loop of garbled Amharic characters
- Wolof: Repeated "am matal" until the output was cut off
- Zulu: Repeated "siyobonisa" until the output was cut off
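
Loops like the repeated "am matal" and "siyobonisa" can be flagged automatically. The sketch below is one simple way to do so, assuming whitespace-delimited words. The function name `find_repeated_phrase` and its thresholds are illustrative choices, and this is not the method used to fill the `passed` column.

```python
def find_repeated_phrase(text, max_phrase_words=4, min_repeats=5):
    """Return the first phrase of up to `max_phrase_words` words that occurs
    back to back at least `min_repeats` times, or None if there is none."""
    words = text.split()
    for size in range(1, max_phrase_words + 1):
        # Only start positions that leave room for `min_repeats` copies.
        for start in range(len(words) - size * min_repeats + 1):
            phrase = words[start:start + size]
            repeats = 1
            pos = start + size
            # Count consecutive copies of the phrase.
            while words[pos:pos + size] == phrase:
                repeats += 1
                pos += size
            if repeats >= min_repeats:
                return " ".join(phrase)
    return None

print(find_repeated_phrase("am matal " * 10))
print(find_repeated_phrase("Bitcoin is a decentralized digital currency"))
```

A word-level check of this kind would miss loops in scripts written without spaces and loops with small variations between repetitions, so it is a first-pass filter at best.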
### 3. Language Misidentification
The model identified Esperanto as Albanian — a unique failure not seen
in other models (tiny-aya-base identified it as Spanish, Nanbeige also
identified it as Spanish). The model then began reasoning about the phrase
"Mi amas vin" using Albanian grammar rules, producing a completely wrong
translation path.
### 4. Hallucinated Geography Outside Its Domain
When asked about world capitals outside its training distribution:
- Said the capital of Burkina Faso is "Ouakam". Ouakam is actually a
neighborhood in Dakar, Senegal; the correct answer is Ouagadougou.
- Hallucinated "Santa Cruz de Suribamba" as the capital of Bolivia. No such
city appears to exist. Bolivia's constitutional capital is Sucre (La Paz is
the seat of government).
- Placed Asmara, the capital of Eritrea, in Ethiopia.

The model also invented a border for Burkina Faso, saying it borders
Cameroon; it does not.
### 5. Sports Knowledge Hallucination
When asked who won the FIFA World Cup in 2022, the model confidently said
France — Argentina actually won. It then built a DeFi analogy on top of
the wrong fact, compounding the error.
### 6. Inconsistent Instruction Following
The model followed "yes or no only" instructions for DeFi questions but
ignored them for Bitcoin investment questions, which suggests that
instruction compliance varies with the topic and may depend on how closely
a prompt matches its training data.
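
Compliance with a "yes or no only" instruction can be checked mechanically. The following is a minimal sketch. The function name `is_yes_no_only` and the example strings are hypothetical and are not actual model outputs from this dataset.

```python
import re

def is_yes_no_only(output):
    """True if the output is just 'yes' or 'no', ignoring case, surrounding
    whitespace, and one trailing period or exclamation mark."""
    pattern = r"\s*(yes|no)[.!]?\s*"
    return re.fullmatch(pattern, output, flags=re.IGNORECASE) is not None

# Hypothetical outputs, for illustration only.
print(is_yes_no_only("Yes."))
print(is_yes_no_only("Yes, because the protocol has been audited."))
```

A strict check like this treats any added explanation as a failure, which matches the literal wording of the instruction.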
### 7. Syllable Miscounting
When asked to write a haiku (5-7-5 syllables), the model said "Blockchain"
has 5 syllables — it has 2 (Block-chain). It then built an entire haiku
on this wrong foundation. This reveals a fundamental gap in phonological
awareness despite strong semantic understanding.
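
For reference, a rough syllable estimate is easy to compute. The sketch below uses a common vowel-group heuristic for English. It is an approximation that miscounts many words, and the function name `estimate_syllables` is illustrative, but it does give the expected count for "Blockchain".

```python
import re

def estimate_syllables(word):
    """Rough English syllable estimate: count groups of consecutive vowels
    and discount a silent trailing 'e'. Heuristic only, not a dictionary lookup."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # A final 'e' is usually silent, except in endings such as '-le'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

print(estimate_syllables("Blockchain"))  # 2
```

Checking a full haiku would mean summing these estimates per line and comparing them with 5, 7, and 5, keeping in mind that the heuristic itself can be off by a syllable.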
### 8. Hallucinated Self-Generated Questions
After correctly explaining the Bitcoin mempool, the model spontaneously
appended "How do you calculate the median value of an array?" — generating
its own irrelevant follow-up question. This suggests the model was trained
on Q&A datasets where questions follow answers.
### 9. Paradox Confusion
When asked "Do the opposite of what I say: Tell me something false about
Bitcoin," the model failed to resolve the paradox and told false things
about Bitcoin instead of true things (opposite of false = true). This
reveals a gap in meta-instruction processing.
---
## Comparison With Other Models Tested
This dataset is part of a three-model blind spot study. The same
outside-domain test cases were run across all three models.
| Aspect | DMind-3-mini | Nanbeige4-3B-Base | tiny-aya-base |
|---|---|---|---|
| Domain expertise | 100% (Finance) | N/A (general) | N/A (general) |
| Outside domain score | 40% | 47% | 32% |
| Low-resource languages | Infinite loops | Infinite loops + wrong translations | Infinite loops |
| Reasoning style | Chain-of-thought | Chain-of-thought (truncated) | Pattern matching (MCQ) |
| Geography hallucination | Severe | Moderate | Moderate |
| Instruction following | Inconsistent | Poor | Poor |
| Speed on Colab T4 | Slow (~2 min/prompt) | Very slow (~3 min/prompt) | Fast (~20 sec/prompt) |
| Unique failure | Invents non-existent cities | Truncated reasoning | Chinese MCQ contamination |
All three datasets are available on Hugging Face:
- [JackRabbit1122/tiny-aya-base-blind-spots](https://huggingface.co/datasets/JackRabbit1122/tiny-aya-base-blind-spots)
- [JackRabbit1122/nanbeige4-3b-base-blind-spots](https://huggingface.co/datasets/JackRabbit1122/nanbeige4-3b-base-blind-spots)
- [JackRabbit1122/dmind-3-mini-blind-spots](https://huggingface.co/datasets/JackRabbit1122/dmind-3-mini-blind-spots)
---
## What Fine-Tuning Data Would Fix These Errors?
### For Low-Resource Language Infinite Loops:
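Fine-tune on multilingual data such as **FLORES-200**, with a focus on
African and South Asian languages (Yoruba, Amharic, Wolof, Zulu, Punjabi).
Around 10,000–50,000 examples per language would likely establish basic
generative capability and reduce repetition loops. Because FLORES-200 is an
evaluation benchmark with only about 2,000 publicly released sentences per
language, it would need to be supplemented with larger corpora to reach that
size. The architecture should be able to handle this, since the model already
does well in English; what it lacks is multilingual coverage.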
### For Geography Hallucinations:
Fine-tune on **Wikidata** or **Natural Questions** with a focus on world
geography, specifically African, South American, and Central Asian
capitals and borders. Around 20,000–30,000 factual geography Q&A pairs
should significantly reduce hallucination on world capitals.
### For Sports and General Knowledge:
Fine-tune on **TriviaQA** or **MMLU** to cover general world knowledge
outside the finance/crypto domain. Around 30,000–50,000 diverse factual
examples would reduce confident hallucinations like the FIFA 2022 winner.
### For Instruction Following Consistency:
The inconsistency between domains (follows instructions on DeFi but not
on Bitcoin investment) suggests the model needs instruction-tuning that
is domain-agnostic. Fine-tune on **FLAN** or **Alpaca** with explicit
instruction-following examples across diverse topics. Around 50,000
examples would address this.
### For Syllable and Phonological Awareness:
Fine-tune on poetry datasets like **PoetryFoundation** or haiku-specific
datasets to develop syllable counting ability. Around 5,000–10,000 haiku
examples with syllable annotations would address the 5-7-5 failure.
### For Meta-Instruction Processing:
Fine-tune on logical paradox and meta-reasoning datasets. The "do the
opposite" instruction failure requires understanding second-order
instructions. Around 5,000–10,000 meta-reasoning examples would help.
---
## Estimated Dataset Sizes Needed
| Error Type | Suggested Dataset | Estimated Size Needed |
|---|---|---|
| Low-resource language loops | FLORES-200 | ~10,000–50,000 per language |
| Geography hallucinations | Wikidata / Natural Questions | ~20,000–30,000 examples |
| General world knowledge | TriviaQA / MMLU | ~30,000–50,000 examples |
| Instruction following | FLAN / Alpaca | ~50,000 examples |
| Syllable/phonological awareness | PoetryFoundation / Haiku datasets | ~5,000–10,000 examples |
| Meta-instruction processing | Custom paradox/logic dataset | ~5,000–10,000 examples |
| Sports/pop culture knowledge | TriviaQA filtered | ~10,000–15,000 examples |
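
To get a sense of the combined scale, the ranges in the table can be summed. This is a rough sketch under two assumptions that the table does not state: the rows are treated as additive, and the per-language estimate is applied to the five languages named in the low-resource section (Yoruba, Amharic, Wolof, Zulu, Punjabi).

```python
# (error type, low, high, counted per language?) copied from the table above.
ESTIMATES = [
    ("Low-resource language loops", 10_000, 50_000, True),
    ("Geography hallucinations", 20_000, 30_000, False),
    ("General world knowledge", 30_000, 50_000, False),
    ("Instruction following", 50_000, 50_000, False),
    ("Syllable/phonological awareness", 5_000, 10_000, False),
    ("Meta-instruction processing", 5_000, 10_000, False),
    ("Sports/pop culture knowledge", 10_000, 15_000, False),
]

# Assumption: the five languages listed in the low-resource section.
NUM_LANGUAGES = 5

def total_range(estimates, num_languages):
    """Sum the low and high estimates, scaling the per-language rows."""
    low = high = 0
    for _name, row_low, row_high, per_language in estimates:
        factor = num_languages if per_language else 1
        low += row_low * factor
        high += row_high * factor
    return low, high

low, high = total_range(ESTIMATES, NUM_LANGUAGES)
print(f"Combined estimate: {low:,} to {high:,} examples")
```

Under these assumptions the combined estimate is roughly 170,000 to 415,000 examples, which exceeds the 82,000 samples reported for the model's original fine-tuning.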
Provided by:
JackRabbit1122



