JackRabbit1122/nanbeige4-3b-base-blind-spots
Hugging Face | Updated: 2026-03-24 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/JackRabbit1122/nanbeige4-3b-base-blind-spots
---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: category
    dtype: string
  - name: input
    dtype: string
  - name: expected_output
    dtype: string
  - name: model_output
    dtype: string
  - name: passed
    dtype: bool
  splits:
  - name: train
    num_bytes: 11396
    num_examples: 65
  download_size: 11677
  dataset_size: 11396
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- en
- zh
- yo
- sw
- ur
- hi
- ha
- am
- wo
- pa
- bn
- ar
- eo
- zu
license: apache-2.0
task_categories:
- text-generation
tags:
- blind-spots
- multilingual
- evaluation
- base-model
- chinese
- reasoning
pretty_name: Nanbeige4-3B-Base Blind Spot Dataset
size_categories:
- n<100
---
# Nanbeige4-3B-Base Blind Spot Dataset
## Model Tested
**[Nanbeige/Nanbeige4-3B-Base](https://huggingface.co/Nanbeige/Nanbeige4-3B-Base)**
- Architecture: LLaMA-based
- Parameters: 4B
- Type: Base model (not instruction-tuned)
- Primary Language: Chinese + English
- Released: December 2025
---
## How I Loaded the Model
```python
# Notebook (Colab) cell: the leading "!" runs pip as a shell command.
!pip install transformers torch accelerate -q

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Nanbeige/Nanbeige4-3B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,  # older transformers releases call this torch_dtype
    device_map="auto",     # let accelerate place the weights on the GPU
)


def generate(prompt, max_new_tokens=150):
    """Return only the model's continuation of `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.1,  # low temperature keeps sampling close to greedy
            top_p=0.9,
            top_k=50,
            repetition_penalty=1.1,
        )
    # Drop the prompt tokens so only newly generated text is decoded.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```
Platform: Google Colab (T4 GPU, free tier)
---
## Dataset Description
This dataset contains 65 test cases designed to probe the blind spots of
`Nanbeige/Nanbeige4-3B-Base`. Each entry contains:
- `id`: Integer identifier of the test case
- `category`: The category the test case belongs to
- `input`: The prompt given to the model
- `expected_output`: The correct answer
- `model_output`: What the model actually produced
- `passed`: Whether the model answered correctly
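
As a rough illustration of the schema, the sketch below checks that a row carries the six fields declared in the card's metadata. The `validate_row` helper, the mapping from dtypes to Python types, and the sample row are assumptions made for this example; the sample is not copied from the dataset.

```python
# Field names come from the dataset_info metadata; the Python types
# chosen for each dtype are an assumption of this sketch.
EXPECTED_FIELDS = {
    "id": int,
    "category": str,
    "input": str,
    "expected_output": str,
    "model_output": str,
    "passed": bool,
}


def validate_row(row):
    """Return a list of problems found in one row (empty list means OK)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    return problems


# Hypothetical row, for illustration only.
sample = {
    "id": 1,
    "category": "Math & Logic",
    "input": "What is 2 + 2?",
    "expected_output": "4",
    "model_output": "4",
    "passed": True,
}
print(validate_row(sample))  # []
```

To run the check on the real data, the rows could be obtained with the Hugging Face `datasets` library, for example `load_dataset("JackRabbit1122/nanbeige4-3b-base-blind-spots", split="train")`, assuming that library is installed.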
### Results Summary
| Category | Tests | Passed | Failed | Score |
|---|---|---|---|---|
| Multilingual (10 languages) | 15 | 3 | 12 | 20% |
| Math & Logic | 12 | 6 | 6 | 50% |
| Commonsense & Science | 8 | 4 | 4 | 50% |
| Code & Programming | 5 | 3 | 2 | 60% |
| Complex Reasoning | 10 | 6 | 4 | 60% |
| Instruction Following | 8 | 4 | 4 | 50% |
| Factual Knowledge | 7 | 5 | 2 | 71% |
| **Total** | **65** | **31** | **34** | **47%** |
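
The scores in the table can be recomputed from the `category` and `passed` columns. The sketch below is one way to do that; `score_by_category` and `percent` are hypothetical helper names, and the `TABLE` counts are copied from the table above so the totals can be checked without downloading anything.

```python
from collections import defaultdict


def score_by_category(rows):
    """Group rows by `category` and count tests and passes."""
    counts = defaultdict(lambda: {"tests": 0, "passed": 0})
    for row in rows:
        counts[row["category"]]["tests"] += 1
        counts[row["category"]]["passed"] += int(row["passed"])
    return dict(counts)


def percent(passed, tests):
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / tests, 1)


# (tests, passed) pairs copied from the Results Summary table.
TABLE = {
    "Multilingual": (15, 3),
    "Math & Logic": (12, 6),
    "Commonsense & Science": (8, 4),
    "Code & Programming": (5, 3),
    "Complex Reasoning": (10, 6),
    "Instruction Following": (8, 4),
    "Factual Knowledge": (7, 5),
}

total_tests = sum(tests for tests, _ in TABLE.values())
total_passed = sum(passed for _, passed in TABLE.values())
print(total_tests, total_passed, percent(total_passed, total_tests))
# 65 31 47.7
```

The per-category counts add up to the reported totals, and the overall pass rate comes out at 47.7% before rounding.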
---
## Key Findings
### 1. Output Truncation Due to Chain-of-Thought *(Most Severe)*
The model reasons inside `<think>` tags before answering. The reasoning
is often correct, but generation hits the token limit
(`max_new_tokens=150` in the setup above) before a final answer appears.
This affected more than 15 test cases where the model was clearly on the
right track but never finished. It is the single biggest practical
weakness: the reasoning is there but the answer never arrives.
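
Truncated cases of this kind can be flagged automatically. The heuristic below is a sketch only: it assumes the reasoning block is closed by a `</think>` tag, which the card does not state explicitly, and `looks_truncated` is a hypothetical helper name.

```python
def looks_truncated(model_output, open_tag="<think>", close_tag="</think>"):
    """Heuristic: True if reasoning was opened but never closed,
    or if no text follows the last closing tag."""
    if open_tag not in model_output:
        return False  # no reasoning block, so nothing to judge
    _, separator, answer = model_output.rpartition(close_tag)
    if not separator:
        return True  # opened but never closed
    return answer.strip() == ""


print(looks_truncated("<think>14 + 9 = 23, so the answer"))   # True
print(looks_truncated("<think>14 + 9 = 23</think> XXIII"))    # False
print(looks_truncated("XXIII"))                               # False
```

Applied to the `model_output` column, a check like this would separate "never finished" failures from genuinely wrong answers.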
### 2. Wrong Language Identification
When prompted in Esperanto, the model identified it as Spanish and began
reasoning in the wrong language entirely. This confusion is not unique
to this model: tiny-aya-base also confused Esperanto with Spanish, while
DMind-3-mini confused it with Albanian.
### 3. Cross-Language Hallucination on African Languages
The model confused two distinct African languages: it translated the
Swahili word "Thelathini" (thirty) as a Zulu word meaning "young woman."
It also translated the Amharic phrase "5 minus 3" as "How many days are
in a month?", which suggests the model conflates low-resource African
languages into one undifferentiated category.
### 4. Factual Hallucinations on Non-Chinese Topics
- Called Pakistan's Independence Day "Bharatvarsh Diwas", an invented name
- Translated the ending of an Arabic proverb as "walls" instead of "stone"
- Translated Zulu "I see you" as "I see you, I love you", adding a phrase
- Misread Punjabi "5 plus 5" as "5 times 5"
### 5. Strong Complex Reasoning *(Positive Finding)*
Unlike tiny-aya-base, Nanbeige performed well on complex tasks:
- Correctly solved multi-step apple pricing ($7.20)
- Correctly converted Roman numerals XIV + IX = XXIII
- Correctly identified Hundred Years' War duration (116 years, 1337–1453)
- Correctly identified code bug (subtraction instead of addition)
- Correctly solved recursion f(3) = 6
- Correctly applied modus tollens logic
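
The Roman numeral result in the list above is easy to verify independently. The helpers below are a minimal sketch written for this check; they are not part of the original test harness.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

ROMAN_PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def roman_to_int(numeral):
    """Convert a Roman numeral to an integer."""
    total = 0
    for i, symbol in enumerate(numeral):
        value = ROMAN_VALUES[symbol]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if i + 1 < len(numeral) and value < ROMAN_VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


def int_to_roman(number):
    """Convert a positive integer to a Roman numeral."""
    parts = []
    for value, symbol in ROMAN_PAIRS:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


print(int_to_roman(roman_to_int("XIV") + roman_to_int("IX")))  # XXIII
```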
### 6. Infinite Loops on Low-Resource Languages
Like all models in this study, Nanbeige entered infinite loops on Swahili
proverbs and other low-resource African language prompts.
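
Looping outputs can be detected with a simple repetition check. The sketch below flags text in which the same run of words repeats back to back; the function name, the default thresholds, and the example strings are assumptions for illustration and are not taken from the dataset.

```python
def has_repetition_loop(text, ngram_size=4, min_repeats=3):
    """True if any run of `ngram_size` words occurs back to back
    at least `min_repeats` times."""
    words = text.split()
    last_start = len(words) - ngram_size * min_repeats
    for start in range(last_start + 1):
        chunk = words[start:start + ngram_size]
        repeats = 1
        position = start + ngram_size
        # Count consecutive copies of the chunk that follow it directly.
        while words[position:position + ngram_size] == chunk:
            repeats += 1
            position += ngram_size
        if repeats >= min_repeats:
            return True
    return False


looping = "Haraka haraka haina baraka. " * 5   # hypothetical looping output
print(has_repetition_loop(looping))                              # True
print(has_repetition_loop("The capital of Kenya is Nairobi."))   # False
```

Tuning `ngram_size` and `min_repeats` trades sensitivity against false positives on text that legitimately repeats short phrases.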
### 7. Instruction Following Failure
The model ignored explicit instructions such as "answer in one word" and
"answer yes or no only", producing multi-paragraph responses instead.
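
Format constraints like these can be scored mechanically. The two checkers below are a sketch under the assumption that surrounding punctuation and letter case should be ignored; the function names are hypothetical.

```python
PUNCTUATION = ".,!?\"'"


def is_one_word(answer):
    """True if the answer is a single word, ignoring surrounding punctuation."""
    return len(answer.strip().strip(PUNCTUATION).split()) == 1


def is_yes_or_no(answer):
    """True if the answer is exactly "yes" or "no", ignoring case and punctuation."""
    return answer.strip().strip(PUNCTUATION).lower() in {"yes", "no"}


print(is_one_word("Paris."))                  # True
print(is_one_word("The capital is Paris."))   # False
print(is_yes_or_no("Yes."))                   # True
print(is_yes_or_no("Yes, because it is."))    # False
```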
### 8. Self-Awareness Hallucination
When asked to count words in its previous response, the model fabricated
a previous response and counted its words, rather than acknowledging
that it could not access prior context.
---
## Comparison With Other Models in This Study
This dataset is part of a three-model blind spot study also covering
`CohereLabs/tiny-aya-base` and `DMindAI/DMind-3-mini`.
| Aspect | Nanbeige4-3B | tiny-aya-base | DMind-3-mini |
|---|---|---|---|
| Total tests | 65 | 65 | 90 |
| Overall score | 47% | 32% | 66% |
| Main failure | Output truncation | Chinese MCQ contamination | Geography hallucination |
| Low-resource languages | Loops + wrong translations | Infinite loops | Infinite loops |
| Reasoning style | Chain-of-thought (truncated) | Pattern matching | Chain-of-thought |
| Speed on Colab T4 | Very slow (~3 min/prompt) | Fast (~20 sec/prompt) | Slow (~2 min/prompt) |
| Self-contradiction | Rare | Very common | Inconsistent |
| Complex reasoning | Strong | Weak | Strong (in domain) |
| Instruction following | Poor | Poor | Inconsistent |
| Unique failure | Hallucinates holiday names | Marks correct MCQ answers as wrong | Invents non-existent cities |
All three datasets:
- [JackRabbit1122/tiny-aya-base-blind-spots](https://huggingface.co/datasets/JackRabbit1122/tiny-aya-base-blind-spots)
- [JackRabbit1122/nanbeige4-3b-base-blind-spots](https://huggingface.co/datasets/JackRabbit1122/nanbeige4-3b-base-blind-spots)
- [JackRabbit1122/dmind-3-mini-blind-spots](https://huggingface.co/datasets/JackRabbit1122/dmind-3-mini-blind-spots)
---
## What Fine-Tuning Data Would Fix These Errors?
### For Output Truncation:
This is primarily a token-limit issue: raising `max_new_tokens` to 300 or
more resolves most truncation failures. The underlying verbosity could be
reduced by fine-tuning on concise-answer datasets such as
**Natural Questions** or **TriviaQA**; around 20,000 examples would
likely help.
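
Before any fine-tuning, the token-limit fix can be applied at inference time by retrying with a larger budget whenever the output looks cut off. The sketch below assumes a `generate_fn(prompt, max_new_tokens)` callable like the `generate` function in this card and a closing `</think>` tag; the budget values, `is_cut_off`, `generate_with_retry`, and `fake_generate` are all hypothetical.

```python
def is_cut_off(output, open_tag="<think>", close_tag="</think>"):
    """True if a reasoning block was opened but no answer follows it."""
    if open_tag not in output:
        return False
    _, separator, answer = output.rpartition(close_tag)
    return not separator or not answer.strip()


def generate_with_retry(generate_fn, prompt, budgets=(150, 300, 600)):
    """Call generate_fn with growing token budgets until the output
    no longer looks cut off. Returns (output, budget_used)."""
    output, budget = "", None
    for budget in budgets:
        output = generate_fn(prompt, budget)
        if not is_cut_off(output):
            break
    return output, budget


def fake_generate(prompt, max_new_tokens):
    """Stand-in for the real model so the sketch runs without a GPU."""
    if max_new_tokens < 300:
        return "<think>still reasoning"
    return "<think>done</think> 42"


print(generate_with_retry(fake_generate, "What is 6 * 7?"))
# ('<think>done</think> 42', 300)
```

With the real model, passing the card's `generate` function in place of `fake_generate` would retry only the prompts that actually need the extra tokens, which matters given the slow per-prompt speed on a T4.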
### For Cross-Language Confusion:
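Fine-tune on multilingual parallel data such as **FLORES-200**, which
covers 200 languages including low-resource African languages. Focus on
Swahili, Yoruba, Wolof, Amharic, Zulu, and Punjabi. Around 10,000–50,000
examples per language; FLORES-200 itself provides only a few thousand
sentences per language, so reaching that target would require
supplementary corpora.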
### For Factual Hallucinations:
Fine-tune on **TriviaQA**, **Natural Questions**, or **MMLU** focusing
on non-Chinese world knowledge — South Asian, African, and Middle Eastern
history and culture. Around 20,000–50,000 diverse factual Q&A pairs.
### For Instruction Following:
Fine-tune on **FLAN** or **Alpaca** instruction datasets. This weakness
is expected of a base model; approximately 50,000 instruction-following
examples would address it.
### For Math Across Scripts:
Fine-tune on **MGSM** (Multilingual Grade School Math) covering math in
10+ languages. Around 5,000–10,000 examples covering non-Latin scripts.
---
## Estimated Dataset Sizes Needed
| Error Type | Suggested Dataset | Estimated Size Needed |
|---|---|---|
| Output truncation | Natural Questions / TriviaQA | Increase max_new_tokens first |
| Cross-language confusion | FLORES-200 | ~10,000–50,000 per language |
| Factual hallucinations | MMLU / TriviaQA | ~20,000–50,000 examples |
| Instruction following | FLAN / Alpaca | ~50,000 examples |
| Math across scripts | MGSM multilingual | ~5,000–10,000 examples |
| Low-resource language loops | mC4 / CC-100 | ~10,000 per language |
| Complex reasoning gaps | GSM8K / CommonsenseQA | ~10,000–20,000 examples |
Provider:
JackRabbit1122



