
JackRabbit1122/nanbeige4-3b-base-blind-spots

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/JackRabbit1122/nanbeige4-3b-base-blind-spots
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: category
    dtype: string
  - name: input
    dtype: string
  - name: expected_output
    dtype: string
  - name: model_output
    dtype: string
  - name: passed
    dtype: bool
  splits:
  - name: train
    num_bytes: 11396
    num_examples: 65
  download_size: 11677
  dataset_size: 11396
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- en
- zh
- yo
- sw
- ur
- hi
- ha
- am
- wo
- pa
- bn
- ar
- eo
- zu
license: apache-2.0
task_categories:
- text-generation
tags:
- blind-spots
- multilingual
- evaluation
- base-model
- chinese
- reasoning
pretty_name: Nanbeige4-3B-Base Blind Spot Dataset
size_categories:
- n<100
---

# Nanbeige4-3B-Base Blind Spot Dataset

## Model Tested

**[Nanbeige/Nanbeige4-3B-Base](https://huggingface.co/Nanbeige/Nanbeige4-3B-Base)**

- Architecture: LLaMA-based
- Parameters: 4B
- Type: Base model (not instruction-tuned)
- Primary Language: Chinese + English
- Released: December 2025

---

## How I Loaded the Model

```python
# Colab/Jupyter cell: the leading "!" runs pip as a shell command.
!pip install transformers torch accelerate -q

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Nanbeige/Nanbeige4-3B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto"
)

def generate(prompt, max_new_tokens=150):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.1,
            top_p=0.9,
            top_k=50,
            repetition_penalty=1.1
        )
    # Drop the prompt tokens so only the newly generated text is decoded.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Platform: Google Colab (T4 GPU, free tier)

---

## Dataset Description

This dataset contains 65 test cases designed to probe the blind spots of `Nanbeige/Nanbeige4-3B-Base`.
Each entry contains:

- `input`: The prompt given to the model
- `expected_output`: The correct answer
- `model_output`: What the model actually produced
- `passed`: Whether the model answered correctly

### Results Summary

| Category | Tests | Passed | Failed | Score |
|---|---|---|---|---|
| Multilingual (10 languages) | 15 | 3 | 12 | 20% |
| Math & Logic | 12 | 6 | 6 | 50% |
| Commonsense & Science | 8 | 4 | 4 | 50% |
| Code & Programming | 5 | 3 | 2 | 60% |
| Complex Reasoning | 10 | 6 | 4 | 60% |
| Instruction Following | 8 | 4 | 4 | 50% |
| Factual Knowledge | 7 | 5 | 2 | 71% |
| **Total** | **65** | **31** | **34** | **47%** |

---

## Key Findings

### 1. Output Truncation Due to Chain-of-Thought *(Most Severe)*

The model reasons inside `<think>` tags before answering. The reasoning is often correct, but generation hits the token limit before a final answer is produced. This affected at least 15 test cases in which the model was clearly on the right track but never finished. It is the single biggest practical weakness: the reasoning is there, but the answer never arrives.

### 2. Wrong Language Identification

When prompted in Esperanto, the model identified the language as Spanish and began reasoning in the wrong language entirely. This confusion is not unique to this model: tiny-aya-base also mistook Esperanto for Spanish, while DMind-3-mini mistook it for Albanian.

### 3. Cross-Language Hallucination on African Languages

The model confused two distinct African languages: it treated the Swahili word "Thelathini" (thirty) as a Zulu word meaning "young woman." It also translated the Amharic phrase for "5 minus 3" as "How many days are in a month?", which suggests the model conflates low-resource African languages into one undifferentiated category.

### 4. Factual Hallucinations on Non-Chinese Topics

- Called Pakistan's Independence Day "Bharatvarsh Diwas", an invented name
- Translated the ending of an Arabic proverb as "walls" instead of "stone"
- Translated the Zulu for "I see you" as "I see you, I love you", adding a clause that is not in the source
- Read the Punjabi for "5 plus 5" as "5 times 5"

### 5. Strong Complex Reasoning *(Positive Finding)*

Unlike tiny-aya-base, Nanbeige performed well on complex tasks:

- Correctly solved multi-step apple pricing ($7.20)
- Correctly evaluated the Roman-numeral sum XIV + IX = XXIII
- Correctly identified the duration of the Hundred Years' War (~116 years)
- Correctly identified a code bug (subtraction instead of addition)
- Correctly solved recursion f(3) = 6
- Correctly applied modus tollens logic

### 6. Infinite Loops on Low-Resource Languages

Like all models in this study, Nanbeige entered infinite loops on Swahili proverbs and other low-resource African language prompts.

### 7. Instruction Following Failure

The model ignored explicit instructions such as "answer in one word" and "answer yes or no only", producing multi-paragraph responses instead.

### 8. Self-Awareness Hallucination

When asked to count the words in its previous response, the model fabricated a previous response and counted its words rather than acknowledging that it had no prior context to refer to.

---

## Comparison With Other Models in This Study

This dataset is part of a three-model blind spot study also covering `CohereLabs/tiny-aya-base` and `DMindAI/DMind-3-mini`.

| Aspect | Nanbeige4-3B | tiny-aya-base | DMind-3-mini |
|---|---|---|---|
| Total tests | 65 | 65 | 90 |
| Overall score | 47% | 32% | 66% |
| Main failure | Output truncation | Chinese MCQ contamination | Geography hallucination |
| Low-resource languages | Loops + wrong translations | Infinite loops | Infinite loops |
| Reasoning style | Chain-of-thought (truncated) | Pattern matching | Chain-of-thought |
| Speed on Colab T4 | Very slow (~3 min/prompt) | Fast (~20 sec/prompt) | Slow (~2 min/prompt) |
| Self-contradiction | Rare | Very common | Inconsistent |
| Complex reasoning | Strong | Weak | Strong (in domain) |
| Instruction following | Poor | Poor | Inconsistent |
| Unique failure | Hallucinates holiday names | Marks correct MCQ answers as wrong | Invents non-existent cities |

All three datasets:

- [JackRabbit1122/tiny-aya-base-blind-spots](https://huggingface.co/datasets/JackRabbit1122/tiny-aya-base-blind-spots)
- [JackRabbit1122/nanbeige4-3b-base-blind-spots](https://huggingface.co/datasets/JackRabbit1122/nanbeige4-3b-base-blind-spots)
- [JackRabbit1122/dmind-3-mini-blind-spots](https://huggingface.co/datasets/JackRabbit1122/dmind-3-mini-blind-spots)

---

## What Fine-Tuning Data Would Fix These Errors?

### For Output Truncation:

This is primarily a token limit issue: raising `max_new_tokens` to 300 or more resolves most truncation failures. The underlying verbosity could be reduced by fine-tuning on concise answer datasets like **Natural Questions** or **TriviaQA**. Around 20,000 examples would help.

### For Cross-Language Confusion:

Fine-tune on **FLORES-200**, which covers 200 languages including low-resource African languages. Focus on Swahili, Yoruba, Wolof, Amharic, Zulu, and Punjabi, with around 10,000–50,000 examples per language.

### For Factual Hallucinations:

Fine-tune on **TriviaQA**, **Natural Questions**, or **MMLU**, focusing on non-Chinese world knowledge: South Asian, African, and Middle Eastern history and culture. Around 20,000–50,000 diverse factual Q&A pairs.
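
Returning to the output-truncation point earlier in this section: before fine-tuning anything, it may help to flag which outputs were cut off mid-reasoning. The following is a minimal sketch, not the method used to build this dataset. The helper name `split_reasoning` is hypothetical, and the closing tag `</think>` is an assumption, since this card only mentions `<think>` tags.

```python
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"  # assumption: the card only mentions `<think>` tags


def split_reasoning(text: str) -> dict:
    """Separate chain-of-thought from the final answer and flag likely truncation.

    Hypothetical helper for illustration; not part of the dataset or the model's API.
    """
    start = text.find(THINK_OPEN)
    if start == -1:
        # No reasoning block at all: treat the whole output as the answer.
        return {"reasoning": "", "answer": text.strip(), "truncated": False}

    end = text.find(THINK_CLOSE, start)
    if end == -1:
        # Reasoning was opened but never closed: most likely cut off by the token limit.
        reasoning = text[start + len(THINK_OPEN):].strip()
        return {"reasoning": reasoning, "answer": "", "truncated": True}

    reasoning = text[start + len(THINK_OPEN):end].strip()
    answer = text[end + len(THINK_CLOSE):].strip()
    # A closed reasoning block followed by nothing also means no answer arrived.
    return {"reasoning": reasoning, "answer": answer, "truncated": answer == ""}
```

A caller could re-run a prompt with a larger `max_new_tokens` whenever `truncated` is true, which separates genuine wrong answers from answers that simply never arrived.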

### For Instruction Following:

Fine-tune on **FLAN** or **Alpaca** instruction datasets. This failure is expected for a base model; approximately 50,000 instruction-following examples would address it.

### For Math Across Scripts:

Fine-tune on **MGSM** (Multilingual Grade School Math), which covers math in 10+ languages. Around 5,000–10,000 examples covering non-Latin scripts.

---

## Estimated Dataset Sizes Needed

| Error Type | Suggested Dataset | Estimated Size Needed |
|---|---|---|
| Output truncation | Natural Questions / TriviaQA | ~20,000 examples (increase `max_new_tokens` first) |
| Cross-language confusion | FLORES-200 | ~10,000–50,000 per language |
| Factual hallucinations | MMLU / TriviaQA | ~20,000–50,000 examples |
| Instruction following | FLAN / Alpaca | ~50,000 examples |
| Math across scripts | MGSM multilingual | ~5,000–10,000 examples |
| Low-resource language loops | mC4 / CC-100 | ~10,000 per language |
| Complex reasoning gaps | GSM8K / CommonsenseQA | ~10,000–20,000 examples |
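
The Results Summary figures can be recomputed from the `category` and `passed` fields. The sketch below is illustrative only: `summarize` is a hypothetical helper, and the rows are synthetic stand-ins that mirror the per-category counts in the table rather than the dataset's actual records or category strings. It also shows that 31 of 65 is about 47.7%, which the tables report truncated to 47%.

```python
from collections import defaultdict


def summarize(rows):
    """Return {category: (tests, passed, failed, percent)} plus a "Total" entry.

    Hypothetical helper; `rows` is any iterable of dicts with `category` and `passed`.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [tests, passed]
    for row in rows:
        counts[row["category"]][0] += 1
        counts[row["category"]][1] += int(row["passed"])

    def entry(tests, passed):
        percent = 100.0 * passed / tests if tests else 0.0
        return (tests, passed, tests - passed, percent)

    summary = {cat: entry(t, p) for cat, (t, p) in counts.items()}
    total_tests = sum(t for t, _ in counts.values())
    total_passed = sum(p for _, p in counts.values())
    summary["Total"] = entry(total_tests, total_passed)
    return summary


# Synthetic rows mirroring the (tests, passed) counts in the Results Summary table.
# The category labels are illustrative, not the dataset's actual strings.
TABLE_COUNTS = {
    "Multilingual": (15, 3),
    "Math & Logic": (12, 6),
    "Commonsense & Science": (8, 4),
    "Code & Programming": (5, 3),
    "Complex Reasoning": (10, 6),
    "Instruction Following": (8, 4),
    "Factual Knowledge": (7, 5),
}
rows = [
    {"category": cat, "passed": i < passed}
    for cat, (tests, passed) in TABLE_COUNTS.items()
    for i in range(tests)
]
summary = summarize(rows)
for cat, (tests, passed, failed, percent) in summary.items():
    print(f"{cat}: {passed}/{tests} passed ({percent:.1f}%)")
```

With network access, the same function should work on the real records, for example those returned by `datasets.load_dataset("JackRabbit1122/nanbeige4-3b-base-blind-spots", split="train")`, since each record is a dict with these fields.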
Provider:
JackRabbit1122