nthomas123/gemma4-yoruba-blindspot
Source: Hugging Face. Updated 2026-04-07. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/nthomas123/gemma4-yoruba-blindspot
Description:
Model Tested: https://huggingface.co/google/gemma-4-E2B-it
I loaded the model from its Hugging Face page by clicking the “Use this model” button and selecting Google Colab, which provided a ready-made setup for running it. The model’s description states that it is multilingual, with training data that includes content in over 140 languages. This made me curious to test whether all languages were represented equally during training, especially less widely used languages compared with English.
I first asked the model a simple question in several languages: English, Swahili, Yoruba, Bengali, Nepali, and Tagalog. The model performed worst in Yoruba, so I decided to investigate its capabilities in that language further. Its responses were inconsistent: sometimes in Yoruba, sometimes switching to English, and often incoherent or repetitive.
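The failure modes described above (repetition and language switching) can be flagged roughly with a few lines of standard-library Python. This is a minimal sketch, not the method used in the original test: the function names, the 3-gram window, and the 0.3 threshold are all assumptions chosen for illustration, and the diacritic check is only a crude hint that a reply contains Yoruba-style letters.

```python
import unicodedata
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Return the fraction of word n-grams that repeat an earlier n-gram."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def has_dot_below(text: str) -> bool:
    """Crude hint: letters such as ẹ, ọ, ṣ contain U+0323 after NFD normalization."""
    return "\u0323" in unicodedata.normalize("NFD", text)


def flag_response(text: str, repeat_threshold: float = 0.3) -> dict:
    """Label one model reply; the threshold is an arbitrary assumption."""
    ratio = repeated_ngram_ratio(text)
    return {
        "repeat_ratio": ratio,
        "looks_repetitive": ratio > repeat_threshold,
        "has_yoruba_diacritics": has_dot_below(text),
    }
```

Calling `flag_response` on each reply gives a quick way to sort outputs for manual review. The diacritic check can miss Yoruba written without tone or dot-below marks, and it also fires on other languages that use a dot below (Vietnamese, for example), so it should not be treated as language identification.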
This raises a larger concern about disparity and discrimination in multilingual models, which perform well in high-resource languages and poorly in low-resource ones. It reflects a broader issue of LLMs perpetuating technological disparity.
One possible solution is to fine-tune the model on higher-quality multilingual data that includes low-resource languages like Yoruba. Such a dataset could be assembled from examples of conversational Yoruba; if not enough real examples are available, synthetic Yoruba data could be generated. At least several thousand to tens of thousands of high-quality examples per language would likely be needed to improve performance.
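Whether the examples are collected or synthetic, "high-quality" in practice usually starts with basic cleaning. The sketch below shows one possible first pass that drops very short and duplicate examples. The `prompt` and `response` field names, the `clean_examples` helper, and the `min_chars` cutoff are hypothetical and do not come from this dataset.

```python
def clean_examples(examples: list[dict], min_chars: int = 10) -> list[dict]:
    """Drop too-short and duplicate prompt/response pairs (illustrative only)."""
    seen = set()
    kept = []
    for example in examples:
        prompt = example.get("prompt", "").strip()
        response = example.get("response", "").strip()
        # Skip pairs where either side is too short to be a useful example.
        if len(prompt) < min_chars or len(response) < min_chars:
            continue
        # Compare case-insensitively so trivial variants count as duplicates.
        key = (prompt.casefold(), response.casefold())
        if key in seen:
            continue
        seen.add(key)
        kept.append({"prompt": prompt, "response": response})
    return kept
```

A filter like this only removes obvious noise; judging whether the Yoruba itself is fluent and correct would still need review by speakers of the language.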
In addition, the model asked me to confirm the language of the input only after multiple attempts. It did not recognize its own failure to comprehend the prompt and gave inaccurate or nonsensical answers. This is consistent with the Chinese Room argument, which holds that a machine may appear to know a language by following rules and patterns without comprehending its meaning. Similarly, the model did not comprehend Yoruba but produced output based on patterns in its training data. The result is AI hallucination: confident yet false or meaningless answers to the prompts. This raises a significant question about the comprehension level of language models and whether their answers represent genuine understanding or pattern matching. It also highlights a dangerous tendency among language models to give some answer rather than acknowledge that they cannot respond to a prompt, which could be particularly harmful in fields such as healthcare.
Here is the code used to load the model:
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("google/gemma-4-E2B-it")
model = AutoModelForImageTextToText.from_pretrained("google/gemma-4-E2B-it")
Provider:
nthomas123



