latam-gpt/CHOCLO
Hugging Face, updated 2026-03-30, indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/latam-gpt/CHOCLO
Resource description:
---
language:
- es
license: mit
task_categories:
- question-answering
task_ids:
- open-domain-qa
pretty_name: CHOCLO - Latin American Cultural Knowledge Benchmark
size_categories:
- 100K<n<1M
tags:
- culture
- latin-america
- benchmark
- knowledge
---
# 🌽 CHOCLO: Latin American Cultural Knowledge Benchmark
## Description
CHOCLO is a benchmark designed to evaluate cultural knowledge in language models, with a specific focus on entities representative of Latin America. Unlike traditional benchmarks, which often emphasize general knowledge or contexts dominated by English-language data, CHOCLO aims to capture the richness, diversity, and specificity of Latin American cultural knowledge, including traditions, gastronomy, public figures, geography, flora, fauna, and cultural artifacts.
The main objective of this benchmark is to analyze how accurately language models represent underrepresented cultures, and to identify patterns of omission, bias, or distortion in their responses. In this sense, CHOCLO goes beyond measuring accuracy to evaluate how models handle culturally grounded and localized knowledge.
---
## Name Origin and Cultural Identity
The name CHOCLO is a deliberate choice intended to embed Latin American cultural identity within the benchmark itself. "Choclo", the Andean term for maize, is a deeply rooted element in the history, cuisine, and traditions of the region, present across multiple countries and cultural contexts.
The benchmark adopts this name as a metaphor for the diversity it seeks to represent. Just as choclo takes on different meanings, uses, and forms depending on the country or community, Latin American cultural knowledge is diverse, contextual, and inherently non-homogeneous. This choice also moves away from neutral or anglicized naming conventions, reinforcing the idea that evaluating cultural knowledge requires acknowledging the context in which that knowledge is produced.
---
## Benchmark Construction
CHOCLO is built through a hybrid pipeline that combines structured knowledge sources with controlled generation using language models. The construction process integrates contextual information from Wikipedia and structured data from Wikidata, but goes beyond direct extraction by introducing a generative step to create structured representations of knowledge.
The process begins with the selection of culturally relevant entities from multiple Latin American countries. For each entity, a textual description is retrieved from Wikipedia and used as contextual input. This context is then provided to a language model (GPT-3.5), which generates knowledge in the form of structured triplets (subject, relation, object) using carefully designed prompts. These prompts enforce semantic constraints, formatting requirements, and contextual grounding, ensuring that the generated triplets reflect meaningful and coherent knowledge about the entity.
Once generated, the triplets undergo a multi-stage filtering process. This includes semantic validation, redundancy removal, and structural consistency checks. The goal of this stage is to ensure that only high-quality, non-contradictory, and evaluable triplets are retained as part of the benchmark.
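A minimal sketch of what the structural-consistency and redundancy-removal stages could look like, assuming triplets are plain `(subject, relation, object)` tuples of strings. The function names and the normalization rule are illustrative assumptions, not the authors' pipeline. Semantic validation is omitted because the card does not specify how it is performed.

```python
from typing import Iterable

Triplet = tuple[str, str, str]


def is_well_formed(triplet: Triplet) -> bool:
    """Structural check: exactly three non-empty string fields."""
    return len(triplet) == 3 and all(
        isinstance(part, str) and part.strip() for part in triplet
    )


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase and collapse whitespace so near-identical triplets compare equal."""
    return tuple(" ".join(part.lower().split()) for part in triplet)


def filter_triplets(triplets: Iterable[Triplet]) -> list[Triplet]:
    """Keep well-formed triplets, dropping duplicates after normalization."""
    seen: set[Triplet] = set()
    kept: list[Triplet] = []
    for triplet in triplets:
        if not is_well_formed(triplet):
            continue  # structural consistency check failed
        key = normalize(triplet)
        if key not in seen:  # redundancy removal
            seen.add(key)
            kept.append(triplet)
    return kept
```

Keeping the first surface form of each triplet, rather than the normalized one, preserves the original casing and accents for later question generation.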
Based on these validated triplets, a second generation stage is performed to create question-answer pairs. Questions are generated using structured prompts that explicitly enforce clarity, cultural relevance, answerability, and alignment with the underlying triplet. This ensures that each question directly corresponds to a specific piece of structured knowledge.
Finally, both triplets and questions are subjected to human validation. This step ensures factual correctness, cultural appropriateness, and contextual coherence, which is particularly important given the localized nature of the knowledge being represented.
This hybrid methodology positions CHOCLO as a benchmark that combines structured knowledge representation, controlled generation, and human validation, enabling a more reliable evaluation of cultural knowledge in language models.
---
## Dataset Structure
Each instance in CHOCLO represents a culturally grounded unit of knowledge and is composed of both structured and natural language components. Specifically, each entry includes an entity, its associated category, the country of origin, a generated question, the expected answer, the underlying knowledge triplets, and a difficulty level.
This structure allows CHOCLO to evaluate not only the correctness of model responses, but also the alignment between generated answers and structured cultural knowledge.
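One way to picture an instance is as a small typed record. The sketch below is an assumption based on the description above: the class and attribute names are hypothetical, and the triplet field in particular may not be present under that name (or at all) in the released files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChocloInstance:
    """Hypothetical in-memory view of one benchmark entry."""

    entity: str
    category: str    # e.g. "geography", "dish", "fauna"
    country: str
    question: str
    answer: str
    difficulty: str  # "FÁCIL", "INTERMEDIA" or "DIFÍCIL"
    # Underlying (subject, relation, object) triplets, if available.
    triplets: tuple[tuple[str, str, str], ...] = ()


example = ChocloInstance(
    entity="Tomina",
    category="geography",
    country="Bolivia",
    question="¿En qué departamento de Bolivia se encuentra el municipio de Tomina?",
    answer="En el departamento de Chuquisaca.",
    difficulty="FÁCIL",  # assumed for illustration; the card does not state it
)
```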
---
## Categories
CHOCLO covers a diverse set of culturally relevant categories. Each category is designed to capture a different dimension of cultural knowledge:
- **dish**: Traditional foods and beverages that are representative of a specific country or region.
- **tradition**: Cultural practices, celebrations, rituals, and customs.
- **public_figure**: Notable individuals such as artists, politicians, athletes, or historical figures.
- **geography**: Locations such as cities, regions, natural landmarks, and geographic entities.
- **flora**: Plant species that are culturally or regionally significant.
- **fauna**: Animal species associated with a specific cultural or geographic context.
- **object**: Cultural artifacts, tools, or objects with symbolic or traditional value.
These categories enable the benchmark to capture multiple facets of cultural knowledge, ranging from tangible elements (e.g., food, geography) to more abstract or contextual ones (e.g., traditions).
---
## Countries Covered
The benchmark includes entities from the following 18 Latin American countries:
Argentina, Bolivia, Chile, Colombia, Costa Rica, Cuba, Ecuador, El Salvador, Guatemala, Honduras, México, Nicaragua, Panamá, Paraguay, Perú, República Dominicana, Uruguay, and Venezuela.
This wide geographic coverage ensures representation across different regions of Latin America, allowing for a more comprehensive evaluation of cultural knowledge.
---
## Distribution
### Distribution by country
| Country | Count | Proportion | Percentage |
|:---------------------|--------:|-------------:|-------------:|
| Chile | 12546 | 0.11966 | 11.97 |
| México | 10081 | 0.0961496 | 9.61 |
| Argentina | 9908 | 0.0944996 | 9.45 |
| Bolivia | 9858 | 0.0940227 | 9.40 |
| Colombia | 9301 | 0.0887102 | 8.87 |
| Costa Rica | 9113 | 0.0869171 | 8.69 |
| Cuba | 7898 | 0.0753288 | 7.53 |
| Venezuela | 7146 | 0.0681565 | 6.82 |
| Perú | 6358 | 0.0606407 | 6.06 |
| Ecuador | 5381 | 0.0513224 | 5.13 |
| Paraguay | 4130 | 0.0393907 | 3.94 |
| Uruguay | 3192 | 0.0304444 | 3.04 |
| Guatemala | 2791 | 0.0266197 | 2.66 |
| Panamá | 1881 | 0.0179404 | 1.79 |
| El Salvador | 1807 | 0.0172346 | 1.72 |
| República Dominicana | 1490 | 0.0142112 | 1.42 |
| Nicaragua | 993 | 0.00947094 | 0.95 |
| Honduras | 973 | 0.00928019 | 0.93 |
### Distribution by category
| Category | Count | Proportion | Percentage |
|:--------------|--------:|-------------:|-------------:|
| geography | 45805 | 0.436875 | 43.69 |
| fauna | 22154 | 0.211298 | 21.13 |
| tradition | 11241 | 0.107213 | 10.72 |
| flora | 8074 | 0.0770074 | 7.70 |
| public_figure | 7986 | 0.0761681 | 7.62 |
| dish | 7088 | 0.0676033 | 6.76 |
| object | 2499 | 0.0238347 | 2.38 |
### Distribution by difficulty
| Difficulty | Count | Proportion | Percentage |
|:-------------|--------:|-------------:|-------------:|
| FÁCIL | 34886 | 0.332732 | 33.27 |
| INTERMEDIA | 34941 | 0.333257 | 33.33 |
| DIFÍCIL | 35020 | 0.334011 | 33.40 |
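
The Proportion and Percentage columns follow directly from the counts: the proportion is the count divided by the total number of instances (104,847, which is the sum of the counts in each of the three tables), and the percentage is that value times 100, rounded to two decimals. A short sketch using the difficulty counts from the table above (the function name is illustrative):

```python
def distribution(counts: dict[str, int]) -> dict[str, tuple[int, float, float]]:
    """Map each label to (count, proportion, percentage rounded to 2 decimals)."""
    total = sum(counts.values())
    return {
        label: (count, count / total, round(100 * count / total, 2))
        for label, count in counts.items()
    }


# Counts copied from the "Distribution by difficulty" table.
difficulty_counts = {"FÁCIL": 34886, "INTERMEDIA": 34941, "DIFÍCIL": 35020}
table = distribution(difficulty_counts)
total = sum(difficulty_counts.values())
```

The same function applies unchanged to the country and category counts.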
---
## Difficulty Levels
CHOCLO incorporates an explicit classification of question difficulty: **easy, intermediate, and hard** (labeled `FÁCIL`, `INTERMEDIA`, and `DIFÍCIL` in the distribution table). This classification is based on the type of knowledge required to answer each question and the level of reasoning involved.
- **Easy**: Questions that can be answered directly from a single triplet. These typically involve explicit and commonly known information, requiring straightforward retrieval.
- **Intermediate**: Questions that require some level of interpretation or reformulation of the knowledge. Although still grounded in structured triplets, they may involve indirect relations or contextual understanding.
- **Hard**: Questions that involve more specific, less frequent, or more complex knowledge. These may require higher precision, deeper contextual understanding, or implicit reasoning.
This multi-level difficulty design allows CHOCLO to evaluate not only what models know, but how they handle increasing levels of complexity in cultural knowledge.
---
## Performance by Difficulty
The benchmark enables analysis of model performance across different levels of difficulty. Results show that models generally perform better on easy questions, while intermediate and hard questions introduce greater variability.
Interestingly, intermediate questions can sometimes be more challenging than hard ones, suggesting that ambiguity and semantic interpretation play a key role in model performance, beyond simple knowledge retrieval.
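A per-difficulty comparison like the one described above can be computed by grouping per-question scores by difficulty label. The sketch below assumes each evaluated record is a dict with a `Difficulty` key (using the Spanish labels from the distribution table) and a numeric `score` key. The `score` key and the sample values are hypothetical and are not benchmark results.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_difficulty(records: list[dict]) -> dict[str, float]:
    """Average the per-question scores within each difficulty level."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for record in records:
        grouped[record["Difficulty"]].append(record["score"])
    return {level: mean(scores) for level, scores in grouped.items()}


# Toy records for illustration only.
toy = [
    {"Difficulty": "FÁCIL", "score": 1.0},
    {"Difficulty": "FÁCIL", "score": 0.5},
    {"Difficulty": "DIFÍCIL", "score": 0.25},
]
summary = mean_score_by_difficulty(toy)
```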
### Performance distribution
[Figure not available: performance distribution across difficulty levels.]
---
## Results
We evaluate multiple language models on the CHOCLO benchmark using a hybrid evaluation framework that combines lexical similarity, semantic similarity based on embeddings, and evaluation using a language model (LLM-as-a-judge). This approach allows us to capture both surface-level correctness and deeper semantic alignment.
The results show that GPT-4o Mini and GPT-3.5 Turbo achieve the highest overall performance, both reaching an average score of 0.48. These are followed by GPT-5 Mini with 0.46, Mistral with 0.45, and DeepSeek with 0.41.
A consistent pattern across all models is the gap between lexical similarity and semantic evaluation. While lexical scores remain relatively low (0.27–0.33), embedding-based scores (0.51–0.55) and LLM-judge scores (0.45–0.55) are higher. This indicates that models often capture the general meaning of an answer even when they fail to reproduce its exact wording.
Interestingly, performance improvements are not strictly monotonic across model generations. For example, GPT-5 Mini does not outperform GPT-4o Mini or GPT-3.5 Turbo in this benchmark. This suggests that cultural knowledge, particularly when localized, poses challenges that are not fully addressed by general improvements in model capability.
Overall, the results highlight that while modern language models demonstrate a reasonable understanding of cultural knowledge, they still struggle with precision, specificity, and contextual grounding in Latin American cultural domains.
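To make the three-part scoring concrete, here is a minimal sketch. It assumes token-level F1 as the lexical measure, cosine similarity between precomputed embedding vectors as the semantic measure, and an unweighted mean of the three components as the overall score. The card does not specify the exact metrics, embedding model, judge prompt, or weighting, so every function name and design choice below is an assumption.

```python
import math
from collections import Counter


def lexical_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two lowercased, whitespace-tokenized strings."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_score(lexical: float, embedding: float, llm_judge: float) -> float:
    """Unweighted mean of the three component scores (an assumed aggregation)."""
    return (lexical + embedding + llm_judge) / 3
```

Under these assumptions, `lexical_f1("Sebastián Mancilla Olivares fue ingeniero.", "actor")` is 0.0 because the two strings share no tokens, which illustrates why lexical scores alone understate or miss semantic closeness. Applied to the rounded GPT-3.5 Turbo components reported in this section (0.33, 0.55, 0.55), `hybrid_score` gives about 0.477.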
### Model Performance
| Model | Lexical | Embedding | LLM Score | Average |
|----------------|--------|-----------|-----------|---------|
| GPT-4o Mini | 0.32 | 0.54 | 0.55 | 0.48 |
| GPT-3.5 Turbo | 0.33 | 0.55 | 0.55 | 0.48 |
| GPT-5 Mini | 0.29 | 0.53 | 0.54 | 0.46 |
| Mistral | 0.30 | 0.54 | 0.52 | 0.45 |
| DeepSeek | 0.27 | 0.51 | 0.45 | 0.41 |
---
## Examples
CHOCLO includes real examples extracted from model evaluations, illustrating both correct and incorrect responses. These cases highlight how models handle culturally grounded knowledge, including their ability to retrieve accurate information, as well as common failure modes such as generalization, misclassification, or lack of specificity.
### Correct Example
**Entity:** Tomina
**Country:** Bolivia
**Category:** geography
**Question:**
¿En qué departamento de Bolivia se encuentra el municipio de Tomina?
**Model Answer:**
En el departamento de Chuquisaca.
**Expected Answer:**
En el departamento de Chuquisaca.
✔ The model correctly identifies the geographic location with precise and accurate information, demonstrating proper retrieval of structured cultural knowledge.
---
### Incorrect Example
**Entity:** Sebastián Mancilla Olivares
**Country:** Chile
**Category:** public_figure
**Question:**
¿Qué profesión tuvo Sebastián Mancilla Olivares?
**Model Answer:**
Sebastián Mancilla Olivares fue ingeniero.
**Expected Answer:**
actor
✘ The model generates an incorrect profession, likely due to hallucination or confusion with unrelated entities, reflecting limitations in factual precision.
---
### Partially Correct Example
**Entity:** Cerros de la Calera
**Country:** Uruguay
**Category:** geography
**Question:**
¿Qué tipo de lugar es Cerros de la Calera?
**Model Answer:**
Es un área natural protegida.
**Expected Answer:**
una localidad uruguaya del departamento de Rivera
⚠ The model captures a plausible geographic interpretation but fails to provide the correct classification, showing partial understanding but lack of specificity.
---
## Usage
The CHOCLO dataset is provided in two formats:
- `BenchmarkCHOCLO.csv`: recommended for use with machine learning pipelines and the Hugging Face `datasets` library.
- `BenchmarkCHOCLO.xlsx`: provided for readability and manual inspection.
To load the dataset using Hugging Face:
```python
from datasets import load_dataset
# Returns a DatasetDict; the CSV rows are in its "train" split.
dataset = load_dataset("latam-gpt/CHOCLO", data_files="BenchmarkCHOCLO.csv")
```
Each instance includes the following fields:
- `Entity`
- `Country`
- `Category`
- `Difficulty`
- `Question`
- `Answer`
This structure allows for evaluation of cultural knowledge across multiple dimensions, including factual accuracy, semantic understanding, and contextual relevance.
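Once loaded, rows can be narrowed to a slice of interest. The helper below is a sketch that works on any iterable of dict-like rows carrying the field names listed above (a `datasets.Dataset` yields such dicts when iterated). The helper name and the toy rows are hypothetical.

```python
from typing import Iterable, Iterator, Optional


def select_rows(
    rows: Iterable[dict],
    country: Optional[str] = None,
    category: Optional[str] = None,
    difficulty: Optional[str] = None,
) -> Iterator[dict]:
    """Yield rows matching every filter that is not None."""
    wanted = {"Country": country, "Category": category, "Difficulty": difficulty}
    for row in rows:
        if all(value is None or row.get(key) == value for key, value in wanted.items()):
            yield row


# Toy rows for illustration; real rows come from the loaded dataset.
toy_rows = [
    {"Entity": "Tomina", "Country": "Bolivia", "Category": "geography"},
    {"Entity": "Cerros de la Calera", "Country": "Uruguay", "Category": "geography"},
]
bolivia = list(select_rows(toy_rows, country="Bolivia"))
```

With the Hugging Face object, the equivalent call would be `select_rows(dataset["train"], country="Bolivia")`, assuming the CSV columns carry exactly the names listed above.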
---
## Contact
For questions, collaborations, or academic use, please contact:
- **Bianca Del Solar**
bsdelsolar@uc.cl
- **Andrés Carvallo**
andrescarvallo.de@gmail.com
---
## Authorship
CHOCLO is part of the doctoral research work of **Bianca Del Solar**, developed in the context of her PhD studies. The benchmark is proposed as a contribution to the evaluation of cultural knowledge in language models, with a particular focus on Latin America.
This work has been developed with the collaboration of **Andrés Carvallo**, contributing to the construction, validation, and analysis of the benchmark. The research has been conducted under the supervision of **Álvaro Soto**, who serves as the thesis advisor.
The authorship of the benchmark remains with its creators, highlighting both its academic origin and its collaborative development.
## Citation
If you use CHOCLO in your research, please cite:
```bibtex
@dataset{choclo2026,
title={CHOCLO: Latin American Cultural Knowledge Benchmark},
author={Del Solar, Bianca and Carvallo, Andrés and Soto, Álvaro},
year={2026},
publisher={Hugging Face},
note={Developed as part of doctoral research under the supervision of Álvaro Soto}
}
```
Provided by:
latam-gpt



