GPaolo/TerraLingua
Source: Hugging Face. Updated 2026-03-19; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/GPaolo/TerraLingua
Description:
---
license: apache-2.0
task_categories:
- text-generation
- text-classification
language:
- en
tags:
- agent-based simulation
- language emergence
- cultural evolution
- multi-agent systems
- LLM agents
- social simulation
size_categories:
- 1B<n<10B
---
# TerraLingua

This dataset was generated by the TerraLingua multi-agent system to study the emergence of language, culture, and social structure among LLM-powered agents. Agents with personality traits compete for resources, communicate through persistent text artifacts, and form communities over thousands of timesteps.
The dataset includes raw simulation logs, full LLM reasoning traces, behavioral annotations generated by an AI-Anthropologist, and linguistic complexity metrics for artifacts.
The overview of the TerraLingua system and of the AI-Anthropologist is shown in the figure below.

- Paper: [ResearchGate](https://www.researchgate.net/publication/402263491_TerraLingua_Emergence_and_Analysis_of_Open-endedness_in_LLM_Ecologies), [arXiv](https://arxiv.org/abs/2603.16910)
- Code: https://github.com/cognizant-ai-lab/terralingua
- Dataset dashboard: https://aianthropology.decisionai.ml/
## Dataset Summary
- **Total size**: ~4.7 GB
- **Experiments**: 40 (8 conditions × 5 repetitions)
- **Agent model**: DeepSeek-R1-32B
- **Annotation models**: Claude Sonnet 4.5 (agent & community annotations, novelty scoring), Claude Haiku 4.5 (artifact phylogeny)
- **Grid**: 50×50, up to 3,000 timesteps per run
- **Initial agents per run**: 20 (with reproduction)
## Experimental Conditions
Each condition isolates one variable against a core baseline. All conditions are run 5 times (repetitions 1–5).
| Condition | Key change | Research question |
|---|---|---|
| `core_exp` | Baseline (max_history=1, no artifact cost) | Baseline language emergence |
| `long_memory_exp` | max_history=20 | Effect of extended memory on communication |
| `abundant_exp` | init_food=100, max_history=20 | Effect of resource abundance on artifact creation |
| `artifact_cost_exp` | artifact_creation_cost=10 | Effect of cost constraints on cultural production |
| `creative_exp` | exogenous_motivation=creative | Effect of creative incentives |
| `inert_artifacts_exp` | inert_artifacts=True | Effect of removing artifact utility |
| `no_motivation_exp` | exogenous_motivation=none | Effect of removing exogenous motivation |
| `no_personality_exp` | genome=no_traits | Effect of removing personality variation |
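
Because run directories follow the `{condition}_{rep}` pattern, the 40 experiments can be enumerated from the condition names in the table above. The following is a minimal sketch; the `experiment_dirs` helper and the default `data` root are illustrative assumptions, not part of the dataset tooling.

```python
from pathlib import Path

# Condition names as listed in the table above.
CONDITIONS = [
    "core_exp", "long_memory_exp", "abundant_exp", "artifact_cost_exp",
    "creative_exp", "inert_artifacts_exp", "no_motivation_exp", "no_personality_exp",
]
REPETITIONS = range(1, 6)  # repetitions 1-5


def experiment_dirs(root="data"):
    """Return the expected run directories (hypothetical helper; does not touch disk)."""
    return [Path(root) / f"{cond}_{rep}" for cond in CONDITIONS for rep in REPETITIONS]


runs = experiment_dirs()
print(len(runs))           # 8 conditions x 5 repetitions
print(runs[0].as_posix())  # first run directory
```

Building the list from names rather than globbing the disk makes it easy to spot runs that are missing from a partial download.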
## Dataset Structure
```
data/
├── tags.json                                   # Annotation vocabulary (71 tags across 6 categories)
└── {condition}_{rep}/                          # e.g., core_exp_1/
    ├── params.json                             # Full experiment configuration
    ├── video.mp4                               # Simulation video recording
    ├── open_gridworld.log                      # JSONL environment event stream
    ├── graph.pkl                               # NetworkX agent interaction graph
    ├── agent_trajectories.pkl                  # Per-agent (x,y) position history
    ├── agent_events.json                       # Per-agent birth/death/action summary
    ├── agent_names.json                        # Agent tag → display name mapping
    ├── artifacts.json                          # All artifacts (active + expired)
    ├── messages.json                           # Per-timestep public messages
    ├── food_counts.json                        # Total food count time series
    ├── communities.json                        # Community → agent membership
    ├── agent_logs/
    │   ├── being{N}.jsonl                      # Step-by-step LLM reasoning + actions
    │   └── being{N}_genome.json                # Personality trait profile (8 traits)
    ├── annotations/
    │   ├── being{N}.json                       # Claude Sonnet 4.5 agent annotations
    │   ├── anthropologist_notes.json           # Free-form per-agent analyses
    │   ├── token_usage.jsonl                   # API token costs
    │   ├── audits/                             # Annotation audit verdicts
    │   └── raw_annotations/                    # Pre-audit annotation snapshots
    ├── community_annotations/
    │   ├── community_{N}.json                  # Community-level annotations
    │   ├── anthropologist_notes.json           # Free-form per-community analyses
    │   ├── token_counts.jsonl
    │   ├── audits/
    │   └── raw_annotations/
    └── artifact_analysis/
        ├── artifacts_list.csv                  # Per-artifact complexity metrics
        ├── artifact_categories.json            # Artifact → semantic category (1–4)
        ├── artifact_metrics.pkl                # Population-level metric time series
        ├── artifact_phylogeny_mention.json     # Mention-based lineage
        ├── artifact_phylogeny_claude-haiku-4-5.json      # AI-generated phylogeny
        ├── processed_artifacts.pkl             # Artifacts + embeddings + metrics
        └── novelties_claude-sonnet-4-5-20250929.pkl      # AI novelty scores
```
## File Formats
### `agent_logs/being{N}.jsonl`
One JSON record per timestep the agent was alive:
```json
{
  "timestamp": 12,
  "agent_tag": "being0",
  "observation": {"visible_agents": [...], "messages": [...], "energy": 45.0},
  "internal_memory": "Took 10 energy from being1 at position (0,-2).",
  "available_actions": ["move", "take", "gift", "create_artifact", "reproduction"],
  "action": {
    "action": "gift",
    "params": {"target": "being3", "amount": 5},
    "reasoning": "...",
    "message": "..."
  }
}
```
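
Since each line is a standalone JSON record, per-agent statistics can be computed in a single pass. The sketch below counts the chosen actions and tracks energy over time. It parses two inline sample lines shaped like the record above rather than a real file; the sample values and the `summarize` helper name are illustrative assumptions.

```python
import json
from collections import Counter

# Two illustrative lines in the shape of being{N}.jsonl (made-up values, not real data).
SAMPLE_JSONL = """\
{"timestamp": 12, "agent_tag": "being0", "observation": {"energy": 45.0}, "action": {"action": "gift", "params": {"target": "being3", "amount": 5}}}
{"timestamp": 13, "agent_tag": "being0", "observation": {"energy": 40.0}, "action": {"action": "move", "params": {}}}
"""


def summarize(lines):
    """Count action types and collect (timestamp, energy) pairs from JSONL lines."""
    actions = Counter()
    energy = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        actions[record["action"]["action"]] += 1
        energy.append((record["timestamp"], record["observation"]["energy"]))
    return actions, energy


actions, energy = summarize(SAMPLE_JSONL.splitlines())
print(dict(actions))
print(energy)
```

For a real log, the same function can be fed an open file handle, since iterating over a text file yields one line at a time.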
### `agent_logs/being{N}_genome.json`
```json
{
  "honesty": -0.185, "neuroticism": -0.785, "extraversion": -0.342,
  "agreeableness": -0.824, "conscientiousness": 0.242, "openness": 0.830,
  "dominance": -0.618, "fertility": 0.625
}
```
### `annotations/being{N}.json`
```json
{
  "events": [{"event": "EXCHANGE", "timesteps": [12, 50], "confidence": 0.9, "description": "...", "reference": "..."}],
  "behaviors": [{"behavior": "ALTRUISM", "time_span": [10, 100], "confidence": 0.85, "description": "..."}],
  "comment": "One-sentence agent summary.",
  "emergence": {"keywords": ["altruism", "reciprocity"], "comment": "..."},
  "anthropologist": "Free-form qualitative analysis paragraph."
}
```
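
Each event and behavior carries a `confidence` value, so a natural first step is to keep only the labels above some threshold. The sketch below works on an inline record shaped like the example above; the `confident_labels` helper and the 0.88 threshold are arbitrary illustrative choices, not values recommended by the dataset authors.

```python
# Sample annotation in the shape shown above (values are illustrative).
annotation = {
    "events": [{"event": "EXCHANGE", "timesteps": [12, 50], "confidence": 0.9}],
    "behaviors": [{"behavior": "ALTRUISM", "time_span": [10, 100], "confidence": 0.85}],
}


def confident_labels(annotation, threshold=0.88):
    """Return event and behavior labels whose confidence meets the threshold."""
    events = [
        e["event"] for e in annotation.get("events", []) if e["confidence"] >= threshold
    ]
    behaviors = [
        b["behavior"] for b in annotation.get("behaviors", []) if b["confidence"] >= threshold
    ]
    return events, behaviors


events, behaviors = confident_labels(annotation)
print(events, behaviors)
```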
### `artifact_analysis/artifacts_list.csv`
| column | description |
|---|---|
| `tag` | Artifact index |
| `creation_time` | Timestep of creation |
| `name` | Artifact name |
| `payload` | Text content |
| `llm_novelty` | LLM-assigned novelty score |
| `LMSurprisal` | Language model surprisal |
| `CompressedSize` | Byte length after compression |
| `InverseCompressionRate` | Compression efficiency (0–1) |
| `SyntacticDepth` | Parse tree depth |
| `LexicalSophistication` | Vocabulary complexity |
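
Because every row carries a `creation_time`, the complexity columns can be turned into a coarse time series by averaging within fixed-width windows. The sketch below uses only the standard library and an inline sample with a subset of the columns; the sample rows, the `mean_by_window` helper, and the window width of 100 timesteps are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Illustrative rows using a subset of the columns above (made-up values).
SAMPLE_CSV = """tag,creation_time,name,payload,SyntacticDepth
0,5,note,hello there,2
1,40,map,food lies to the north,4
2,120,rule,share food with neighbours who share with you,6
3,130,rule2,do not take from those who gift,5
"""


def mean_by_window(rows, column, window=100):
    """Average `column` over artifacts grouped into fixed-width creation-time windows."""
    buckets = defaultdict(list)
    for row in rows:
        start = (int(row["creation_time"]) // window) * window
        buckets[start].append(float(row[column]))
    return {start: mean(values) for start, values in sorted(buckets.items())}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
depth_trend = mean_by_window(rows, "SyntacticDepth")
print(depth_trend)
```

The same grouping could be written with pandas by binning `creation_time` and calling `groupby`; the standard-library version is shown only to keep the example dependency-free.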
### Agent naming convention
Initial agents are named `beingN`. Offspring are named by appending `_K` to the parent's name, where K is the zero-based offspring index. For example, `being9_0_2` is the third offspring of `being9_0`, which is the first offspring of `being9`.
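
Under this convention an agent's full ancestry can be recovered from its name alone. A minimal sketch follows; the `lineage` and `parent` helpers are hypothetical names, and the code assumes names contain no underscores other than the offspring separators.

```python
def lineage(agent_tag):
    """Return the chain of ancestors from the founding agent down to `agent_tag`."""
    root, *offspring = agent_tag.split("_")
    chain = [root]
    for index in offspring:
        chain.append(f"{chain[-1]}_{index}")
    return chain


def parent(agent_tag):
    """Return the parent's tag, or None for an initial agent."""
    chain = lineage(agent_tag)
    return chain[-2] if len(chain) > 1 else None


print(lineage("being9_0_2"))  # ['being9', 'being9_0', 'being9_0_2']
print(parent("being9_0_2"))   # being9_0
print(parent("being9"))       # None
```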
## Annotation Tags
`tags.json` defines 71 tags across 6 categories used in agent and community annotations:
| Category | Example tags |
|---|---|
| `agent_events` | REPRODUCTION, KILL, ARTIFACT_CREATED, EXCHANGE, DECEPTION |
| `agent_behavior` | FORAGING, ALTRUISM, RECIPROCITY, TOOL_USE, EXPLORATION |
| `agent_emergence` | recorder, specialization, creativity, strategic_planning |
| `group_behavior` | COORDINATION, DOMINANCE_HIERARCHY, COLLECTIVE_TERRITORIALITY |
| `group_events` | COALITION_FORMED, LEADER_DECLARED, SIGNAL_ALIGNMENT |
| `group_emergence` | cultural_norms, economy, division_of_labor, collective_memory |
## Loading the Data
```python
import json
import pickle

import jsonlines  # third-party package: pip install jsonlines
import numpy as np  # numpy must be installed to unpickle the embeddings
import pandas as pd

# Load agent events for one experiment
with open("data/core_exp_1/agent_events.json") as f:
    agent_events = json.load(f)

# Load artifact complexity metrics
df = pd.read_csv("data/core_exp_1/artifact_analysis/artifacts_list.csv")

# Load agent step-by-step logs (one JSON record per line)
with jsonlines.open("data/core_exp_1/agent_logs/being0.jsonl") as reader:
    logs = list(reader)

# Load AI-generated phylogeny
with open("data/core_exp_1/artifact_analysis/artifact_phylogeny_claude-haiku-4-5.json") as f:
    phylogeny = json.load(f)  # {artifact_tag: {parent_tag: confidence}}

# Load processed artifacts with embeddings.
# pickle can execute arbitrary code, so only load files from a source you trust.
with open("data/core_exp_1/artifact_analysis/processed_artifacts.pkl", "rb") as f:
    artifacts = pickle.load(f)
```
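
The phylogeny file maps each artifact to its candidate parents with a confidence score, as the `{artifact_tag: {parent_tag: confidence}}` comment indicates. One way to reduce this to a single-parent tree is to keep the highest-confidence parent per artifact. The sketch below uses an inline mapping with made-up values; the `most_likely_parents` helper and the 0.5 cutoff are illustrative assumptions.

```python
# Illustrative mapping in the {artifact_tag: {parent_tag: confidence}} shape (made-up values).
# Keys are strings because JSON object keys are always strings.
sample_phylogeny = {
    "3": {"1": 0.8, "2": 0.4},
    "4": {"3": 0.6},
    "5": {},  # no inferred parent
}


def most_likely_parents(phylogeny, min_confidence=0.5):
    """Map each artifact to its highest-confidence parent, or None below the cutoff."""
    result = {}
    for artifact, parents in phylogeny.items():
        best = max(parents.items(), key=lambda item: item[1], default=None)
        result[artifact] = best[0] if best and best[1] >= min_confidence else None
    return result


best_parent = most_likely_parents(sample_phylogeny)
print(best_parent)
```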
## Exploring with the Dashboard
A Streamlit dashboard is available for interactive exploration:
```bash
pip install -r dashboard/requirements.txt
TL_DATA_ROOT=/path/to/data streamlit run dashboard/Dataset_Overview.py
```
## Citation
If you use this dataset, please cite the [TerraLingua paper](https://www.researchgate.net/publication/402263491_TerraLingua_Emergence_and_Analysis_of_Open-endedness_in_LLM_Ecologies).
```bibtex
@techreport{paolo26terralingua,
  title = "TerraLingua: Emergence and Analysis of Open-Endedness in LLM Ecologies",
  author = "Giuseppe Paolo and Jamieson Warner and Hormoz Shahrzad and Babak Hodjat and Risto Miikkulainen and Elliot Meyerson",
  year = 2026,
  month = jan,
  institution = "Cognizant AI Lab",
  url = "https://www.researchgate.net/publication/402263491_TerraLingua_Emergence_and_Analysis_of_Open-endedness_in_LLM_Ecologies",
  doi = "10.13140/RG.2.2.25551.55206",
  number = "2026-01",
}
```
## License
This dataset is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
Provider:
GPaolo



