tonyeh/chameleon-dataset
Hugging Face · Updated 2026-04-17 · Indexed 2026-04-26
Download link: https://hf-mirror.com/datasets/tonyeh/chameleon-dataset
---
license: cc-by-4.0
dataset_info:
configs:
- config_name: combined
  data_files:
  - split: train
    path: chameleon_profiles_combined.csv
- config_name: seance
  data_files:
  - split: train
    path: chameleon_profiles_seance.csv
- config_name: langextract
  data_files:
  - split: train
    path: chameleon_profiles_langextract.csv
---
# Chameleon: A Dataset of Contextual Psychological Profiles
## Overview
Chameleon is a dataset of **5,001 contextual psychological profiles** from **1,667 Reddit users**, each profiled in three different subreddits. Existing persona datasets treat a psychological profile as a fixed user attribute. Because Chameleon measures the same user in several contexts, psychological variance can be decomposed in a principled way into stable traits and contextual states.
The dataset accompanies the paper:
> **Beyond Fixed Psychological Personas: State Beats Trait, but Language Models are State-Blind**
> Tamunotonye Harry, Ivoline C. Ngong, Chima Nweke, Yuanyuan Feng, Joseph Near
> *Findings of the Association for Computational Linguistics: ACL 2026*
---
## Key Finding
Using Latent State-Trait (LST) theory and intraclass correlation coefficients, we find that **72–74% of psychological variance is within-person (state)** while only 26–28% is between-person (trait). Context shapes expressed psychology 2–3× more than stable individual differences.
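
As a rough illustration of this kind of decomposition, the sketch below computes a one-way random-effects ICC(1) for a single dimension from three scores per user. The choice of estimator, the function name `icc1`, and the toy scores are assumptions made for illustration; the paper's actual LST model is not reproduced here, and the toy numbers are unrelated to the reported 72–74% figure. Reading `1 - ICC` as the within-person (state) share is only meaningful when the estimate falls between 0 and 1.

```python
from statistics import mean

def icc1(groups):
    """One-way random-effects ICC(1) for equally sized groups.

    groups: list of per-user score lists, each of the same length k.
    Returns the estimated share of variance that is between-person (trait).
    """
    n = len(groups)
    k = len(groups[0])
    if n < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need >= 2 users with the same number (>= 2) of scores")
    grand = mean(x for g in groups for x in g)
    user_means = [mean(g) for g in groups]
    # Mean squares between and within users, as in a one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in user_means) / (n - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, user_means) for x in g) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Toy scores (hypothetical): one dimension, three subreddit contexts per user.
scores = [
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
trait_share = icc1(scores)
state_share = 1.0 - trait_share
print(f"trait (between-person): {trait_share:.2f}, state (within-person): {state_share:.2f}")
```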
---
## Files
| File | Description |
|------|-------------|
| `chameleon_profiles_combined.csv` | Fused SEANCE + LangExtract profiles (z-normalized mean). **Recommended for most use cases.** |
| `chameleon_profiles_seance.csv` | SEANCE-derived profiles (lexicon-based extraction) |
| `chameleon_profiles_langextract.csv` | LangExtract-derived profiles (GPT-4o semantic extraction) |

Each file contains **5,001 rows × 29 columns**: `post_id`, `user_id`, `subreddit`, and 26 psychological scale scores.

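
A minimal loading sketch using only the Python standard library follows. It assumes each CSV has a header row and no missing scores; the helper name `load_profiles` and the in-memory sample values are hypothetical. For a real file, pass an open handle such as `open("chameleon_profiles_combined.csv", newline="", encoding="utf-8")` and expect 5,001 rows and 26 score columns.

```python
import csv
import io

BASE_COLUMNS = ["post_id", "user_id", "subreddit"]

def load_profiles(fileobj):
    """Read one Chameleon CSV into a list of dicts, converting scores to float."""
    reader = csv.DictReader(fileobj)
    header = reader.fieldnames or []
    missing = [c for c in BASE_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing base columns: {missing}")
    # Every non-base column is treated as a psychological scale score.
    score_columns = [c for c in header if c not in BASE_COLUMNS]
    rows = []
    for raw in reader:
        row = {c: raw[c] for c in BASE_COLUMNS}
        row.update({c: float(raw[c]) for c in score_columns})
        rows.append(row)
    return rows, score_columns

# Tiny in-memory stand-in with two of the 26 score columns (values are made up).
sample = io.StringIO(
    "post_id,user_id,subreddit,bfi_extraversion,bfi_openness\n"
    "abc123,user_0000,AskReddit,0.12,-0.40\n"
    "def456,user_0000,relationships,-0.31,0.25\n"
)
rows, score_columns = load_profiles(sample)
print(len(rows), score_columns)
```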
---
## Schema
### Base Columns
| Column | Description |
|--------|-------------|
| `post_id` | Reddit post ID (maps to Webis-TLDR-17) |
| `user_id` | Pseudonymized author identifier (e.g., user_0000) |
| `subreddit` | Subreddit community (psychological context) |
### Psychological Scale Columns (26 total)
#### Big Five Inventory (BFI-44) — Scale: 1–5
| Column | Construct |
|--------|-----------|
| `bfi_extraversion` | Sociability, assertiveness |
| `bfi_agreeableness` | Cooperation, trust |
| `bfi_conscientiousness` | Organization, discipline |
| `bfi_neuroticism` | Emotional instability, anxiety |
| `bfi_openness` | Intellectual curiosity, creativity |
#### Schwartz Value Survey (SVS-57) — Scale: −1 to 7
| Column | Construct |
|--------|-----------|
| `svs_power` | Status, dominance |
| `svs_achievement` | Success, competence |
| `svs_hedonism` | Pleasure, enjoyment |
| `svs_stimulation` | Excitement, novelty |
| `svs_self_direction` | Independence, autonomy |
| `svs_universalism` | Social justice, tolerance |
| `svs_benevolence` | Caring for close others |
| `svs_tradition` | Cultural/religious customs |
| `svs_conformity` | Rule-following |
| `svs_security` | Safety, stability |
#### Self-Determination Theory (SDT) — Scale: 1–7
| Column | Construct |
|--------|-----------|
| `sdt_intrinsic_motivation` | Internal drive, curiosity |
| `sdt_extrinsic_motivation` | External rewards |
| `sdt_competence` | Feeling capable, effective |
| `sdt_autonomy` | Sense of choice, self-direction |
| `sdt_relatedness` | Social connection, belonging |
#### Domain-Specific Risk-Taking (DOSPERT-40) — Scale: 1–7
| Column | Construct |
|--------|-----------|
| `dospert_investment` | Financial risk-taking |
| `dospert_gambling` | Gambling propensity |
| `dospert_health_safety` | Health risk-taking |
| `dospert_recreational` | Physical risk-taking |
| `dospert_ethical` | Ethical boundary-pushing |
| `dospert_social` | Social risk-taking |
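
The schema above can be written down as a small lookup, which is handy for checking a file's header or flagging out-of-range values. This is a sketch: the names `SCALES`, `documented_range`, and `out_of_range` are hypothetical, and which files actually store scores on the documented raw ranges is an assumption. The combined file holds z-normalized means, so its values are not expected to fall inside these ranges.

```python
# Column names and scale ranges as listed in the schema tables.
SCALES = {
    "bfi": ((1, 5), ["extraversion", "agreeableness", "conscientiousness",
                     "neuroticism", "openness"]),
    "svs": ((-1, 7), ["power", "achievement", "hedonism", "stimulation",
                      "self_direction", "universalism", "benevolence",
                      "tradition", "conformity", "security"]),
    "sdt": ((1, 7), ["intrinsic_motivation", "extrinsic_motivation",
                     "competence", "autonomy", "relatedness"]),
    "dospert": ((1, 7), ["investment", "gambling", "health_safety",
                         "recreational", "ethical", "social"]),
}

SCALE_COLUMNS = [
    f"{prefix}_{name}" for prefix, (_, names) in SCALES.items() for name in names
]

def documented_range(column):
    """Return the (low, high) range documented for a scale column."""
    for prefix, (bounds, names) in SCALES.items():
        if column.startswith(prefix + "_") and column[len(prefix) + 1:] in names:
            return bounds
    raise KeyError(column)

def out_of_range(row):
    """List the scale columns in a row (dict) whose value is outside its range."""
    bad = []
    for column in SCALE_COLUMNS:
        if column in row:
            low, high = documented_range(column)
            if not low <= row[column] <= high:
                bad.append(column)
    return bad

print(len(SCALE_COLUMNS))
print(out_of_range({"bfi_openness": 4.2, "svs_power": 8.0}))
```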
---
## Reproducing the Sample from Webis-TLDR-17
The post text is not included in this dataset for privacy reasons (see Ethics section). To reproduce the exact sample:
**Source corpus:** [Webis-TLDR-17](https://huggingface.co/datasets/webis/tldr-17)
**Filtering criteria:**
- Users who posted across at least 3 distinct subreddits
- Exactly 3 posts randomly sampled per user from 3 different subreddits
- Minimum post length: 50 words
- English language content only
- Random seed: 42
**Sample size:** 1,667 users × 3 posts = 5,001 posts across 645 unique subreddits

To recover the post text, join Chameleon's `post_id` against the `id` field in Webis-TLDR-17.
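
The filtering criteria and the join can be sketched as follows. This is an illustration, not the authors' script: the field names `author`, `subreddit`, and `content` are assumptions about the corpus layout (only `id` is confirmed above), word counting by whitespace is an assumption, and the English-language filter is omitted. Because the original sampling order and random-number calls are not documented here, rerunning a sampler with seed 42 should not be expected to reproduce the exact 5,001 posts; joining on `post_id` is the reliable route.

```python
import random
from collections import defaultdict

def sample_user_posts(posts, seed=42, per_user=3, min_words=50):
    """Pick `per_user` posts per eligible user, each from a different subreddit."""
    rng = random.Random(seed)
    by_user = defaultdict(lambda: defaultdict(list))
    for post in posts:
        # Whitespace tokenization is an assumption about how words were counted.
        if len(post["content"].split()) >= min_words:
            by_user[post["author"]][post["subreddit"]].append(post)
    selected = []
    for author in sorted(by_user):
        subreddits = sorted(by_user[author])
        if len(subreddits) < per_user:
            continue  # user did not post in enough distinct subreddits
        for subreddit in rng.sample(subreddits, per_user):
            selected.append(rng.choice(by_user[author][subreddit]))
    return selected

def attach_text(profile_rows, posts):
    """Join Chameleon rows to corpus posts on post_id == id."""
    by_id = {p["id"]: p for p in posts}
    return [
        dict(row, content=by_id[row["post_id"]]["content"])
        for row in profile_rows
        if row["post_id"] in by_id
    ]

# Toy corpus (hypothetical): alice qualifies, bob has only two subreddits.
long_text = "word " * 60
posts = [
    {"id": f"p{i}", "author": author, "subreddit": sub, "content": text}
    for i, (author, sub, text) in enumerate([
        ("alice", "AskReddit", long_text),
        ("alice", "relationships", long_text),
        ("alice", "self", long_text),
        ("alice", "self", long_text),
        ("alice", "personalfinance", "too short"),
        ("bob", "AskReddit", long_text),
        ("bob", "self", long_text),
    ])
]
picked = sample_user_posts(posts)
print([(p["author"], p["subreddit"]) for p in picked])
joined = attach_text([{"post_id": "p0", "user_id": "user_0000"}], posts)
```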
---
## Extraction Pipeline
Profiles were extracted using a three-stage pipeline:
**Stage 1 — Feature Extraction:**
- *SEANCE* (Crossley et al., 2017): rule-based lexicon matching, 250+ linguistic indices
- *LangExtract* (2025): GPT-4o semantic pattern extraction
**Stage 2 — Scale Assessment:**
- GPT-4o prompted to respond to validated scale items as if it were the post's author, conditioned on extracted features
**Stage 3 — Normalization and Fusion:**
- Both methods z-normalized per dimension, then averaged to produce combined profiles
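
The normalization and fusion step can be sketched for a single dimension as below. The helper names `z_normalize` and `fuse` are hypothetical, and two details are assumptions not stated in this card: that standardization uses the population standard deviation, and that it is computed over all rows of a file. The scores are made up.

```python
from statistics import mean, pstdev

def z_normalize(values):
    """Standardize scores to mean 0 and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: no spread to scale by
    return [(v - mu) / sigma for v in values]

def fuse(seance_scores, langextract_scores):
    """Average the z-scores of the two methods, row by row, for one dimension."""
    if len(seance_scores) != len(langextract_scores):
        raise ValueError("both methods must score the same rows")
    z_a = z_normalize(seance_scores)
    z_b = z_normalize(langextract_scores)
    return [(a + b) / 2 for a, b in zip(z_a, z_b)]

# Made-up scores for one dimension over four posts.
seance = [2.0, 3.0, 4.0, 3.0]
langextract = [1.0, 5.0, 5.0, 1.0]
combined = fuse(seance, langextract)
print([round(c, 3) for c in combined])
```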
---
## Dataset Statistics
| Characteristic | Value |
|----------------|-------|
| Total posts | 5,001 |
| Unique users | 1,667 |
| Posts per user | 3 (by design) |
| Unique subreddits | 645 |
| Subreddits with n ≥ 10 posts | 41 |
| Psychological frameworks | 4 |
| Psychological dimensions | 26 |
| Extraction methods | 2 |

**Top subreddits by post count:** AskReddit (1,558), relationships (923), relationship_advice (268), offmychest (198), depression (129), dating_advice (76), self (53), personalfinance (49), SuicideWatch (43), AdviceAnimals (42)

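
These figures can be recomputed from any of the three files with a few counters. The sketch below assumes rows are dicts carrying the `user_id` and `subreddit` columns from the schema; the function name `summarize` and the toy rows are hypothetical.

```python
from collections import Counter

def summarize(rows, min_posts=10, top_n=10):
    """Recompute the headline statistics from a list of profile rows."""
    posts_per_subreddit = Counter(r["subreddit"] for r in rows)
    posts_per_user = Counter(r["user_id"] for r in rows)
    return {
        "total_posts": len(rows),
        "unique_users": len(posts_per_user),
        "unique_subreddits": len(posts_per_subreddit),
        "subreddits_with_min_posts": sum(
            1 for n in posts_per_subreddit.values() if n >= min_posts
        ),
        "posts_per_user": sorted(set(posts_per_user.values())),
        "top_subreddits": posts_per_subreddit.most_common(top_n),
    }

# Toy rows (made up): two users, three posts each.
toy_rows = [
    {"user_id": "user_0000", "subreddit": "AskReddit"},
    {"user_id": "user_0000", "subreddit": "relationships"},
    {"user_id": "user_0000", "subreddit": "self"},
    {"user_id": "user_0001", "subreddit": "AskReddit"},
    {"user_id": "user_0001", "subreddit": "depression"},
    {"user_id": "user_0001", "subreddit": "offmychest"},
]
stats = summarize(toy_rows, min_posts=2)
print(stats)
```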
---
## Ethics
This dataset constitutes secondary analysis of publicly available data from the Webis-TLDR-17 corpus (CC-BY-4.0), which contains Reddit posts made between 2006 and 2016.
- Usernames are pseudonymized — original Reddit usernames are not included
- Raw post text is not included to minimize privacy risks
- Profiles reflect psychological states expressed in text, not stable traits of individuals
- This research does not meet the federal definition of human subjects research under the Common Rule (45 CFR 46.102)
---
## Citation
```bibtex
@inproceedings{harry2026chameleon,
title = {Beyond Fixed Psychological Personas: State Beats Trait, but Language Models are State-Blind},
author = {Harry, Tamunotonye and Ngong, Ivoline C. and Nweke, Chima and Feng, Yuanyuan and Near, Joseph},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
year = {2026},
address = {San Diego, California}
}
```
---
## License
This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).
Provider: tonyeh



