3ebdola/wvs2persona
收藏Hugging Face2026-04-27 更新2026-05-03 收录
下载链接:
https://hf-mirror.com/datasets/3ebdola/wvs2persona
下载链接
链接失效反馈官方服务:
资源简介:
---
configs:
- config_name: Andorra
data_files:
- split: train
path: "data/andorra/andorra_personas.jsonl"
- config_name: Argentina
data_files:
- split: train
path: "data/argentina/argentina_personas.jsonl"
- config_name: Armenia
data_files:
- split: train
path: "data/armenia/armenia_personas.jsonl"
- config_name: Australia
data_files:
- split: train
path: "data/australia/australia_personas.jsonl"
- config_name: Bangladesh
data_files:
- split: train
path: "data/bangladesh/bangladesh_personas.jsonl"
- config_name: Bolivia
data_files:
- split: train
path: "data/bolivia/bolivia_personas.jsonl"
- config_name: Brazil
data_files:
- split: train
path: "data/brazil/brazil_personas.jsonl"
- config_name: Canada
data_files:
- split: train
path: "data/canada/canada_personas.jsonl"
- config_name: Chile
data_files:
- split: train
path: "data/chile/chile_personas.jsonl"
- config_name: China
data_files:
- split: train
path: "data/china/china_personas.jsonl"
- config_name: Colombia
data_files:
- split: train
path: "data/colombia/colombia_personas.jsonl"
- config_name: Cyprus
data_files:
- split: train
path: "data/cyprus/cyprus_personas.jsonl"
- config_name: Czechia
data_files:
- split: train
path: "data/czechia/czechia_personas.jsonl"
- config_name: Ecuador
data_files:
- split: train
path: "data/ecuador/ecuador_personas.jsonl"
- config_name: Egypt
data_files:
- split: train
path: "data/egypt/egypt_personas.jsonl"
- config_name: Ethiopia
data_files:
- split: train
path: "data/ethiopia/ethiopia_personas.jsonl"
- config_name: Germany
data_files:
- split: train
path: "data/germany/germany_personas.jsonl"
- config_name: Great_Britain
data_files:
- split: train
path: "data/great_britain/great_britain_personas.jsonl"
- config_name: Greece
data_files:
- split: train
path: "data/greece/greece_personas.jsonl"
- config_name: Guatemala
data_files:
- split: train
path: "data/guatemala/guatemala_personas.jsonl"
- config_name: Hong_Kong_Sar
data_files:
- split: train
path: "data/hong_kong_sar/hong_kong_sar_personas.jsonl"
- config_name: India
data_files:
- split: train
path: "data/india/india_personas.jsonl"
- config_name: Indonesia
data_files:
- split: train
path: "data/indonesia/indonesia_personas.jsonl"
- config_name: Iran
data_files:
- split: train
path: "data/iran/iran_personas.jsonl"
- config_name: Iraq
data_files:
- split: train
path: "data/iraq/iraq_personas.jsonl"
- config_name: Japan
data_files:
- split: train
path: "data/japan/japan_personas.jsonl"
- config_name: Jordan
data_files:
- split: train
path: "data/jordan/jordan_personas.jsonl"
- config_name: Kazakhstan
data_files:
- split: train
path: "data/kazakhstan/kazakhstan_personas.jsonl"
- config_name: Kenya
data_files:
- split: train
path: "data/kenya/kenya_personas.jsonl"
- config_name: Kyrgyzstan
data_files:
- split: train
path: "data/kyrgyzstan/kyrgyzstan_personas.jsonl"
- config_name: Lebanon
data_files:
- split: train
path: "data/lebanon/lebanon_personas.jsonl"
- config_name: Libya
data_files:
- split: train
path: "data/libya/libya_personas.jsonl"
- config_name: Macau_Sar
data_files:
- split: train
path: "data/macau_sar/macau_sar_personas.jsonl"
- config_name: Malaysia
data_files:
- split: train
path: "data/malaysia/malaysia_personas.jsonl"
- config_name: Maldives
data_files:
- split: train
path: "data/maldives/maldives_personas.jsonl"
- config_name: Mexico
data_files:
- split: train
path: "data/mexico/mexico_personas.jsonl"
- config_name: Mongolia
data_files:
- split: train
path: "data/mongolia/mongolia_personas.jsonl"
- config_name: Morocco
data_files:
- split: train
path: "data/morocco/morocco_personas.jsonl"
- config_name: Myanmar
data_files:
- split: train
path: "data/myanmar/myanmar_personas.jsonl"
- config_name: Netherlands
data_files:
- split: train
path: "data/netherlands/netherlands_personas.jsonl"
- config_name: New_Zealand
data_files:
- split: train
path: "data/new_zealand/new_zealand_personas.jsonl"
- config_name: Nicaragua
data_files:
- split: train
path: "data/nicaragua/nicaragua_personas.jsonl"
- config_name: Nigeria
data_files:
- split: train
path: "data/nigeria/nigeria_personas.jsonl"
- config_name: Northern_Ireland
data_files:
- split: train
path: "data/northern_ireland/northern_ireland_personas.jsonl"
- config_name: Pakistan
data_files:
- split: train
path: "data/pakistan/pakistan_personas.jsonl"
- config_name: Peru
data_files:
- split: train
path: "data/peru/peru_personas.jsonl"
- config_name: Philippines
data_files:
- split: train
path: "data/philippines/philippines_personas.jsonl"
- config_name: Puerto_Rico
data_files:
- split: train
path: "data/puerto_rico/puerto_rico_personas.jsonl"
- config_name: Romania
data_files:
- split: train
path: "data/romania/romania_personas.jsonl"
- config_name: Russia
data_files:
- split: train
path: "data/russia/russia_personas.jsonl"
- config_name: Serbia
data_files:
- split: train
path: "data/serbia/serbia_personas.jsonl"
- config_name: Singapore
data_files:
- split: train
path: "data/singapore/singapore_personas.jsonl"
- config_name: Slovakia
data_files:
- split: train
path: "data/slovakia/slovakia_personas.jsonl"
- config_name: South_Korea
data_files:
- split: train
path: "data/south_korea/south_korea_personas.jsonl"
- config_name: Taiwan_Roc
data_files:
- split: train
path: "data/taiwan_roc/taiwan_roc_personas.jsonl"
- config_name: Tajikistan
data_files:
- split: train
path: "data/tajikistan/tajikistan_personas.jsonl"
- config_name: Thailand
data_files:
- split: train
path: "data/thailand/thailand_personas.jsonl"
- config_name: Tunisia
data_files:
- split: train
path: "data/tunisia/tunisia_personas.jsonl"
- config_name: Turkey
data_files:
- split: train
path: "data/turkey/turkey_personas.jsonl"
- config_name: Ukraine
data_files:
- split: train
path: "data/ukraine/ukraine_personas.jsonl"
- config_name: United_States
data_files:
- split: train
path: "data/united_states/united_states_personas.jsonl"
- config_name: Uruguay
data_files:
- split: train
path: "data/uruguay/uruguay_personas.jsonl"
- config_name: Uzbekistan
data_files:
- split: train
path: "data/uzbekistan/uzbekistan_personas.jsonl"
- config_name: Venezuela
data_files:
- split: train
path: "data/venezuela/venezuela_personas.jsonl"
- config_name: Vietnam
data_files:
- split: train
path: "data/vietnam/vietnam_personas.jsonl"
- config_name: Zimbabwe
data_files:
- split: train
path: "data/zimbabwe/zimbabwe_personas.jsonl"
pretty_name: wvs2persona
language:
- en
tags:
- persona
- survey
- world-values-survey
- sociology
- culture
size_categories:
- 10K<n<100K
task_categories:
- text-generation
---
# WVS2Persona: Parsed World Values Survey (WVS) Wave 7 records into textual personas
<p align="center">
<img src="wvs2persona-header.png" alt="wvs2persona dataset overview">
</p>
## Dataset Description
This dataset contains respondent-level persona descriptions derived from the **World Values Survey (WVS) Wave 7** core questionnaire.
Each persona corresponds to **one individual survey record**. These are **not** cluster centroids, archetypes, or synthetic group summaries. The persona text is a deterministic natural-language rendering of the respondent's answers to the **core WVS questionnaire variables only**.
On this repo, the dataset is organized **by country as subsets/configs**. Each subset contains a single `train` split.
Only the following columns are released in the Hub version:
- `persona_id`
- `persona`
## Data Source
The underlying survey source is:
- **World Values Survey Wave 7**
The persona generation pipeline uses only the **core questionnaire** sections:
- Social Values, Attitudes & Stereotypes
- Happiness and Well-Being
- Social Capital, Trust & Organizational Membership
- Economic Values
- Corruption
- Migration
- Security
- Postmaterialist Index
- Science & Technology
- Religious Values
- Ethical Values and Norms
- Political Interest & Political Participation
- Political Culture & Political Regimes
- Demographics
No regional modules, contextual variables, or post-core thematic modules are used for the released personas.
## Methodology
The methodology followed to create these personas is aligned with the approach described in the following paper:
**NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities**
- ACL Anthology: [https://aclanthology.org/2025.emnlp-main.556/](https://aclanthology.org/2025.emnlp-main.556/)
In this dataset:
1. Each WVS respondent remains a separate record.
2. Only core questionnaire variables are used.
3. Structured survey values are decoded into human-readable labels.
4. A deterministic persona description is generated from the decoded responses.
A **concise / summarized persona version**, following the NileChat-style approach, will be provided in a future commit.
## Dataset Structure
### Fields
- `persona_id`:
Stable identifier for the persona record.
- `persona`:
A full English persona description grounded in the respondent's WVS Wave 7 core-variable responses.
### Example
```python
{
"persona_id": "MAR_504720001",
"persona": "This person lives in Morocco. She is female, was born in 1969, and is 52 years old. ..."
}
```
## Loading the Dataset
```python
from datasets import load_dataset
REPO_ID = "3ebdola/wvs2persona"
ds_morocco = load_dataset(REPO_ID, "Morocco", split="train")
print(ds_morocco)
print(ds_morocco[0])
```
Another example:
```python
from datasets import load_dataset
REPO_ID = "3ebdola/wvs2persona"
ds_egypt = load_dataset(REPO_ID, "Egypt", split="train")
print(ds_egypt[0]["persona_id"])
print(ds_egypt[0]["persona"])
```
## Subsets
The dataset currently includes **66 country subsets** and **97,220 personas** in total.
<details>
<summary>Subset list and row counts</summary>
| Subset | Rows |
|---|---:|
| Andorra | 1004 |
| Argentina | 1003 |
| Armenia | 1223 |
| Australia | 1813 |
| Bangladesh | 1200 |
| Bolivia | 2067 |
| Brazil | 1762 |
| Canada | 4018 |
| Chile | 1000 |
| China | 3036 |
| Colombia | 1520 |
| Cyprus | 1000 |
| Czechia | 1200 |
| Ecuador | 1200 |
| Egypt | 1200 |
| Ethiopia | 1230 |
| Germany | 1528 |
| Great Britain | 2609 |
| Greece | 1200 |
| Guatemala | 1229 |
| Hong Kong SAR | 2075 |
| India | 1692 |
| Indonesia | 3200 |
| Iran | 1499 |
| Iraq | 1200 |
| Japan | 1353 |
| Jordan | 1203 |
| Kazakhstan | 1276 |
| Kenya | 1266 |
| Kyrgyzstan | 1200 |
| Lebanon | 1200 |
| Libya | 1196 |
| Macau SAR | 1023 |
| Malaysia | 1313 |
| Maldives | 1039 |
| Mexico | 1741 |
| Mongolia | 1638 |
| Morocco | 1200 |
| Myanmar | 1200 |
| Netherlands | 2145 |
| New Zealand | 1057 |
| Nicaragua | 1200 |
| Nigeria | 1237 |
| Northern Ireland | 447 |
| Pakistan | 1995 |
| Peru | 1400 |
| Philippines | 1200 |
| Puerto Rico | 1127 |
| Romania | 1257 |
| Russia | 1810 |
| Serbia | 1046 |
| Singapore | 2012 |
| Slovakia | 1200 |
| South Korea | 1245 |
| Taiwan ROC | 1223 |
| Tajikistan | 1200 |
| Thailand | 1500 |
| Tunisia | 1208 |
| Turkey | 2415 |
| Ukraine | 1289 |
| United States | 2596 |
| Uruguay | 1000 |
| Uzbekistan | 1250 |
| Venezuela | 1190 |
| Vietnam | 1200 |
| Zimbabwe | 1215 |
</details>
## Intended Use
This dataset can be useful for:
- persona-based prompting and conditioning
- culture-aware or country-aware LLM experimentation
- evaluation of value-sensitive or socially grounded generation
- retrieval and few-shot selection by country-specific persona text
- downstream summarization or compression of long persona descriptions
## Limitations
- The personas are **generated textual summaries**, not verbatim respondent statements.
- They are grounded in survey answers, but should **not** be treated as complete biographies.
- The released text is in **English**, even when the original respondents come from non-English-speaking countries.
- Persona descriptions may reflect survey instrument limitations, response noise, or country-specific coding artifacts.
- Only the **core WVS variables** are used in this release.
## Citation
If you use this dataset, please cite the following paper:
```bibtex
@inproceedings{el-mekki-etal-2025-nilechat,
title = "{N}ile{C}hat: Towards Linguistically Diverse and Culturally Aware {LLM}s for Local Communities",
author = "El Mekki, Abdellah and
Atou, Houdaifa and
Nacar, Omer and
Shehata, Shady and
Abdul-Mageed, Muhammad",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.556/",
doi = "10.18653/v1/2025.emnlp-main.556",
pages = "10967--10991",
ISBN = "979-8-89176-332-6"
}
```
提供机构:
3ebdola



