five

3ebdola/wvs2persona

收藏
Hugging Face2026-04-27 更新2026-05-03 收录
下载链接:
https://hf-mirror.com/datasets/3ebdola/wvs2persona
下载链接
链接失效反馈
官方服务:
资源简介:
--- configs: - config_name: Andorra data_files: - split: train path: "data/andorra/andorra_personas.jsonl" - config_name: Argentina data_files: - split: train path: "data/argentina/argentina_personas.jsonl" - config_name: Armenia data_files: - split: train path: "data/armenia/armenia_personas.jsonl" - config_name: Australia data_files: - split: train path: "data/australia/australia_personas.jsonl" - config_name: Bangladesh data_files: - split: train path: "data/bangladesh/bangladesh_personas.jsonl" - config_name: Bolivia data_files: - split: train path: "data/bolivia/bolivia_personas.jsonl" - config_name: Brazil data_files: - split: train path: "data/brazil/brazil_personas.jsonl" - config_name: Canada data_files: - split: train path: "data/canada/canada_personas.jsonl" - config_name: Chile data_files: - split: train path: "data/chile/chile_personas.jsonl" - config_name: China data_files: - split: train path: "data/china/china_personas.jsonl" - config_name: Colombia data_files: - split: train path: "data/colombia/colombia_personas.jsonl" - config_name: Cyprus data_files: - split: train path: "data/cyprus/cyprus_personas.jsonl" - config_name: Czechia data_files: - split: train path: "data/czechia/czechia_personas.jsonl" - config_name: Ecuador data_files: - split: train path: "data/ecuador/ecuador_personas.jsonl" - config_name: Egypt data_files: - split: train path: "data/egypt/egypt_personas.jsonl" - config_name: Ethiopia data_files: - split: train path: "data/ethiopia/ethiopia_personas.jsonl" - config_name: Germany data_files: - split: train path: "data/germany/germany_personas.jsonl" - config_name: Great_Britain data_files: - split: train path: "data/great_britain/great_britain_personas.jsonl" - config_name: Greece data_files: - split: train path: "data/greece/greece_personas.jsonl" - config_name: Guatemala data_files: - split: train path: "data/guatemala/guatemala_personas.jsonl" - config_name: Hong_Kong_Sar data_files: - split: train path: "data/hong_kong_sar/hong_kong_sar_personas.jsonl" - config_name: India data_files: - split: train path: "data/india/india_personas.jsonl" - config_name: Indonesia data_files: - split: train path: "data/indonesia/indonesia_personas.jsonl" - config_name: Iran data_files: - split: train path: "data/iran/iran_personas.jsonl" - config_name: Iraq data_files: - split: train path: "data/iraq/iraq_personas.jsonl" - config_name: Japan data_files: - split: train path: "data/japan/japan_personas.jsonl" - config_name: Jordan data_files: - split: train path: "data/jordan/jordan_personas.jsonl" - config_name: Kazakhstan data_files: - split: train path: "data/kazakhstan/kazakhstan_personas.jsonl" - config_name: Kenya data_files: - split: train path: "data/kenya/kenya_personas.jsonl" - config_name: Kyrgyzstan data_files: - split: train path: "data/kyrgyzstan/kyrgyzstan_personas.jsonl" - config_name: Lebanon data_files: - split: train path: "data/lebanon/lebanon_personas.jsonl" - config_name: Libya data_files: - split: train path: "data/libya/libya_personas.jsonl" - config_name: Macau_Sar data_files: - split: train path: "data/macau_sar/macau_sar_personas.jsonl" - config_name: Malaysia data_files: - split: train path: "data/malaysia/malaysia_personas.jsonl" - config_name: Maldives data_files: - split: train path: "data/maldives/maldives_personas.jsonl" - config_name: Mexico data_files: - split: train path: "data/mexico/mexico_personas.jsonl" - config_name: Mongolia data_files: - split: train path: "data/mongolia/mongolia_personas.jsonl" - config_name: Morocco data_files: - split: train path: "data/morocco/morocco_personas.jsonl" - config_name: Myanmar data_files: - split: train path: "data/myanmar/myanmar_personas.jsonl" - config_name: Netherlands data_files: - split: train path: "data/netherlands/netherlands_personas.jsonl" - config_name: New_Zealand data_files: - split: train path: "data/new_zealand/new_zealand_personas.jsonl" - config_name: Nicaragua data_files: - split: train path: "data/nicaragua/nicaragua_personas.jsonl" - config_name: Nigeria data_files: - split: train path: "data/nigeria/nigeria_personas.jsonl" - config_name: Northern_Ireland data_files: - split: train path: "data/northern_ireland/northern_ireland_personas.jsonl" - config_name: Pakistan data_files: - split: train path: "data/pakistan/pakistan_personas.jsonl" - config_name: Peru data_files: - split: train path: "data/peru/peru_personas.jsonl" - config_name: Philippines data_files: - split: train path: "data/philippines/philippines_personas.jsonl" - config_name: Puerto_Rico data_files: - split: train path: "data/puerto_rico/puerto_rico_personas.jsonl" - config_name: Romania data_files: - split: train path: "data/romania/romania_personas.jsonl" - config_name: Russia data_files: - split: train path: "data/russia/russia_personas.jsonl" - config_name: Serbia data_files: - split: train path: "data/serbia/serbia_personas.jsonl" - config_name: Singapore data_files: - split: train path: "data/singapore/singapore_personas.jsonl" - config_name: Slovakia data_files: - split: train path: "data/slovakia/slovakia_personas.jsonl" - config_name: South_Korea data_files: - split: train path: "data/south_korea/south_korea_personas.jsonl" - config_name: Taiwan_Roc data_files: - split: train path: "data/taiwan_roc/taiwan_roc_personas.jsonl" - config_name: Tajikistan data_files: - split: train path: "data/tajikistan/tajikistan_personas.jsonl" - config_name: Thailand data_files: - split: train path: "data/thailand/thailand_personas.jsonl" - config_name: Tunisia data_files: - split: train path: "data/tunisia/tunisia_personas.jsonl" - config_name: Turkey data_files: - split: train path: "data/turkey/turkey_personas.jsonl" - config_name: Ukraine data_files: - split: train path: "data/ukraine/ukraine_personas.jsonl" - config_name: United_States data_files: - split: train path: "data/united_states/united_states_personas.jsonl" - config_name: Uruguay data_files: - split: train path: "data/uruguay/uruguay_personas.jsonl" - config_name: Uzbekistan data_files: - split: train path: "data/uzbekistan/uzbekistan_personas.jsonl" - config_name: Venezuela data_files: - split: train path: "data/venezuela/venezuela_personas.jsonl" - config_name: Vietnam data_files: - split: train path: "data/vietnam/vietnam_personas.jsonl" - config_name: Zimbabwe data_files: - split: train path: "data/zimbabwe/zimbabwe_personas.jsonl" pretty_name: wvs2persona language: - en tags: - persona - survey - world-values-survey - sociology - culture size_categories: - 10K<n<100K task_categories: - text-generation --- # WVS2Persona: Parsed World Values Survey (WVS) Wave 7 records into textual personas <p align="center"> <img src="wvs2persona-header.png" alt="wvs2persona dataset overview"> </p> ## Dataset Description This dataset contains respondent-level persona descriptions derived from the **World Values Survey (WVS) Wave 7** core questionnaire. Each persona corresponds to **one individual survey record**. These are **not** cluster centroids, archetypes, or synthetic group summaries. The persona text is a deterministic natural-language rendering of the respondent's answers to the **core WVS questionnaire variables only**. On this repo, the dataset is organized **by country as subsets/configs**. Each subset contains a single `train` split. Only the following columns are released in the Hub version: - `persona_id` - `persona` ## Data Source The underlying survey source is: - **World Values Survey Wave 7** The persona generation pipeline uses only the **core questionnaire** sections: - Social Values, Attitudes & Stereotypes - Happiness and Well-Being - Social Capital, Trust & Organizational Membership - Economic Values - Corruption - Migration - Security - Postmaterialist Index - Science & Technology - Religious Values - Ethical Values and Norms - Political Interest & Political Participation - Political Culture & Political Regimes - Demographics No regional modules, contextual variables, or post-core thematic modules are used for the released personas. ## Methodology The methodology followed to create these personas is aligned with the approach described in the following paper: **NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities** - ACL Anthology: [https://aclanthology.org/2025.emnlp-main.556/](https://aclanthology.org/2025.emnlp-main.556/) In this dataset: 1. Each WVS respondent remains a separate record. 2. Only core questionnaire variables are used. 3. Structured survey values are decoded into human-readable labels. 4. A deterministic persona description is generated from the decoded responses. A **concise / summarized persona version**, following the NileChat-style approach, will be provided in a future commit. ## Dataset Structure ### Fields - `persona_id`: Stable identifier for the persona record. - `persona`: A full English persona description grounded in the respondent's WVS Wave 7 core-variable responses. ### Example ```python { "persona_id": "MAR_504720001", "persona": "This person lives in Morocco. She is female, was born in 1969, and is 52 years old. ..." } ``` ## Loading the Dataset ```python from datasets import load_dataset REPO_ID = "3ebdola/wvs2persona" ds_morocco = load_dataset(REPO_ID, "Morocco", split="train") print(ds_morocco) print(ds_morocco[0]) ``` Another example: ```python from datasets import load_dataset REPO_ID = "3ebdola/wvs2persona" ds_egypt = load_dataset(REPO_ID, "Egypt", split="train") print(ds_egypt[0]["persona_id"]) print(ds_egypt[0]["persona"]) ``` ## Subsets The dataset currently includes **66 country subsets** and **97,220 personas** in total. <details> <summary>Subset list and row counts</summary> | Subset | Rows | |---|---:| | Andorra | 1004 | | Argentina | 1003 | | Armenia | 1223 | | Australia | 1813 | | Bangladesh | 1200 | | Bolivia | 2067 | | Brazil | 1762 | | Canada | 4018 | | Chile | 1000 | | China | 3036 | | Colombia | 1520 | | Cyprus | 1000 | | Czechia | 1200 | | Ecuador | 1200 | | Egypt | 1200 | | Ethiopia | 1230 | | Germany | 1528 | | Great Britain | 2609 | | Greece | 1200 | | Guatemala | 1229 | | Hong Kong SAR | 2075 | | India | 1692 | | Indonesia | 3200 | | Iran | 1499 | | Iraq | 1200 | | Japan | 1353 | | Jordan | 1203 | | Kazakhstan | 1276 | | Kenya | 1266 | | Kyrgyzstan | 1200 | | Lebanon | 1200 | | Libya | 1196 | | Macau SAR | 1023 | | Malaysia | 1313 | | Maldives | 1039 | | Mexico | 1741 | | Mongolia | 1638 | | Morocco | 1200 | | Myanmar | 1200 | | Netherlands | 2145 | | New Zealand | 1057 | | Nicaragua | 1200 | | Nigeria | 1237 | | Northern Ireland | 447 | | Pakistan | 1995 | | Peru | 1400 | | Philippines | 1200 | | Puerto Rico | 1127 | | Romania | 1257 | | Russia | 1810 | | Serbia | 1046 | | Singapore | 2012 | | Slovakia | 1200 | | South Korea | 1245 | | Taiwan ROC | 1223 | | Tajikistan | 1200 | | Thailand | 1500 | | Tunisia | 1208 | | Turkey | 2415 | | Ukraine | 1289 | | United States | 2596 | | Uruguay | 1000 | | Uzbekistan | 1250 | | Venezuela | 1190 | | Vietnam | 1200 | | Zimbabwe | 1215 | </details> ## Intended Use This dataset can be useful for: - persona-based prompting and conditioning - culture-aware or country-aware LLM experimentation - evaluation of value-sensitive or socially grounded generation - retrieval and few-shot selection by country-specific persona text - downstream summarization or compression of long persona descriptions ## Limitations - The personas are **generated textual summaries**, not verbatim respondent statements. - They are grounded in survey answers, but should **not** be treated as complete biographies. - The released text is in **English**, even when the original respondents come from non-English-speaking countries. - Persona descriptions may reflect survey instrument limitations, response noise, or country-specific coding artifacts. - Only the **core WVS variables** are used in this release. ## Citation If you use this dataset, please cite the following paper: ```bibtex @inproceedings{el-mekki-etal-2025-nilechat, title = "{N}ile{C}hat: Towards Linguistically Diverse and Culturally Aware {LLM}s for Local Communities", author = "El Mekki, Abdellah and Atou, Houdaifa and Nacar, Omer and Shehata, Shady and Abdul-Mageed, Muhammad", editor = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet", booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing", month = nov, year = "2025", address = "Suzhou, China", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2025.emnlp-main.556/", doi = "10.18653/v1/2025.emnlp-main.556", pages = "10967--10991", ISBN = "979-8-89176-332-6" } ```
提供机构:
3ebdola
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作