
teias-ai/synthia

Source: Hugging Face | Updated: 2026-04-21 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/teias-ai/synthia
Resource description:
---
dataset_info:
  features:
  - name: index
    dtype: string
  - name: persona
    dtype: string
  - name: followings
    list: string
  - name: followers
    list: string
  - name: extractive_persona
    dtype: string
  - name: recency_persona
    dtype: string
  splits:
  - name: synthia_gemma_3_27b_it
    num_bytes: 17799982
    num_examples: 2848
  - name: synthia_phi_4_mini_instruct
    num_bytes: 16676667
    num_examples: 2848
  - name: anthology_llama_3_8b
    num_bytes: 6148856
    num_examples: 3000
  - name: anthology_gemma_3_27b
    num_bytes: 8470793
    num_examples: 3000
  - name: persona_chat
    num_bytes: 638593
    num_examples: 3000
  download_size: 29176298
  dataset_size: 49734891
configs:
- config_name: default
  data_files:
  - split: synthia_gemma_3_27b_it
    path: data/synthia_gemma_3_27b_it-*
  - split: synthia_phi_4_mini_instruct
    path: data/synthia_phi_4_mini_instruct-*
  - split: anthology_llama_3_8b
    path: data/anthology_llama_3_8b-*
  - split: anthology_gemma_3_27b
    path: data/anthology_gemma_3_27b-*
  - split: persona_chat
    path: data/persona_chat-*
task_categories:
- text-generation
language:
- en
tags:
- Persona
- Social-Media
- Role-Playing
- User-Modeling
- Survey-Simulation
- Agent
- Profiling
- Synthetic-Data
- Human-Simulation
- Opinion-Modeling
pretty_name: SYNTHIA
size_categories:
- 10K<n<100K
---

# Dataset Card for SYNTHIA

SYNTHIA (Synthetic Yet Naturally Tailored Human-Inspired PersonA) is a large-scale dataset of grounded personas generated from real-world social media data, paired with an underlying social interaction graph. It is designed to support research in persona-based simulation, computational social science, and LLM-driven population modeling.

### Dataset Description

Large language models are increasingly used to simulate human behavior, opinions, and social dynamics. However, existing persona datasets often suffer from a trade-off between authenticity (grounded in real human data) and scalability (synthetically generated at scale).

To address this, we introduce SYNTHIA, a dataset of high-fidelity personas generated by grounding LLM outputs in real social media posts. Each persona is a first-person narrative constructed from a user’s historical activity, while preserving privacy through extensive anonymization and filtering.

Unlike prior persona datasets, SYNTHIA:

- Grounds personas in real user-generated content rather than purely synthetic sampling
- Preserves the social network structure between users as a graph
- Enables network-aware analysis, including homophily and link prediction
- Is designed for population-level simulation, not individual reconstruction

- **Curated by:** Joint work of [Vahid Rahimzadeh](https://scholar.google.com/citations?user=CTiPTggAAAAJ&hl=en) & [Erfan Moosavi Monazzah](https://scholar.google.com/citations?user=243ygCwAAAAJ&hl=en)
- **Funded by:** [Tehran Institute for Advanced Studies (TeIAS)](https://teias.institute/)
- **Shared by:** [LLMs Lab @ TeIAS](https://teias.ai/)
- **Language(s) (NLP):** English (EN)

### Dataset Sources

- **Repository:** [teias-ai/synthia](https://huggingface.co/datasets/teias-ai/synthia)
- **Paper:** ACL Anthology (Coming Soon!) | [arXiv](https://arxiv.org/abs/2507.14922)

## Intended Uses

This dataset is intended for:

- Population-level simulations and survey response simulation
- Computational social science experiments
- Graph-based learning over social networks
- Studying bias, fairness, and representation in synthetic populations
- Training and evaluating persona-driven LLMs

## Dataset Structure

This dataset has the following splits:

| Split name | Description |
|-----------------------------|------------------------------------------------------------------------------------------------------------|
| synthia_gemma_3_27b_it | Original SYNTHIA data, used in the **Screening Stage** and **Detailed Analysis** sections of the paper |
| synthia_phi_4_mini_instruct | Original SYNTHIA data generated with a smaller model, used only in the **Screening Stage** section of the paper |
| anthology_llama_3_8b | Subset of data from the **Anthology** dataset. See their work [here](https://github.com/CannyLab/anthology) |
| anthology_gemma_3_27b | Re-run of the **Anthology** pipeline with the same model as **SYNTHIA**, to obtain a fair baseline |
| persona_chat | Subset of data from the **persona-chat** dataset. See their work [here](https://huggingface.co/datasets/AlekseyKorshuk/persona-chat) |

A sample row of the dataset:

```python
{
    'index': '02b0de5bd3d81bd1c22bc0ab518bcfca',  # str
    'persona': 'My twenties were… a lot...',  # str
    'followings': ['5bdcd696b9f20401e12332e75182b3c3', ..., '1361db0f2fea2ed3d3a8feae362aaeb9'],  # List[str]
    'followers': ['1361db0f2fea2ed3d3a8feae362aaeb9', ..., '5bdcd696b9f20401e12332e75182b3c3'],  # List[str]
    'extractive_persona': 'Since August. She’s 9mo I will 20...',  # str
    'recency_persona': 'I don’t even know how to be supportive',  # str
}
```

## Citation

**BibTeX:**

```
@misc{rahimzadeh2026synthiascalablegroundedpersona,
  title={Synthia: Scalable Grounded Persona Generation from Social Media Data},
  author={Vahid Rahimzadeh and Erfan Moosavi Monazzah and Mohammad Taher Pilehvar and Yadollah Yaghoobzadeh},
  year={2026},
  eprint={2507.14922},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.14922},
}
```
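Because each row carries `followings` and `followers` lists keyed by the same anonymized `index` values, the social graph mentioned in the card can be rebuilt from the rows themselves. The following is a minimal sketch, not the authors' pipeline. It assumes that `followings` holds the accounts a user follows and `followers` holds the accounts that follow the user; this is inferred from the field names only. The helper names `build_follow_edges` and `mutual_pairs` and the toy IDs `u1` to `u3` are hypothetical.

```python
# Minimal sketch: rebuild a directed follow graph from persona rows.
# Assumption (inferred from field names, not stated in the card):
#   followings = accounts this user follows
#   followers  = accounts that follow this user
# Real rows could presumably be loaded with the Hugging Face `datasets` library:
#   from datasets import load_dataset
#   rows = load_dataset("teias-ai/synthia", split="synthia_gemma_3_27b_it")


def build_follow_edges(rows):
    """Return a set of directed (follower, followee) edges."""
    edges = set()
    for row in rows:
        user = row["index"]
        # `or []` guards against a missing or null list.
        for followee in row.get("followings") or []:
            edges.add((user, followee))
        for follower in row.get("followers") or []:
            edges.add((follower, user))
    return edges


def mutual_pairs(edges):
    """Return unordered pairs of users who follow each other."""
    return {
        tuple(sorted((a, b)))
        for (a, b) in edges
        if a != b and (b, a) in edges
    }


# Toy rows with made-up IDs, mirroring the schema of the sample row.
rows = [
    {"index": "u1", "persona": "...", "followings": ["u2", "u3"], "followers": ["u2"]},
    {"index": "u2", "persona": "...", "followings": ["u1"], "followers": ["u1"]},
    {"index": "u3", "persona": "...", "followings": [], "followers": ["u1"]},
]

edges = build_follow_edges(rows)
print(sorted(edges))
print(sorted(mutual_pairs(edges)))
```

Storing edges in a set deduplicates relationships that appear twice, once in one user's `followings` and once in the other user's `followers`. The resulting edge set could then feed graph analyses such as the homophily and link-prediction studies listed under the intended uses.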
Provided by:
teias-ai