Tahn/Nemotron-Personas-Korea
Source: Hugging Face | Updated: 2026-04-26 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/Tahn/Nemotron-Personas-Korea
Description:
Nemotron-Personas-Korea is an open-source persona dataset (CC BY 4.0 license) synthesized from real-world demographic, geographic, and personality trait distributions of South Korea, designed to broadly reflect the diversity and characteristics of the South Korean population. As the first large-scale Korean-language persona dataset, it contains 1 million records with 7 million personas across 26 fields: 7 persona fields (e.g., professional, sports, arts, travel, culinary, family, and concise personas), 6 persona attribute fields (e.g., cultural background, skills and expertise, career goals and ambitions, hobbies and interests lists), 12 demographic and geographic contextual fields (e.g., sex, age, marital status, education level, occupation, region of residence), and a unique identifier. The dataset is synthesized from official statistics published by the Korean Statistical Information Service (KOSIS), the Supreme Court of Korea, the National Health Insurance Service, the Korea Rural Economic Institute, and NAVER Cloud. It was created with NeMo Data Designer, an enterprise-grade compound AI system for synthetic data generation that uses a proprietary probabilistic graphical model and the google/gemma-4-31B-it model. It supports South Korean model builders in developing Sovereign AI systems that incorporate region-specific demographics and cultural context. It can be used to expand the diversity of synthetic data, mitigate data and model bias, and improve the diversity of model responses, while more faithfully reflecting real population distributions across dimensions such as age (e.g., elderly populations), region (e.g., rural areas), education level, and occupation.
The dataset includes only personas of adult age (19 years and older under South Korean law), excludes other fields such as names and personality traits, and does not cover personas relevant to enterprise clients (e.g., finance, healthcare). All data is artificially generated, and any similarity to actual persons is purely coincidental.
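As a rough illustration of how the demographic fields described above might be inspected, the following Python sketch streams a small sample and tallies one field. It is a sketch under stated assumptions, not a documented workflow: the repository ID is taken from this listing, while the split name `train` and the column names `age` and `sex` are assumptions that should be checked against the real schema. The helpers `count_by_field` and `all_adults` are hypothetical names introduced only for this example.

```python
from collections import Counter
from typing import Iterable, Mapping

DATASET_ID = "Tahn/Nemotron-Personas-Korea"  # repository name from this listing
ADULT_AGE = 19  # minimum persona age stated in the description


def count_by_field(records: Iterable[Mapping], field: str) -> Counter:
    """Count how many records share each value of `field`."""
    return Counter(rec[field] for rec in records)


def all_adults(records: Iterable[Mapping], age_field: str = "age") -> bool:
    """Return True if every record meets the stated minimum age."""
    return all(int(rec[age_field]) >= ADULT_AGE for rec in records)


def main() -> None:
    # Requires `pip install datasets` and network access.
    from datasets import load_dataset

    # streaming=True avoids downloading all records up front.
    # The split name "train" is an assumption.
    stream = load_dataset(DATASET_ID, split="train", streaming=True)
    sample = list(stream.take(1000))

    # Inspect the real column names before relying on any of them.
    print(sorted(sample[0].keys()))

    # "sex" and "age" are assumed column names; skip if they are absent.
    if "sex" in sample[0]:
        print(count_by_field(sample, "sex"))
    if "age" in sample[0]:
        print("all adults:", all_adults(sample))


if __name__ == "__main__":
    main()
```

Because the helpers accept any iterable of dictionaries, they can be tried on a few hand-written records before touching the full dataset.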
Provider:
Tahn



