kiiop1/Nemotron-Personas-Korea

Name: kiiop1/Nemotron-Personas-Korea
Creator: kiiop1
Published: 2026-04-26 10:42:25
License: not specified on this listing (the description states CC BY 4.0)

Source: Hugging Face. Updated 2026-04-26; indexed 2026-05-03.

Download link:

https://hf-mirror.com/datasets/kiiop1/Nemotron-Personas-Korea

Description:

Nemotron-Personas-Korea is an open-source persona dataset (CC BY 4.0) synthesized from real-world demographic, geographic, and personality trait distributions of South Korea. It is designed to broadly reflect the diversity and characteristics of the South Korean population. As the first large-scale Korean-language persona dataset, it includes attributes such as name, sex, age, marital status, education level, occupation, and region of residence, all synthesized using official statistics from the Korean Statistical Information Service (KOSIS), the Supreme Court of Korea, the National Health Insurance Service, the Korea Rural Economic Institute, and NAVER Cloud.

The dataset supports South Korean model builders in developing Sovereign AI systems that incorporate important region-specific demographics and cultural context. It can be used to expand the diversity of synthetic data for sovereign AI model development, mitigate data and model bias, and improve the diversity of model responses.

The dataset was created using NeMo Data Designer, an enterprise-grade compound AI system for synthetic data generation. The pipeline combines a proprietary probabilistic graphical model (PGM), the Apache-2.0 licensed google/gemma-4-31B-it model, and the validation and evaluation methods included in Data Designer.

It contains 1M records and 7M personas across 26 fields: 7 persona fields, 6 persona attribute fields, 12 demographic and geographic contextual fields, and 1 unique identifier. Coverage spans 17 provinces and 252 districts, with 209K unique names (118 surnames, 21.4K given names). There are 7 persona types (professional, sports, arts, travel, culinary, family, concise), plus additional natural language persona attributes (cultural background, skills and expertise, career goals and ambitions, hobbies and interests).
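As a quick sanity check on the figures quoted above, the following minimal Python sketch confirms that the four field groups sum to 26 and that seven persona types per record give 7M personas for 1M records. The group labels in `FIELD_GROUPS` and the helper names are illustrative assumptions, not column names from the dataset. The optional `load_personas` helper assumes the Hugging Face `datasets` library is installed and that the repository exposes a `train` split, which this listing does not confirm.

```python
# Counts quoted in the description; the group labels are illustrative only.
RECORDS = 1_000_000
PERSONA_TYPES = [
    "professional", "sports", "arts", "travel", "culinary", "family", "concise",
]
FIELD_GROUPS = {
    "persona": 7,
    "persona_attribute": 6,
    "demographic_geographic": 12,
    "identifier": 1,
}


def total_fields(groups):
    """Sum the per-group field counts."""
    return sum(groups.values())


def total_personas(records, persona_types):
    """Each record carries one persona per persona type."""
    return records * len(persona_types)


def load_personas(repo_id="kiiop1/Nemotron-Personas-Korea", split="train"):
    """Stream the dataset from the Hugging Face Hub.

    Assumptions: the `datasets` package is installed and a split named
    "train" exists. Neither is stated in the listing above.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    return load_dataset(repo_id, split=split, streaming=True)


if __name__ == "__main__":
    print(total_fields(FIELD_GROUPS))
    print(total_personas(RECORDS, PERSONA_TYPES))
```

Streaming is used in `load_personas` so that inspecting a handful of records does not require downloading all 1M rows first; drop `streaming=True` if a full local copy is preferred.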

Provider:

kiiop1
