ChaNation/Nemotron-Personas-Korea

Source: Hugging Face. Updated 2026-04-25; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/ChaNation/Nemotron-Personas-Korea
Description:

Nemotron-Personas-Korea is an open-source persona dataset synthesized based on real-world demographic, geographic, and personality trait distributions of South Korea, designed to broadly reflect the diversity and characteristics of the South Korean population. As the first large-scale Korean-language persona dataset, it includes attributes such as name, sex, age, marital status, education level, occupation, and region of residence, all synthesized using official statistics from the Korean Statistical Information Service (KOSIS), the Supreme Court of Korea, the National Health Insurance Service, the Korea Rural Economic Institute, and NAVER Cloud. The dataset supports South Korean model builders in developing Sovereign AI systems that incorporate important region-specific demographics and cultural context, and can be used to expand the diversity of synthetic data, mitigate data and model bias, and improve the diversity of model responses. It was created using NeMo Data Designer, an enterprise-grade compound AI system for synthetic data generation, and contains 1 million records with 26 fields (including 7 persona fields, 6 persona attribute fields, 12 demographic and geographic contextual fields, and a unique identifier), covering 17 provinces and 252 districts, with 209,000 unique names and 7 persona types (e.g., professional, sports, arts, travel, culinary, family, concise). The dataset is licensed under CC BY 4.0, available for both commercial and non-commercial use, and includes only personas of adult age (19 years and older by South Korean law).
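The structure described above can be sanity-checked with a short Python sketch. It is illustrative only: the field-group counts and the minimum age come from the description, while the column name `age`, the split name `train`, and the helper names are assumptions that should be verified against the actual dataset card before use.

```python
# Field-group sizes as stated in the description above.
FIELD_GROUPS = {
    "persona": 7,
    "persona_attribute": 6,
    "demographic_geographic_context": 12,
    "unique_identifier": 1,
}
EXPECTED_FIELD_COUNT = 26  # total number of fields stated in the description
MIN_AGE = 19               # adult age under South Korean law, per the description


def field_groups_add_up(groups=FIELD_GROUPS, expected=EXPECTED_FIELD_COUNT):
    """Return True if the documented field groups sum to the documented total."""
    return sum(groups.values()) == expected


def is_adult(record, age_key="age"):
    """Return True if a record meets the documented minimum age.

    `age_key` is a hypothetical column name; check the real schema first.
    """
    return int(record[age_key]) >= MIN_AGE


def load_personas(repo_id="ChaNation/Nemotron-Personas-Korea", split="train"):
    """Stream the dataset from the Hugging Face Hub.

    Requires the third-party `datasets` package and network access.
    The split name "train" is an assumption, not confirmed by the description.
    """
    from datasets import load_dataset  # imported lazily so the helpers above stay stdlib-only

    return load_dataset(repo_id, split=split, streaming=True)
```

Streaming avoids downloading all 1 million records up front; for example, `next(iter(load_personas()))` would fetch a single record, which can then be inspected for its real column names before relying on `is_adult`.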
Provider:
ChaNation