
Daniel10004/Nemotron-Personas-Korea

Source: Hugging Face | Updated: 2026-04-29 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/Daniel10004/Nemotron-Personas-Korea
Description:

Nemotron-Personas-Korea is an open-source persona dataset (CC BY 4.0) synthesized from real-world demographic, geographic, and personality trait distributions of South Korea. It is designed to broadly reflect the diversity and characteristics of the South Korean population. As the first large-scale Korean-language persona dataset, it includes attributes such as name, sex, age, marital status, education level, occupation, and region of residence, all synthesized using official statistics from the Korean Statistical Information Service (KOSIS), the Supreme Court of Korea, the National Health Insurance Service, the Korea Rural Economic Institute, and NAVER Cloud.

Nemotron-Personas-Korea supports South Korean model builders in developing Sovereign AI systems that incorporate important region-specific demographics and cultural context. The dataset can be used to expand the diversity of synthetic data for sovereign AI model development, mitigate data and model bias, and improve the diversity of model responses. In particular, compared to existing persona datasets, it is designed to reflect real population distributions more faithfully across multiple dimensions, including age (e.g., elderly populations), region (e.g., rural areas), education level, and occupation.

The dataset was created using NeMo Data Designer, an enterprise-grade compound AI system for synthetic data generation. It leverages a proprietary probabilistic graphical model (PGM), the Apache-2.0 licensed google/gemma-4-31B-it model, and the validation and evaluation methods included in Data Designer.
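The claim that the personas follow real population distributions can be spot-checked by computing the share of records per attribute value and comparing it with published statistics. The following is a minimal sketch of that check, not part of the dataset's tooling. The field names `age` and `region`, the helper names `attribute_shares` and `age_band`, and the toy records are all assumptions made for illustration; the actual column names should be taken from the dataset itself.

```python
from collections import Counter
from typing import Iterable, Mapping


def attribute_shares(records: Iterable[Mapping[str, object]], field: str) -> dict[str, float]:
    """Return the fraction of records that take each value of `field`.

    Records that lack the field are ignored.
    """
    counts = Counter(str(r[field]) for r in records if field in r)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


def age_band(age: int, width: int = 10) -> str:
    """Map an age to a band label such as '70-79'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical toy records: the field names and values are assumptions,
# not rows taken from Nemotron-Personas-Korea.
toy_personas = [
    {"age": 72, "region": "rural"},
    {"age": 34, "region": "urban"},
    {"age": 68, "region": "rural"},
    {"age": 45, "region": "urban"},
]

region_shares = attribute_shares(toy_personas, "region")
band_shares = attribute_shares(
    ({"band": age_band(p["age"])} for p in toy_personas), "band"
)

print(region_shares)
print(band_shares)
```

With real data, the same functions could be applied to rows obtained through the Hugging Face `datasets` library, and the resulting shares compared against the corresponding KOSIS figures for age, region, education level, or occupation.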
Provider:
Daniel10004