anlee-0618/Nemotron-Personas-USA-10K
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/anlee-0618/Nemotron-Personas-USA-10K
Resource description:
---
dataset_info:
  features:
  - name: uuid
    dtype: string
  - name: professional_persona
    dtype: string
  - name: sports_persona
    dtype: string
  - name: arts_persona
    dtype: string
  - name: travel_persona
    dtype: string
  - name: culinary_persona
    dtype: string
  - name: persona
    dtype: string
  - name: cultural_background
    dtype: string
  - name: skills_and_expertise
    dtype: string
  - name: skills_and_expertise_list
    dtype: string
  - name: hobbies_and_interests
    dtype: string
  - name: hobbies_and_interests_list
    dtype: string
  - name: career_goals_and_ambitions
    dtype: string
  - name: sex
    dtype: string
  - name: age
    dtype: int64
  - name: marital_status
    dtype: string
  - name: education_level
    dtype: string
  - name: bachelors_field
    dtype: string
  - name: occupation
    dtype: string
  - name: city
    dtype: string
  - name: state
    dtype: string
  - name: zipcode
    dtype: string
  - name: country
    dtype: string
  splits:
  - name: train
    num_bytes: 53780592
    num_examples: 10000
  download_size: 26552972
  dataset_size: 53780592
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Diverse Nemotron-Personas-USA 10K Subset
A 10,000-sample, diversity-focused subset of NVIDIA’s **Nemotron-Personas-USA** synthetic persona dataset, curated to provide a compact but representative slice of U.S.-aligned synthetic personas for research and model development.
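
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and that the repository ID matches the one in the download link above. The helper names `load_subset` and `check_size` are illustrative and not part of any published API.

```python
# Repo ID taken from the download link above; assumed to be the Hub dataset ID.
REPO_ID = "anlee-0618/Nemotron-Personas-USA-10K"
# num_examples of the train split, as stated in the card metadata.
EXPECTED_ROWS = 10_000


def load_subset(split: str = "train"):
    """Download (or reuse the local cache of) the requested split."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)


def check_size(num_rows: int, expected: int = EXPECTED_ROWS) -> bool:
    """Return True when the row count matches the card metadata."""
    return num_rows == expected


# Example usage (requires network access):
#   ds = load_subset()
#   assert check_size(ds.num_rows)
```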
---
## 1. Dataset Summary
- **Name:** Diverse Nemotron-Personas-USA 10K Subset
- **Source dataset:** `nvidia/Nemotron-Personas-USA`
- **Records:** 10,000 personas
- **Modality:** Text
- **Primary language:** English
- **License:** CC BY 4.0 (see [License](#6-license) and [Attribution](#7-attribution))
- **Intended users:** Researchers, practitioners, and developers
All personas are **synthetically generated**; any resemblance to real individuals is coincidental.
---
## 2. Origin and Relationship to Nemotron-Personas-USA
This dataset is a **downstream curated subset** of the open-source [Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) dataset created by **NVIDIA Corporation**.
Key points:
- The original dataset contains **1M records** and **0.94B tokens** of synthetic U.S.-aligned personas.
- Personas are grounded in **real-world demographic distributions** (e.g., age, geography, education, occupation) derived from U.S. Census data and related public statistical sources.
- NVIDIA produced the original dataset using **NeMo Data Designer**, a compound AI system, and the `openai/gpt-oss-120b` model.
This 10K subset:
- Is **strictly sampled** from the original `nvidia/Nemotron-Personas-USA` dataset.
- Does **not** introduce any new content beyond the original personas and fields (except for this documentation and any optional metadata added by the curator).
- Inherits the **same license obligations** and **attribution requirements** as the source dataset (CC BY 4.0).
For full details on the original dataset, consult the upstream dataset card on Hugging Face.
---
## 3. Sampling & Curation
TODO
---
## 4. Data Fields
The original Nemotron-Personas-USA dataset defines **22 fields** per record (**6 persona fields** and **16 contextual fields**) plus a unique identifier, `uuid`, for 23 columns in total. This 10K subset preserves that structure unless otherwise noted.
The fields listed in this subset's metadata, matching the upstream schema, are:
- **uuid:** Unique identifier for each persona (string).
- **professional_persona:** Free-text description of the persona’s professional profile.
- **sports_persona:** Free-text description of sports/fitness preferences and habits.
- **arts_persona:** Free-text description of arts, culture, and creative interests.
- **travel_persona:** Free-text description of travel style and preferences.
- **culinary_persona:** Free-text description of food, cooking, and dining habits.
- **persona:** High-level summary persona description.
- **cultural_background:** Narrative description of cultural, regional, and identity context.
- **skills_and_expertise:** Free-text overview of skills and expertise.
- **skills_and_expertise_list:** List-like string of specific skills.
- **hobbies_and_interests:** Free-text description of hobbies.
- **hobbies_and_interests_list:** List-like string of hobbies.
- **career_goals_and_ambitions:** Free-text description of medium/long-term goals.
- **sex:** Sex assigned in the dataset (e.g., "Male", "Female"), based on Census-aligned distributions.
- **age:** Integer age (adults only in the original dataset).
- **marital_status:** Categorical marital status (e.g., `never_married`, `married_present`, `divorced`, `widowed`).
- **education_level:** Categorical education level (e.g., `high_school`, `bachelors`, `graduate`, `some_college`).
- **bachelors_field:** For some personas, field of bachelor’s degree (e.g., `stem`, `education`, `business`).
- **occupation:** Categorical occupation label (over 560 occupations in the original dataset).
- **city, state, zipcode, country:** Geographic context aligned with U.S. Census distributions.
Consult the original Nemotron-Personas-USA dataset card for authoritative definitions of each field. Any deviations in this subset (e.g., dropped fields, additional metadata) should be documented in this section by the curator.
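
The field list above can be turned into a quick schema check. The sketch below is illustrative only: the helper names are hypothetical, and `parse_list_field` assumes the `*_list` columns hold Python-literal list strings (the card only says "list-like"), falling back to a comma split when that assumption fails.

```python
import ast

PERSONA_FIELDS = [
    "professional_persona", "sports_persona", "arts_persona",
    "travel_persona", "culinary_persona", "persona",
]
CONTEXT_FIELDS = [
    "cultural_background", "skills_and_expertise", "skills_and_expertise_list",
    "hobbies_and_interests", "hobbies_and_interests_list",
    "career_goals_and_ambitions", "sex", "age", "marital_status",
    "education_level", "bachelors_field", "occupation",
    "city", "state", "zipcode", "country",
]
# 1 identifier + 6 persona fields + 16 contextual fields = 23 columns.
EXPECTED_COLUMNS = ["uuid", *PERSONA_FIELDS, *CONTEXT_FIELDS]


def missing_columns(column_names):
    """Return the expected columns that are absent, in schema order."""
    present = set(column_names)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def parse_list_field(raw: str) -> list[str]:
    """Parse a list-like string into a list of stripped strings.

    Assumption: values look like "['a', 'b']". If literal parsing fails,
    fall back to splitting on commas.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        value = None
    if isinstance(value, (list, tuple)):
        return [str(item).strip() for item in value]
    return [part.strip() for part in raw.split(",") if part.strip()]


# Example usage with a loaded dataset `ds` (hypothetical variable):
#   assert missing_columns(ds.column_names) == []
#   skills = parse_list_field(ds[0]["skills_and_expertise_list"])
```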
---
## 5. Intended Use & Limitations
### Intended Use
This 10K subset is designed to be a **compact, diverse benchmark and training resource** for:
- Evaluating persona-aware and instruction-following LLMs.
- Testing **fairness, bias, and coverage** across demographic axes.
- Simulating user populations for:
  - Recommendation and personalization experiments
  - Agent-based simulations
  - Synthetic evaluation of dialog systems
- Rapid prototyping in environments where the full 2.6 GB / 1M-record dataset is impractical.
Both **commercial** and **non-commercial** uses are permitted, subject to the terms of the **CC BY 4.0** license and proper attribution.
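
As a rough illustration of the coverage-testing and population-simulation uses listed above, the sketch below tallies how records are distributed along one demographic axis and renders a simple system prompt from a record. The function names, the prompt wording, and the toy rows are all hypothetical; only the column names come from the schema.

```python
from collections import Counter


def axis_shares(records, axis):
    """Return the fraction of records per value of one demographic axis."""
    counts = Counter(record.get(axis) for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


def persona_system_prompt(record):
    """Render an illustrative system prompt from one persona record."""
    return (
        f"You are role-playing this synthetic persona: {record['persona']} "
        f"Occupation: {record['occupation']}. "
        f"Location: {record['city']}, {record['state']}."
    )


# Made-up toy rows for demonstration; these are NOT records from the dataset.
SAMPLE = [
    {"sex": "Female", "occupation": "nurse", "city": "Austin", "state": "TX",
     "persona": "A night-shift nurse who enjoys trail running."},
    {"sex": "Male", "occupation": "welder", "city": "Tulsa", "state": "OK",
     "persona": "A welder who restores old motorcycles."},
    {"sex": "Female", "occupation": "teacher", "city": "Boise", "state": "ID",
     "persona": "A teacher who coaches the debate club."},
    {"sex": "Male", "occupation": "chef", "city": "Miami", "state": "FL",
     "persona": "A chef who experiments with fermentation."},
]

shares = axis_shares(SAMPLE, "sex")
prompt = persona_system_prompt(SAMPLE[0])
```

With a loaded dataset, the same functions can be applied to the real rows and to any categorical column, such as `education_level` or `marital_status`.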
### Limitations
- **Synthetic data:**
  - All personas are *artificially generated*. They do not correspond to real individuals, even if names, professions, or locations resemble real-world entities.
- **Model & seed data constraints:**
  - Persona distributions are grounded in U.S. Census and related statistical sources, but remain approximations subject to modeling assumptions (e.g., independence assumptions between some attributes).
- **U.S.-centric:**
  - This dataset is **U.S.-focused** (plus Puerto Rico and Virgin Islands) and is not representative of global population distributions.
- **Not ground truth for sensitive attributes:**
  - While useful for diversity and robustness testing, this dataset should not be treated as ground-truth demographic data for individuals or used for real-person profiling.
Developers remain responsible for evaluating suitability, risk, and compliance in their specific domain (e.g., healthcare, finance, hiring).
---
## 6. License
The original dataset `nvidia/Nemotron-Personas-USA` is licensed under:
> **Creative Commons Attribution 4.0 International (CC BY 4.0)**
This curated 10K subset is **also released under CC BY 4.0**, consistent with the upstream license. (CC BY 4.0 requires attribution but does not itself require derivatives to use the same license.)
In practice, this means:
- You **may**:
  - Use, share, and adapt the dataset.
  - Use it for commercial and non-commercial purposes.
- You **must**:
  - Provide **appropriate credit** to NVIDIA and the original dataset authors.
  - Indicate if you made changes (e.g., “sampled, filtered, or transformed”).
  - Include a link to the CC BY 4.0 license and, where practical, to the original dataset.
For full legal terms, see:
[Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)
---
## 7. Attribution
When using this 10K subset in research, publications, or products, please:
1. Attribute both the **original dataset** and this **curated subset**.
2. Clearly indicate that your dataset is a **derived subset** of Nemotron-Personas-USA.
3. Include the official citation provided by NVIDIA.
### Recommended Attribution Text
> This work uses a curated 10K subset derived from the Nemotron-Personas-USA dataset by NVIDIA Corporation. Nemotron-Personas-USA is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Our subset is also released under CC BY 4.0, and all personas remain synthetic and non-identifying.
### Upstream Citation (as provided by NVIDIA)
```bibtex
@software{nvidia/Nemotron-Personas-USA,
  author = {Meyer, Yev and Corneil, Dane},
  title  = {{Nemotron-Personas-USA}: Synthetic Personas Aligned to Real-World Distributions},
  month  = {June},
  year   = {2025},
  url    = {https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA}
}
```
Provider:
anlee-0618



