anlee-0618/Nemotron-Personas-USA-10K

Name: anlee-0618/Nemotron-Personas-USA-10K
Creator: anlee-0618
Published: 2026-04-08 03:49:42
License: No description provided

Hugging Face, updated 2026-04-08, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/anlee-0618/Nemotron-Personas-USA-10K


Description:

---
dataset_info:
  features:
  - name: uuid
    dtype: string
  - name: professional_persona
    dtype: string
  - name: sports_persona
    dtype: string
  - name: arts_persona
    dtype: string
  - name: travel_persona
    dtype: string
  - name: culinary_persona
    dtype: string
  - name: persona
    dtype: string
  - name: cultural_background
    dtype: string
  - name: skills_and_expertise
    dtype: string
  - name: skills_and_expertise_list
    dtype: string
  - name: hobbies_and_interests
    dtype: string
  - name: hobbies_and_interests_list
    dtype: string
  - name: career_goals_and_ambitions
    dtype: string
  - name: sex
    dtype: string
  - name: age
    dtype: int64
  - name: marital_status
    dtype: string
  - name: education_level
    dtype: string
  - name: bachelors_field
    dtype: string
  - name: occupation
    dtype: string
  - name: city
    dtype: string
  - name: state
    dtype: string
  - name: zipcode
    dtype: string
  - name: country
    dtype: string
  splits:
  - name: train
    num_bytes: 53780592
    num_examples: 10000
  download_size: 26552972
  dataset_size: 53780592
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Diverse Nemotron-Personas-USA 10K Subset

A 10,000-sample, diversity-focused subset of NVIDIA’s **Nemotron-Personas-USA** synthetic persona dataset, curated to provide a compact but representative slice of U.S.-aligned synthetic personas for research and model development.

---

## 1. Dataset Summary

- **Name:** Diverse Nemotron-Personas-USA 10K Subset
- **Source dataset:** `nvidia/Nemotron-Personas-USA`
- **Records:** 10,000 personas
- **Modality:** Text
- **Primary language:** English
- **License:** CC BY 4.0 (see [License](#6-license) and [Attribution](#7-attribution))
- **Intended users:** Researchers, practitioners, and developers

All personas are **synthetically generated**; any resemblance to real individuals is coincidental.

---

## 2. Origin and Relationship to Nemotron-Personas-USA

This dataset is a **downstream curated subset** of the open-source [Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) dataset created by **NVIDIA Corporation**.

Key points:

- The original dataset contains **1M records** and **0.94B tokens** of synthetic U.S.-aligned personas.
- Personas are grounded in **real-world demographic distributions** (e.g., age, geography, education, occupation) derived from U.S. Census data and related public statistical sources.
- NVIDIA produced the original dataset using **NeMo Data Designer**, a compound AI system, and the `openai/gpt-oss-120b` model.

This 10K subset:

- Is **strictly sampled** from the original `nvidia/Nemotron-Personas-USA` dataset.
- Does **not** introduce any new content beyond the original personas and fields (except for this documentation and any optional metadata added by the curator).
- Inherits the **same license obligations** and **attribution requirements** as the source dataset (CC BY 4.0).

For full details on the original dataset, consult the upstream dataset card on Hugging Face.

---

## 3. Sampling & Curation

TODO

---

## 4. Data Fields

The original Nemotron-Personas-USA dataset defines **22 fields** per record: **6 persona fields** and **16 contextual fields** (plus a unique identifier). This 10K subset preserves that structure unless otherwise noted.

Common fields include (non-exhaustive illustrative list, matching the upstream schema):

- **uuid:** Unique identifier for each persona (string).
- **professional_persona:** Free-text description of the persona’s professional profile.
- **sports_persona:** Free-text description of sports/fitness preferences and habits.
- **arts_persona:** Free-text description of arts, culture, and creative interests.
- **travel_persona:** Free-text description of travel style and preferences.
- **culinary_persona:** Free-text description of food, cooking, and dining habits.
- **persona:** High-level summary persona description.
- **cultural_background:** Narrative description of cultural, regional, and identity context.
- **skills_and_expertise:** Free-text overview of skills and expertise.
- **skills_and_expertise_list:** List-like string of specific skills.
- **hobbies_and_interests:** Free-text description of hobbies.
- **hobbies_and_interests_list:** List-like string of hobbies.
- **career_goals_and_ambitions:** Free-text description of medium/long-term goals.
- **sex:** Sex assigned in the dataset (e.g., "Male", "Female"), based on Census-aligned distributions.
- **age:** Integer age (adults only in the original dataset).
- **marital_status:** Categorical marital status (e.g., `never_married`, `married_present`, `divorced`, `widowed`, etc.).
- **education_level:** Categorical education level (e.g., `high_school`, `bachelors`, `graduate`, `some_college`, etc.).
- **bachelors_field:** For some personas, field of bachelor’s degree (e.g., `stem`, `education`, `business`).
- **occupation:** Categorical occupation label (over 560 occupations in the original dataset).
- **city, state, zipcode, country:** Geographic context aligned with U.S. Census distributions.

Consult the original Nemotron-Personas-USA dataset card for authoritative definitions of each field. Any deviations in this subset (e.g., dropped fields, additional metadata) should be documented in this section by the curator.

---

## 5. Intended Use & Limitations

### Intended Use

This 10K subset is designed to be a **compact, diverse benchmark and training resource** for:

- Evaluating persona-aware and instruction-following LLMs.
- Testing **fairness, bias, and coverage** across demographic axes.
- Simulating user populations for:
  - Recommendation and personalization experiments
  - Agent-based simulations
  - Synthetic evaluation of dialog systems
- Rapid prototyping in environments where the full 2.6GB / 1M-record dataset is impractical.
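
The coverage testing listed above can be sketched as a small tabulation over demographic columns. This is a minimal illustration and not part of the original card: `coverage_report` is a hypothetical helper, and the three toy records are made-up stand-ins for dataset rows. Only the column names (`sex`, `education_level`, `state`) come from the schema.

```python
from collections import Counter

# Toy stand-ins for dataset rows (hypothetical values; only the keys
# match the schema listed in this card).
records = [
    {"sex": "Female", "education_level": "bachelors", "state": "CA"},
    {"sex": "Male", "education_level": "high_school", "state": "TX"},
    {"sex": "Female", "education_level": "graduate", "state": "CA"},
]


def coverage_report(rows, columns):
    """Return {column: {value: share of rows}} for each requested column."""
    report = {}
    for column in columns:
        counts = Counter(row[column] for row in rows)
        total = sum(counts.values())
        # An empty input yields an empty mapping, so no division by zero occurs.
        report[column] = {value: n / total for value, n in counts.items()}
    return report


report = coverage_report(records, ["sex", "education_level", "state"])
for column, shares in report.items():
    print(column, shares)
```

With the `datasets` library installed, the same helper should work on real rows, for example `coverage_report(load_dataset("anlee-0618/Nemotron-Personas-USA-10K", split="train"), ["sex", "state"])`, since iterating over a `Dataset` yields one dict per row.
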

Both **commercial** and **non-commercial** uses are permitted, subject to the terms of the **CC BY 4.0** license and proper attribution.

### Limitations

- **Synthetic data:**
  - All personas are *artificially generated*. They do not correspond to real individuals, even if names, professions, or locations resemble real-world entities.
- **Model & seed data constraints:**
  - Persona distributions are grounded in U.S. Census and related statistical sources, but remain approximations subject to modeling assumptions (e.g., independence assumptions between some attributes).
- **U.S.-centric:**
  - This dataset is **U.S.-focused** (plus Puerto Rico and Virgin Islands) and is not representative of global population distributions.
- **Not ground truth for sensitive attributes:**
  - While useful for diversity and robustness testing, this dataset should not be treated as ground-truth demographic data for individuals or used for real-person profiling.

Developers remain responsible for evaluating suitability, risk, and compliance in their specific domain (e.g., healthcare, finance, hiring).

---

## 6. License

The original dataset `nvidia/Nemotron-Personas-USA` is licensed under:

> **Creative Commons Attribution 4.0 International (CC BY 4.0)**

This curated 10K subset is **also released under CC BY 4.0**, matching the upstream license.

In practice, this means:

- You **may**:
  - Use, share, and adapt the dataset.
  - Use it for commercial and non-commercial purposes.
- You **must**:
  - Provide **appropriate credit** to NVIDIA and the original dataset authors.
  - Indicate if you made changes (e.g., “sampled, filtered, or transformed”).
  - Include a link to the CC BY 4.0 license and, where practical, to the original dataset.

For full legal terms, see: [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)

---

## 7. Attribution

When using this 10K subset in research, publications, or products, please:

1. Attribute both the **original dataset** and this **curated subset**.
2. Clearly indicate that your dataset is a **derived subset** of Nemotron-Personas-USA.
3. Include the official citation provided by NVIDIA.

### Recommended Attribution Text

> This work uses a curated 10K subset derived from the Nemotron-Personas-USA dataset by NVIDIA Corporation. Nemotron-Personas-USA is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). Our subset is also released under CC BY 4.0, and all personas remain synthetic and non-identifying.

### Upstream Citation (as provided by NVIDIA)

```bibtex
@software{nvidia/Nemotron-Personas-USA,
  author = {Meyer, Yev and Corneil, Dane},
  title  = {{Nemotron-Personas-USA}: Synthetic Personas Aligned to Real-World Distributions},
  month  = {June},
  year   = {2025},
  url    = {https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA}
}
```
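
As a quick sanity check on the schema declared in the YAML header of this card (22 string features plus the `int64` feature `age`), a loaded record can be compared against the expected column names and types. The sketch below is an illustration only: `validate_record` and the placeholder record are hypothetical, and the column list is copied from the header rather than read from the dataset.

```python
# Column names copied from the YAML header of this card.
STRING_COLUMNS = [
    "uuid", "professional_persona", "sports_persona", "arts_persona",
    "travel_persona", "culinary_persona", "persona", "cultural_background",
    "skills_and_expertise", "skills_and_expertise_list",
    "hobbies_and_interests", "hobbies_and_interests_list",
    "career_goals_and_ambitions", "sex", "marital_status", "education_level",
    "bachelors_field", "occupation", "city", "state", "zipcode", "country",
]
EXPECTED_SCHEMA = {name: str for name in STRING_COLUMNS}
EXPECTED_SCHEMA["age"] = int  # the only int64 feature in the header


def validate_record(record):
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    for name, expected_type in EXPECTED_SCHEMA.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    for name in sorted(record.keys() - EXPECTED_SCHEMA.keys()):
        problems.append(f"unexpected field: {name}")
    return problems


# Hypothetical record built only to exercise the check.
sample = {name: "placeholder" for name in STRING_COLUMNS}
sample["age"] = 42
print(validate_record(sample))
```

If a column such as `bachelors_field` turns out to contain nulls in practice, this strict check would flag those rows, so the type test may need to be relaxed for optional fields.
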

Provider:

anlee-0618
