ShuoPang/synthetic-humans-1m
Source: Hugging Face. Updated 2026-02-11; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ShuoPang/synthetic-humans-1m
Description:
---
license: cc-by-4.0
---
This dataset contains 1 million synthetic humans, sampled from actual US demographics. It is primarily meant to seed diverse LLM responses, but can also be used for analytical purposes. The `qualitative_descriptions` column contains roughly 2.4 billion tokens, generated by `Qwen/QwQ-32B` with full reasoning traces.
A more detailed blog post on the methodology used to generate the dataset can be found here: https://www.skysight.inc/blog/synthetic-humans.
The dataset structure is as follows:
**Identifier**
- id: Unique identifier
**Demographics**
- age: age (18-100)
- gender: gender (M/F)
- location: city and state
- occupation_category: occupational category (job function)
- annual_wage: median annual wage in USD associated with occupation category (note: uses median annual wage of *all* occupation categories when not found)
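The fallback rule described for `annual_wage` can be sketched as a dictionary lookup with a default. This is only an illustration of the rule as stated above, not the authors' actual pipeline: the function name, the wage table, and the overall median value are all hypothetical.

```python
def assign_annual_wage(occupation_category, wage_by_category, overall_median):
    """Return the median wage for a category, or the overall median if the category is missing.

    All arguments are hypothetical stand-ins: `wage_by_category` maps an
    occupation category to its median annual wage in USD, and `overall_median`
    is the median annual wage across all occupation categories.
    """
    return wage_by_category.get(occupation_category, overall_median)


# Made-up figures for illustration only; these are not values from the dataset.
wages = {"Management": 100000, "Sales": 40000}
overall = 50000

known = assign_annual_wage("Sales", wages, overall)          # category found
unknown = assign_annual_wage("Beekeeping", wages, overall)   # falls back to the overall median
```

One consequence of this rule for analysis: rows whose occupation category was not found all share the same wage value, so they may be worth flagging before computing wage statistics.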
**LLM Generated Descriptions**
- qualitative_descriptions: qualitative descriptions generated by `Qwen/QwQ-32B`
**Structured Extractions from Descriptions**
All of the following columns are structured extractions from the `qualitative_descriptions` column, performed primarily using `meta-llama/Llama-3.1-8B-Instruct`, with `google/gemma-3-27b-it` used in harder cases. The extracted values may not match the original text exactly.
- demographic_summary
- background_story
- daily_life
- digital_behavior
- financial_situation
- values_and_beliefs
- challenges
- aspirations
- family_and_relationships
- personality
- political_beliefs
- education
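Since the dataset is primarily meant to seed diverse LLM responses, one way to use a row is to turn it into a persona prompt. The following is a minimal sketch under stated assumptions: the column names are taken from the structure listed above, while the helper names, the prompt wording, the choice of fields, and the `train` split name are hypothetical and not confirmed by the dataset card.

```python
def build_persona_prompt(record, fields=("demographic_summary", "personality", "values_and_beliefs")):
    """Build a persona prompt from one dataset row (a dict keyed by column name).

    The prompt layout is an illustrative choice, not a format prescribed by the dataset.
    """
    header = (
        f"You are a {record['age']}-year-old ({record['gender']}) from "
        f"{record['location']}, working in {record['occupation_category']}."
    )
    parts = [header]
    for name in fields:
        value = record.get(name)
        if value:  # skip extraction columns that are missing or empty
            parts.append(f"{name.replace('_', ' ').capitalize()}: {value}")
    return "\n".join(parts)


def stream_records(limit=3):
    """Stream a few rows from the Hub without downloading the full dataset.

    Requires the `datasets` package; the split name "train" is an assumption.
    """
    from datasets import load_dataset

    ds = load_dataset("ShuoPang/synthetic-humans-1m", split="train", streaming=True)
    return list(ds.take(limit))


# Hypothetical row for illustration; real rows come from stream_records().
example = {
    "age": 34,
    "gender": "F",
    "location": "Austin, Texas",
    "occupation_category": "Sales",
    "personality": "curious and outgoing",
}
prompt = build_persona_prompt(example)
```

Streaming is used here because the full descriptions column is large; for analytical work on the demographic columns alone, loading only those columns may be more practical.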
Provider:
ShuoPang



