
yachty66/synthetic_residents_san_francisco

Hugging Face · Updated 2024-05-07 · Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/yachty66/synthetic_residents_san_francisco
Resource description:
# City Population Simulation

This repository provides a dataset of synthetic residents of San Francisco. All related code is in the following GitHub repository: https://github.com/yachty66/city_population_simulation.

## Method

The dataset is based on US Census Bureau data from [San Francisco County, California Census Data](https://data.census.gov/profile/San_Francisco_County,_California?g=050XX00US06075). In the first step, JSON objects representing the real data were created. A full object has the keys `age`, `gender`, `native_or_foreign_born`, `income`, `degree`, `employment`, `industry`, and `married_status`. An example:

```json
{
  "age": 26,
  "gender": "Male",
  "native_or_foreign_born": "US citizen",
  "income": 164030,
  "degree": "graduate_or_professional_degree",
  "employment": "Employed",
  "industry": "Educational services, and health care and social assistance",
  "married_status": "Married"
}
```

Not all objects contain the same keys, because the following conditional logic was applied during generation:

- **Income and Education**: These attributes are only included for residents aged 25 and above.
- **Employment**: Included for residents aged 16 and above.
- **Marital Status**: Only considered for residents aged 18 and above.
- **Industry**: Only considered if the resident is also employed.
- **Degree**: This attribute is included for individuals aged 25 and above.

This conditional inclusion of attributes ensures that the dataset realistically mirrors the varying availability of demographic data across different age groups. The created objects represent 1% of the real population of San Francisco, which keeps the cost of the later LLM step down. After all objects were created, an LLM was used to write a description for each individual JSON object based on its values.

An example object with a simulated persona looks as follows:

```json
{
  "age": 26,
  "gender": "Male",
  "native_or_foreign_born": "US citizen",
  "income": 164030,
  "degree": "graduate_or_professional_degree",
  "employment": "Employed",
  "industry": "Educational services, and health care and social assistance",
  "married_status": "Married",
  "description": "You are a 26-year-old male resident of San Francisco, USA, who is a US citizen with an income of $164,030. You hold a graduate or professional degree, are employed in the educational services, health care, and social assistance industry, and are married."
}
```
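The age-based rules in the Method section can be sketched in a few lines of Python. This is only an illustration of the stated conditions, not the code from the linked repository: the function name `applicable_keys` and its `employed` parameter are hypothetical, and it assumes "aged N and above" means `age >= N` and that `industry` requires the resident to be both old enough to have an employment status and employed.

```python
def applicable_keys(age: int, employed: bool = False) -> list[str]:
    """Return the keys a resident object would carry under the stated rules.

    Hypothetical helper: the thresholds come from the README, the function
    itself does not.
    """
    # These three attributes are present for every resident.
    keys = ["age", "gender", "native_or_foreign_born"]
    if age >= 25:
        # Income and education (degree) only for residents aged 25 and above.
        keys += ["income", "degree"]
    if age >= 16:
        # Employment status only for residents aged 16 and above.
        keys.append("employment")
        if employed:
            # Industry only if the resident is also employed (assumption:
            # this implies the resident has an employment status at all).
            keys.append("industry")
    if age >= 18:
        # Marital status only for residents aged 18 and above.
        keys.append("married_status")
    return keys


print(applicable_keys(26, employed=True))
print(applicable_keys(17, employed=True))
print(applicable_keys(10))
```

For a 26-year-old employed resident this yields all eight keys of the full example object; a 10-year-old keeps only `age`, `gender`, and `native_or_foreign_born`.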
Provided by:
yachty66
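Summary of original information

Dataset overview

Dataset name

City Population Simulation

Dataset contents

The dataset contains information on simulated residents of San Francisco. Each resident is stored as a JSON object with the following attributes:

  • Age
  • Gender
  • Place of birth (native or foreign born)
  • Income (residents aged 25 and above only)
  • Education (residents aged 25 and above only)
  • Employment status (residents aged 16 and above only)
  • Industry (employed residents only)
  • Marital status (residents aged 18 and above only)
  • Description (a personal description generated from the attributes above)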
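Because several attributes are only present above certain ages, code that consumes the records should not assume every key exists. The sketch below averages `income` over only the records that have it; the helper name `mean_income` is hypothetical, and the second sample record is made up purely for illustration.

```python
from statistics import mean


def mean_income(records: list[dict]):
    """Average income over the records that actually carry an 'income' key."""
    incomes = [r["income"] for r in records if "income" in r]
    # Return None rather than raising when no record has an income.
    return mean(incomes) if incomes else None


# Illustrative records only: the first reuses values from the example object,
# the second is a made-up minor without income-related keys.
sample = [
    {"age": 26, "gender": "Male", "income": 164030},
    {"age": 12, "gender": "Female"},
]

print(mean_income(sample))
```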

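Dataset generation method

The dataset is generated from US Census Bureau data for San Francisco County. First, JSON objects are created from the real data, with conditional logic determining which attributes are included for residents of each age group. Finally, a large language model generates a personal description for each JSON object based on its values.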
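The description step can be pictured as turning each JSON object into a prompt for the language model. The source does not give the actual prompt or say which model was used, so the wording and the function name `build_description_prompt` below are assumptions, and no model call is made.

```python
import json


def build_description_prompt(record: dict) -> str:
    """Build a prompt asking an LLM to describe one resident.

    Hypothetical wording: the prompt used for the real dataset is not
    documented in the source.
    """
    return (
        "Write a short second-person description of a synthetic resident of "
        "San Francisco, USA, using only the attributes in this JSON object:\n"
        + json.dumps(record, indent=2)
    )


# Attribute values taken from the example object in the dataset description.
record = {
    "age": 26,
    "gender": "Male",
    "native_or_foreign_born": "US citizen",
    "employment": "Employed",
    "married_status": "Married",
}

prompt = build_description_prompt(record)
print(prompt)
```

The model's reply would then be stored under the `description` key of the same object.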

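Dataset size

The dataset represents 1% of the actual population of San Francisco, which reduces the cost of the subsequent large language model step.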
