Name: atharva9967/data_jobs
Creator: atharva9967
Published: 2025-12-03 22:36:14
License: No description provided

Download link:

https://hf-mirror.com/datasets/atharva9967/data_jobs


Resource description:

---
license: apache-2.0
---

# 🧠 data_jobs Dataset

A dataset of real-world data analytics job postings from 2023, collected and processed by Luke Barousse.

## Background

I've been collecting data on data job postings since 2022. I've been using a bot to scrape the data from Google, which come from a variety of sources. You can find the full dataset at my app [datanerd.tech](https://datanerd.tech).

> [Serpapi](https://serpapi.com/) has kindly supported my work by providing me access to their API. Tell them I sent you and get 20% off paid plans.

## 📘 Data Dictionary

| Column Name | Description | Type | Source |
|-------------------------|-----------------------------------------------------------------------------|--------------|------------------|
| `job_title_short` | Cleaned/standardized job title using BERT model (10-class classification) | Calculated | From `job_title` |
| `job_title` | Full original job title as scraped | Raw | Scraped |
| `job_location` | Location string shown in job posting | Raw | Scraped |
| `job_via` | Platform the job was posted on (e.g., LinkedIn, Jobijoba) | Raw | Scraped |
| `job_schedule_type` | Type of schedule (Full-time, Part-time, Contractor, etc.) | Raw | Scraped |
| `job_work_from_home` | Whether the job is remote (`true`/`false`) | Boolean | Parsed |
| `search_location` | Location used by the bot to generate search queries | Generated | Bot logic |
| `job_posted_date` | Date and time when job was posted | Raw | Scraped |
| `job_no_degree_mention` | Whether the posting explicitly mentions no degree is required | Boolean | Parsed |
| `job_health_insurance` | Whether the job mentions health insurance | Boolean | Parsed |
| `job_country` | Country extracted from job location | Calculated | Parsed |
| `salary_rate` | Indicates if salary is annual or hourly | Raw | Scraped |
| `salary_year_avg` | Average yearly salary (calculated from salary ranges when available) | Calculated | Derived |
| `salary_hour_avg` | Average hourly salary (same logic as yearly) | Calculated | Derived |
| `company_name` | Company name listed in job posting | Raw | Scraped |
| `job_skills` | List of relevant skills extracted from job posting using PySpark | Parsed List | NLP Extracted |
| `job_type_skills` | Dictionary mapping skill types (e.g., 'cloud', 'libraries') to skill sets | Parsed Dict | NLP Extracted |
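The `job_skills` and `job_type_skills` columns are described as a parsed list and a parsed dictionary. Depending on the file format, such columns may arrive as string literals rather than Python objects. The following is a minimal sketch under that assumption: the helper `parse_literal` and the sample rows are hypothetical and only illustrate one way to recover the list and dict values with the standard library.

```python
import ast


def parse_literal(value):
    """Turn a string such as "['sql', 'python']" into a Python object.

    Non-string values (already parsed, or missing) are passed through
    unchanged; strings that are not valid literals become None.
    """
    if isinstance(value, str):
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return None
    return value


# Hypothetical rows, shaped like the data dictionary above.
rows = [
    {
        "job_title_short": "Data Analyst",
        "job_skills": "['sql', 'python']",
        "job_type_skills": "{'programming': ['sql', 'python']}",
    },
    {
        "job_title_short": "Data Engineer",
        "job_skills": None,  # postings without extracted skills
        "job_type_skills": None,
    },
]

for row in rows:
    row["job_skills"] = parse_literal(row["job_skills"])
    row["job_type_skills"] = parse_literal(row["job_type_skills"])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text that comes from scraped data.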

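The data dictionary says `salary_year_avg` and `salary_hour_avg` are calculated from salary ranges when available, but it does not give the formula. The sketch below assumes the average is the midpoint of the range's lower and upper bounds, falling back to whichever bound is present. The function name `range_average` and its inputs are hypothetical and are not taken from the dataset's processing code.

```python
from typing import Optional


def range_average(low: Optional[float], high: Optional[float]) -> Optional[float]:
    """Midpoint of a salary range (an assumed formula, not the author's).

    Returns None when no bound is available, and the single known bound
    when only one side of the range is present.
    """
    if low is None and high is None:
        return None
    if low is None:
        return high
    if high is None:
        return low
    return (low + high) / 2


# Hypothetical range for a posting advertising 80,000 to 100,000 per year.
yearly_avg = range_average(80_000, 100_000)
```

Under this assumption a posting with no salary information yields `None`, which matches the "when available" wording in the table.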

Use cases: