atharva9967/data_jobs
Source: Hugging Face. Updated 2025-12-03; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/atharva9967/data_jobs
Resource description:
---
license: apache-2.0
---
# 🧠 data_jobs Dataset
A dataset of real-world data analytics job postings from 2023, collected and processed by Luke Barousse.
## Background
I've been collecting data on data job postings since 2022. A bot scrapes the postings from Google, which surfaces them from a variety of sources.
You can find the full dataset at my app [datanerd.tech](https://datanerd.tech).
> [SerpApi](https://serpapi.com/) has kindly supported my work by giving me access to their API. Tell them I sent you and get 20% off paid plans.
## 📘 Data Dictionary
| Column Name | Description | Type | Source |
|-------------------------|-----------------------------------------------------------------------------|--------------|------------------|
| `job_title_short` | Cleaned, standardized job title assigned by a BERT model (10-class classification) | Calculated | From `job_title` |
| `job_title` | Full original job title as scraped | Raw | Scraped |
| `job_location` | Location string shown in job posting | Raw | Scraped |
| `job_via` | Platform the job was posted on (e.g., LinkedIn, Jobijoba) | Raw | Scraped |
| `job_schedule_type` | Type of schedule (Full-time, Part-time, Contractor, etc.) | Raw | Scraped |
| `job_work_from_home` | Whether the job is remote (`true`/`false`) | Boolean | Parsed |
| `search_location` | Location used by the bot to generate search queries | Generated | Bot logic |
| `job_posted_date` | Date and time when job was posted | Raw | Scraped |
| `job_no_degree_mention` | Whether the posting explicitly mentions no degree is required | Boolean | Parsed |
| `job_health_insurance` | Whether the job mentions health insurance | Boolean | Parsed |
| `job_country` | Country extracted from job location | Calculated | Parsed |
| `salary_rate` | Indicates if salary is annual or hourly | Raw | Scraped |
| `salary_year_avg` | Average yearly salary (calculated from salary ranges when available) | Calculated | Derived |
| `salary_hour_avg` | Average hourly salary (same logic as yearly) | Calculated | Derived |
| `company_name` | Company name listed in job posting | Raw | Scraped |
| `job_skills` | List of relevant skills extracted from job posting using PySpark | Parsed List | NLP Extracted |
| `job_type_skills` | Dictionary mapping skill types (e.g., 'cloud', 'libraries') to skill sets | Parsed Dict | NLP Extracted |
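
The data dictionary describes booleans, a parsed list, and a parsed dict, but depending on how the data is exported these values may arrive as plain strings. The sketch below is one way to convert such a row back into Python types. It assumes string-encoded values (for example `"True"` or `"['sql', 'python']"`), which is an assumption and not something this card states. The helper name `normalize_row` and the example row are hypothetical.

```python
import ast

# Column groups taken from the data dictionary above.
BOOL_COLUMNS = ("job_work_from_home", "job_no_degree_mention", "job_health_insurance")
LITERAL_COLUMNS = ("job_skills", "job_type_skills")
FLOAT_COLUMNS = ("salary_year_avg", "salary_hour_avg")


def normalize_row(row):
    """Return a copy of `row` with string-encoded fields converted to Python types.

    Assumption: values may arrive as strings (e.g. from a CSV export) and
    missing values are empty strings. Non-string values are left untouched.
    """
    out = dict(row)
    for col in BOOL_COLUMNS:
        value = out.get(col)
        if isinstance(value, str):
            # Anything other than "true" (case-insensitive) is treated as False.
            out[col] = value.strip().lower() == "true"
    for col in LITERAL_COLUMNS:
        value = out.get(col)
        if isinstance(value, str):
            # literal_eval only parses Python literals; it does not run code.
            out[col] = ast.literal_eval(value) if value.strip() else None
    for col in FLOAT_COLUMNS:
        value = out.get(col)
        if isinstance(value, str):
            out[col] = float(value) if value.strip() else None
    return out


# Hypothetical row, invented for illustration only (not taken from the dataset).
example = {
    "job_title_short": "Data Analyst",
    "job_work_from_home": "True",
    "job_no_degree_mention": "False",
    "job_health_insurance": "False",
    "salary_year_avg": "",
    "salary_hour_avg": "45.5",
    "job_skills": "['sql', 'python']",
    "job_type_skills": "{'programming': ['sql', 'python']}",
}

clean = normalize_row(example)
print(clean["job_skills"])
```

If the values already arrive typed (for instance as real booleans and lists), the function leaves them unchanged, so it should be safe to apply in either case.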

License: Apache 2.0
Provider:
atharva9967



