idhikavaidya/data_jobs
Source: Hugging Face. Updated 2026-03-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/idhikavaidya/data_jobs
Resource description:
---
license: apache-2.0
---
# 🧠 data_jobs Dataset
A dataset of real-world data analytics job postings from 2023, collected and processed by Luke Barousse.
## Background
I've been collecting data on data job postings since 2022, using a bot to scrape them from Google. The postings come from a variety of sources.
You can find the full dataset at my app [datanerd.tech](https://datanerd.tech).
> [Serpapi](https://serpapi.com/) has kindly supported my work by providing me access to their API. Tell them I sent you and get 20% off paid plans.
## 📘 Data Dictionary
| Column Name | Description | Type | Source |
|-------------------------|-----------------------------------------------------------------------------|--------------|------------------|
| `job_title_short` | Cleaned/standardized job title using BERT model (10-class classification) | Calculated | From `job_title` |
| `job_title` | Full original job title as scraped | Raw | Scraped |
| `job_location` | Location string shown in job posting | Raw | Scraped |
| `job_via` | Platform the job was posted on (e.g., LinkedIn, Jobijoba) | Raw | Scraped |
| `job_schedule_type` | Type of schedule (Full-time, Part-time, Contractor, etc.) | Raw | Scraped |
| `job_work_from_home` | Whether the job is remote (`true`/`false`) | Boolean | Parsed |
| `search_location` | Location used by the bot to generate search queries | Generated | Bot logic |
| `job_posted_date` | Date and time when job was posted | Raw | Scraped |
| `job_no_degree_mention` | Whether the posting explicitly mentions no degree is required | Boolean | Parsed |
| `job_health_insurance` | Whether the job mentions health insurance | Boolean | Parsed |
| `job_country` | Country extracted from job location | Calculated | Parsed |
| `salary_rate` | Indicates if salary is annual or hourly | Raw | Scraped |
| `salary_year_avg` | Average yearly salary (calculated from salary ranges when available) | Calculated | Derived |
| `salary_hour_avg` | Average hourly salary (same logic as yearly) | Calculated | Derived |
| `company_name` | Company name listed in job posting | Raw | Scraped |
| `job_skills` | List of relevant skills extracted from job posting using PySpark | Parsed List | NLP Extracted |
| `job_type_skills` | Dictionary mapping skill types (e.g., 'cloud', 'libraries') to skill sets | Parsed Dict | NLP Extracted |
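
The data dictionary describes `job_skills` as a parsed list and `job_type_skills` as a parsed dictionary. If the data is exported to CSV, those columns may arrive as plain strings. The sketch below shows one way to convert them back into Python objects using only the standard library. It assumes the values are serialized as Python literals; that assumption, the two sample rows (invented for illustration, not taken from the dataset), and the helper names `parse_literal` and `parse_row` are all hypothetical.

```python
import ast
import csv
import io

# Hypothetical two-row sample using column names from the data dictionary.
# The values are invented for illustration and are not taken from the dataset.
SAMPLE_CSV = '''job_title_short,job_work_from_home,salary_year_avg,job_skills,job_type_skills
Data Analyst,True,90000.0,"['sql', 'python']","{'programming': ['sql', 'python']}"
Data Engineer,False,,,
'''


def parse_literal(value, default):
    """Parse a Python-literal string such as "['sql']"; return default if empty."""
    if value is None or value.strip() == "":
        return default
    # literal_eval only accepts literals, so it does not execute arbitrary code.
    return ast.literal_eval(value)


def parse_row(row):
    """Convert the string fields of one CSV row into typed Python values."""
    salary = row["salary_year_avg"]
    return {
        "job_title_short": row["job_title_short"],
        "job_work_from_home": row["job_work_from_home"].strip().lower() == "true",
        "salary_year_avg": float(salary) if salary else None,
        "job_skills": parse_literal(row["job_skills"], []),
        "job_type_skills": parse_literal(row["job_type_skills"], {}),
    }


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for r in rows:
    print(r["job_title_short"], r["job_skills"], r["salary_year_avg"])
```

Missing values are mapped to an empty list, an empty dictionary, or `None` so that downstream code can iterate over `job_skills` without first checking for empty strings. If the real file encodes these columns differently (for example as JSON), `json.loads` would be the closer fit.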
Provider: idhikavaidya



