Name: idhikavaidya/data_jobs
Creator: idhikavaidya
Published: 2026-03-28 02:41:54
License: No description provided

Download link:

https://hf-mirror.com/datasets/idhikavaidya/data_jobs


Resource description:

---
license: apache-2.0
---

# 🧠 data_jobs Dataset

A dataset of real-world data analytics job postings from 2023, collected and processed by Luke Barousse.

## Background

I've been collecting data on data job postings since 2022. I've been using a bot to scrape the data from Google, which come from a variety of sources. You can find the full dataset at my app [datanerd.tech](https://datanerd.tech).

> [Serpapi](https://serpapi.com/) has kindly supported my work by providing me access to their API. Tell them I sent you and get 20% off paid plans.

## 📘 Data Dictionary

| Column Name             | Description                                                                 | Type         | Source           |
|-------------------------|-----------------------------------------------------------------------------|--------------|------------------|
| `job_title_short`       | Cleaned/standardized job title using BERT model (10-class classification)   | Calculated   | From `job_title` |
| `job_title`             | Full original job title as scraped                                          | Raw          | Scraped          |
| `job_location`          | Location string shown in job posting                                        | Raw          | Scraped          |
| `job_via`               | Platform the job was posted on (e.g., LinkedIn, Jobijoba)                   | Raw          | Scraped          |
| `job_schedule_type`     | Type of schedule (Full-time, Part-time, Contractor, etc.)                   | Raw          | Scraped          |
| `job_work_from_home`    | Whether the job is remote (`true`/`false`)                                  | Boolean      | Parsed           |
| `search_location`       | Location used by the bot to generate search queries                         | Generated    | Bot logic        |
| `job_posted_date`       | Date and time when job was posted                                           | Raw          | Scraped          |
| `job_no_degree_mention` | Whether the posting explicitly mentions no degree is required               | Boolean      | Parsed           |
| `job_health_insurance`  | Whether the job mentions health insurance                                   | Boolean      | Parsed           |
| `job_country`           | Country extracted from job location                                         | Calculated   | Parsed           |
| `salary_rate`           | Indicates if salary is annual or hourly                                     | Raw          | Scraped          |
| `salary_year_avg`       | Average yearly salary (calculated from salary ranges when available)        | Calculated   | Derived          |
| `salary_hour_avg`       | Average hourly salary (same logic as yearly)                                | Calculated   | Derived          |
| `company_name`          | Company name listed in job posting                                          | Raw          | Scraped          |
| `job_skills`            | List of relevant skills extracted from job posting using PySpark            | Parsed List  | NLP Extracted    |
| `job_type_skills`       | Dictionary mapping skill types (e.g., 'cloud', 'libraries') to skill sets   | Parsed Dict  | NLP Extracted    |
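
The `job_skills` and `job_type_skills` columns are described above as a parsed list and a parsed dict. A minimal sketch of how they might be consumed follows, assuming (this is not stated in the source) that the values arrive as stringified Python literals such as `"['sql', 'python']"`. The helper names `parse_literal` and `count_skills` and the sample rows are hypothetical and only illustrate the column layout.

```python
import ast
from collections import Counter


def parse_literal(value):
    """Parse a stringified Python list/dict.

    Already-parsed lists/dicts and None pass through unchanged;
    anything that cannot be parsed yields None.
    """
    if value is None or isinstance(value, (list, dict)):
        return value
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return None


def count_skills(rows):
    """Count how often each skill appears in the `job_skills` column."""
    counts = Counter()
    for row in rows:
        skills = parse_literal(row.get("job_skills"))
        if isinstance(skills, list):
            counts.update(skills)
    return counts


# Hypothetical sample rows (not real data from the dataset).
sample_rows = [
    {
        "job_title_short": "Data Analyst",
        "job_skills": "['sql', 'python']",
        "job_type_skills": "{'programming': ['sql', 'python']}",
    },
    {
        "job_title_short": "Data Engineer",
        "job_skills": "['python', 'aws']",
        "job_type_skills": "{'programming': ['python'], 'cloud': ['aws']}",
    },
    {
        "job_title_short": "Data Analyst",
        "job_skills": None,  # postings without extracted skills
        "job_type_skills": None,
    },
]

skill_counts = count_skills(sample_rows)
print(skill_counts.most_common(3))
```

Real rows could be obtained with the Hugging Face `datasets` library, for example `load_dataset("idhikavaidya/data_jobs")`, assuming the library is installed; the available split names are not given in the source, so check them before indexing. If the columns turn out to be stored as native lists and dicts, `parse_literal` simply returns them unchanged.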


Use cases: