
atharva9967/data_jobs

Source: Hugging Face. Updated 2025-12-03; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/atharva9967/data_jobs
Description:
---
license: apache-2.0
---

# 🧠 data_jobs Dataset

A dataset of real-world data analytics job postings from 2023, collected and processed by Luke Barousse.

## Background

I've been collecting data on data job postings since 2022. I've been using a bot to scrape the data from Google, which come from a variety of sources. You can find the full dataset at my app [datanerd.tech](https://datanerd.tech).

> [Serpapi](https://serpapi.com/) has kindly supported my work by providing me access to their API. Tell them I sent you and get 20% off paid plans.

## 📘 Data Dictionary

| Column Name | Description | Type | Source |
|-------------------------|-----------------------------------------------------------------------------|--------------|------------------|
| `job_title_short` | Cleaned/standardized job title using BERT model (10-class classification) | Calculated | From `job_title` |
| `job_title` | Full original job title as scraped | Raw | Scraped |
| `job_location` | Location string shown in job posting | Raw | Scraped |
| `job_via` | Platform the job was posted on (e.g., LinkedIn, Jobijoba) | Raw | Scraped |
| `job_schedule_type` | Type of schedule (Full-time, Part-time, Contractor, etc.) | Raw | Scraped |
| `job_work_from_home` | Whether the job is remote (`true`/`false`) | Boolean | Parsed |
| `search_location` | Location used by the bot to generate search queries | Generated | Bot logic |
| `job_posted_date` | Date and time when job was posted | Raw | Scraped |
| `job_no_degree_mention` | Whether the posting explicitly mentions no degree is required | Boolean | Parsed |
| `job_health_insurance` | Whether the job mentions health insurance | Boolean | Parsed |
| `job_country` | Country extracted from job location | Calculated | Parsed |
| `salary_rate` | Indicates if salary is annual or hourly | Raw | Scraped |
| `salary_year_avg` | Average yearly salary (calculated from salary ranges when available) | Calculated | Derived |
| `salary_hour_avg` | Average hourly salary (same logic as yearly) | Calculated | Derived |
| `company_name` | Company name listed in job posting | Raw | Scraped |
| `job_skills` | List of relevant skills extracted from job posting using PySpark | Parsed List | NLP Extracted |
| `job_type_skills` | Dictionary mapping skill types (e.g., 'cloud', 'libraries') to skill sets | Parsed Dict | NLP Extracted |
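The data dictionary above mixes booleans, timestamps, optional salaries, and list/dict columns, so a row usually needs some light type conversion before analysis. The sketch below shows one way to do that with the Python standard library only. It rests on several assumptions that the card does not confirm: that `job_skills` and `job_type_skills` arrive as stringified Python literals (as they often do after a CSV export), that `job_posted_date` uses the `YYYY-MM-DD HH:MM:SS` format, and that missing salaries are empty strings. The sample row and the helper names `parse_literal` and `clean_row` are hypothetical and exist only for illustration.

```python
import ast
from datetime import datetime

# Hypothetical sample row: the values are invented for illustration,
# only the column names come from the data dictionary.
sample_row = {
    "job_title_short": "Data Analyst",
    "job_posted_date": "2023-06-16 13:44:15",  # assumed timestamp format
    "job_work_from_home": "true",
    "salary_year_avg": "",                      # assumed: empty when no salary is listed
    "job_skills": "['sql', 'python']",          # assumed: stringified list
    "job_type_skills": "{'programming': ['sql', 'python']}",  # assumed: stringified dict
}


def parse_literal(value, default):
    """Parse a stringified list/dict; fall back to `default` if empty or malformed."""
    if value is None or value == "":
        return default
    if not isinstance(value, str):
        return value  # already a list/dict, nothing to do
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return default


def clean_row(row):
    """Return a copy of `row` with typed values for the columns handled here."""
    cleaned = dict(row)
    cleaned["job_posted_date"] = datetime.strptime(
        row["job_posted_date"], "%Y-%m-%d %H:%M:%S"
    )
    cleaned["job_work_from_home"] = str(row["job_work_from_home"]).lower() == "true"
    salary = row["salary_year_avg"]
    cleaned["salary_year_avg"] = float(salary) if salary not in ("", None) else None
    cleaned["job_skills"] = parse_literal(row["job_skills"], [])
    cleaned["job_type_skills"] = parse_literal(row["job_type_skills"], {})
    return cleaned


cleaned = clean_row(sample_row)
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, which matters when the strings come from scraped data. If the columns are already typed in the file you download, `parse_literal` passes them through unchanged.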

Provider:
atharva9967