akzaidan/Job_Data_Parser

Name: akzaidan/Job_Data_Parser
Creator: akzaidan
Published: 2026-03-06 22:51:32
License: 暂无描述

Hugging Face2026-03-06 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/akzaidan/Job_Data_Parser

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: multilingual license: mit tags: - text-classification - job-posting - seniority - experience - salary - tabular - pandas task_categories: - text-classification pretty_name: Job Classification Dataset --- # Job Classification Dataset A dataset for classifying job postings by **expected years of experience** and **expected annual salary** (USD). Designed for training or evaluating models on seniority and compensation prediction from job description text. ## Dataset Description ### Dataset Summary This dataset contains job postings with: - **text**: Job posting content in the format `[LOCATION] ... [TITLE]: ... [DESC]: ...` (title and description concatenated) - **expected_experience_years**: Required years of experience (integer 0–20) - **expected_salary**: Expected annual salary in USD Missing values use `-1` as the sentinel. Text is truncated to 3,500 characters before labeling. ### Data Splits - `train.parquet`: Training data in Parquet format (~750,000 labeled rows) ### Data Fields | Column | Type | Description | |----------------------------|--------|--------------------------------------------------------------| | `text` | string | Job posting text: `[TITLE]: ... [DESC]: ...` | | `expected_experience_years`| int64 | Required years of experience (0–20); `-1` if missing | | `expected_salary` | int64 | Expected annual salary (USD); `-1` if missing | ## Dataset Creation Labels were produced programmatically using large language models (LLMs), not human annotation. The labeling pipeline: 1. **Source**: Parquet rows where any of `expected_experience_years` or `expected_salary` was missing (`-1` or `NaN`) 2. **Models**: GPT-4o-mini (75%) and Grok 4 fast (25%), temperature 0 3. **Tasks**: - **Years only** (when salary was already valid): Predict a single integer (0–20) for experience - **Years and salary** (when salary missing): Predict JSON with `years` and `expected_salary` 4. **Parsing**: Regex for years-only; JSON parsing for years+salary, with markdown code blocks stripped 5. **Retries**: Up to 4 attempts per row; rate limits handled with backoff Rules used in the prompts: - Explicit year mentions take priority over inferred seniority - In ambiguous cases, the model guesses the most likely number of years - Salary is an annual USD figure ## Intended Uses - Training classifiers to predict job seniority from text - Training or fine-tuning models for salary estimation from job postings - Benchmarking NLP models on structured information extraction from job ads - Research on labor market and compensation prediction ## Limitations - Labels are **model-generated**, not human-verified; they may reflect model biases and errors - Text is English-only (filtered during preprocessing) - Salary figures are annual USD; other currencies and payment types are not supported - Experience years are bucketed 0–20; "20+" is not distinguished - Some rows may remain with `-1` where parsing failed or retries exhausted ## Bias Considerations - LLM outputs can reproduce biases in training data (e.g., gender, industry, geography) - Salary predictions may reflect historical disparities and stereotypes - Job titles and wording may introduce selection bias - Use with caution in downstream applications involving hiring or compensation decisions ## Licensing This dataset is available under the MIT license. ## How to Load ```python import pandas as pd df = pd.read_parquet("train_data/train.parquet") # Filter to labeled rows labeled = df[df["expected_experience_years"] >= 0] ``` For Hugging Face Datasets: ```python from datasets import Dataset df = pd.read_parquet("train_data/train.parquet") dataset = Dataset.from_pandas(df) ```

提供机构：

akzaidan

5,000+

优质数据集

54 个

任务类型

进入经典数据集