gefeb28424/job-titles
Source: Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/gefeb28424/job-titles
Resource description:
---
license: cc-by-4.0
language:
- en
size_categories:
- 10K<n<100K
task_categories:
- text-classification
- feature-extraction
pretty_name: Comprehensive Job Titles Dataset
tags:
- jobs
- occupations
- employment
- career
- human-resources
---
# Comprehensive Job Titles Dataset
A deduplicated dataset of 65,248 unique job titles compiled from three occupational classification systems: ESCO (European Skills, Competences, Qualifications and Occupations), O*NET (Occupational Information Network), and OSCA (Occupation Standard Classification for Australia).
## Dataset Description
This dataset provides a collection of job titles from which duplicates and near-duplicates have been removed using semantic similarity matching. It is intended as a resource for:
- **Job matching and recommendation systems**
- **Resume parsing and analysis**
- **Labor market research**
- **Career counseling applications**
- **HR technology development**
- **Natural language processing tasks related to employment**
## Dataset Structure
The dataset is provided in Parquet format with a single column:
- `job_title` (string): The standardized job title
### Example entries:
```
.NET Developer
2D Animation Artist
Accounting Clerk
Administrative Assistant
Agricultural Engineer
AI Research Scientist
Business Analyst
Chef
Data Scientist
```
## Sources
The dataset combines job titles from three major occupational classification systems:
1. **ESCO v1.2.0** (European Commission)
   - ~33,000 occupations with multilingual support
   - Includes preferred labels, alternative labels, and hidden labels
   - Structured according to ISCO-08 classification
2. **O*NET Database v29.3** (U.S. Department of Labor)
   - ~1,000 detailed occupational descriptions
   - Comprehensive taxonomy of U.S. occupations
   - Includes detailed job characteristics and requirements
3. **OSCA** (Australian Government)
   - Australian occupational classifications
   - Principal titles, alternative titles, and specializations
## Processing Pipeline
### 1. Extraction
Job titles were extracted from multiple source files:
- ESCO: Preferred labels and alternative labels from `occupations_en.csv`
- O*NET: Occupation titles from `Occupation Data.txt`
- OSCA: Principal titles and alternative titles from Excel files
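As a rough illustration of the ESCO step, the sketch below collects preferred and alternative labels from a CSV file object. The column names `preferredLabel` and `altLabels`, and the assumption that alternative labels are newline-separated inside a single cell, are assumptions about the ESCO CSV export that this card does not confirm. The function name `extract_esco_titles` is hypothetical.

```python
import csv
import io


def extract_esco_titles(csv_file):
    """Collect preferred and alternative labels from an ESCO-style CSV.

    Assumes columns named 'preferredLabel' and 'altLabels', with
    alternative labels separated by newlines inside one cell.
    """
    titles = []
    for row in csv.DictReader(csv_file):
        preferred = (row.get("preferredLabel") or "").strip()
        if preferred:
            titles.append(preferred)
        for alt in (row.get("altLabels") or "").splitlines():
            alt = alt.strip()
            if alt:
                titles.append(alt)
    return titles


# Tiny in-memory stand-in for occupations_en.csv (made-up rows).
sample = io.StringIO(
    "preferredLabel,altLabels\n"
    'data scientist,"data expert\ndata research scientist"\n'
    "chef,\n"
)
sample_titles = extract_esco_titles(sample)
print(sample_titles)
```

For a real file, opening it with `open(path, newline="", encoding="utf-8")` lets the `csv` module handle the multi-line cells correctly.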
### 2. Deduplication
Near-duplicate titles were removed with embedding-based similarity matching:
- **Embedding Model**: `sentence-transformers/all-mpnet-base-v2`
- **Similarity Threshold**: 0.85 (cosine similarity)
- **Strategy**: Length-based blocking for efficiency
- **Preference**: Shorter titles retained (typically more general/common)
The deduplication process identified semantically similar job titles such as:
- "Software Developer" and "Software Engineer"
- "Administrative Assistant" and "Admin Assistant"
- "Customer Service Representative" and "Customer Service Rep"
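The settings listed above can be sketched as a greedy pass over titles sorted by length. This is one possible reading, not the pipeline actually used: the card does not say how length-based blocking was implemented, so the `max_len_gap` window (compare only titles whose character lengths differ by at most that amount) is a hypothetical parameter, and the toy vectors stand in for `all-mpnet-base-v2` embeddings so the example runs without downloading a model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedupe_titles(titles, embed, threshold=0.85, max_len_gap=15):
    """Greedy near-duplicate removal that keeps the shorter title.

    `embed` maps a title to a vector. `max_len_gap` is an assumed
    blocking rule: only titles of similar length are compared.
    """
    kept = []  # (title, vector) pairs, shortest first
    for title in sorted(set(titles), key=lambda t: (len(t), t)):
        vec = embed(title)
        is_duplicate = any(
            len(title) - len(kept_title) <= max_len_gap
            and cosine(vec, kept_vec) >= threshold
            for kept_title, kept_vec in kept
        )
        if not is_duplicate:
            kept.append((title, vec))
    return [title for title, _ in kept]


# Made-up 3-dimensional vectors standing in for real sentence embeddings.
toy_vectors = {
    "Software Engineer": (1.0, 0.1, 0.0),
    "Software Developer": (1.0, 0.2, 0.0),
    "Chef": (0.0, 0.0, 1.0),
}
deduped = dedupe_titles(list(toy_vectors), toy_vectors.get)
print(deduped)  # ['Chef', 'Software Engineer']
```

Because titles are visited shortest first, the shorter member of each similar pair is the one retained, which matches the stated preference. With a real model, `embed` could wrap `model.encode`, and encoding all titles in one batch would be far faster than one call per title.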
### 3. Quality Control
- Removed exact duplicates (case-insensitive)
- Filtered out malformed entries
- Standardized formatting and capitalization
- Preserved diversity while eliminating redundancy
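A minimal sketch of the first two checks follows. The definition of "malformed" used here (empty after whitespace cleanup, or containing no letter) is an assumption made for illustration, and `clean_titles` is a hypothetical name. The card does not specify the actual rules, and capitalization standardization is not attempted.

```python
import re


def clean_titles(raw_titles):
    """Normalize whitespace, drop malformed entries, and remove
    exact duplicates case-insensitively (first occurrence wins)."""
    seen = set()
    cleaned = []
    for raw in raw_titles:
        title = re.sub(r"\s+", " ", raw).strip()
        # Assumed rule: empty strings and strings without letters are malformed.
        if not title or not re.search(r"[A-Za-z]", title):
            continue
        key = title.casefold()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(title)
    return cleaned


cleaned_sample = clean_titles(
    ["Chef", "chef ", "  Data   Scientist", "", "12345", ".NET Developer"]
)
print(cleaned_sample)  # ['Chef', 'Data Scientist', '.NET Developer']
```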
## Statistics
- **Total unique job titles**: 65,248
- **Original titles before deduplication**: more than 100,000 (approximate)
- **Reduction rate**: ~35% (semantic duplicates removed)
- **File size**: 756.7 KB (Parquet format with Snappy compression)
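The reduction rate follows from the two counts above. The check below assumes a round 100,000 starting titles, which the card gives only approximately, so the result is indicative.

```python
unique_titles = 65_248
approx_original = 100_000  # assumed round figure; the card gives only an approximate count

reduction_rate = 1 - unique_titles / approx_original
print(f"{reduction_rate:.1%}")  # 34.8%
```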
## Use Cases
### 1. Job Search and Matching
```python
import pandas as pd

# Load the dataset (adjust the path to wherever the Parquet file was saved)
df = pd.read_parquet('jobs.parquet')

# Case-insensitive search for data-related jobs
data_jobs = df[df['job_title'].str.contains('data', case=False, na=False)]
```
### 2. Building Job Title Embeddings
```python
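import pandas as pd
from sentence_transformers import SentenceTransformer

# Load the titles and encode them with the model used for deduplication
df = pd.read_parquet('jobs.parquet')
model = SentenceTransformer('all-mpnet-base-v2')
job_titles = df['job_title'].tolist()
embeddings = model.encode(job_titles)  # NumPy array, one row per title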
```
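Once embeddings exist, a common next step is ranking titles by cosine similarity to a query vector. The sketch below shows that ranking in plain Python with made-up 2-dimensional vectors so it runs without a model. `most_similar` is a hypothetical helper, and in practice the query vector would come from `model.encode` on the query text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_vec, titles, vectors, k=3):
    """Return the k (title, score) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), title) for title, vec in zip(titles, vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(title, score) for score, title in scored[:k]]


# Made-up vectors standing in for real embeddings.
toy_titles = ["Data Scientist", "Data Analyst", "Chef"]
toy_embeddings = [(0.9, 0.1), (0.8, 0.3), (0.0, 1.0)]
top = most_similar((1.0, 0.0), toy_titles, toy_embeddings, k=2)
print([title for title, _ in top])  # ['Data Scientist', 'Data Analyst']
```

For tens of thousands of titles, a vectorized NumPy computation or an approximate nearest-neighbor index would be a more practical choice than this loop.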
### 3. Job Title Standardization
Use this dataset as a reference for standardizing job titles in your organization or application.
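One lightweight way to do this is fuzzy string matching against the reference list. The sketch below uses the standard library's `difflib.get_close_matches`, which compares characters and not meaning, so it catches typos and minor variants but not synonyms. The `standardize_title` name and the `cutoff` value are assumptions chosen for illustration.

```python
import difflib


def standardize_title(raw_title, reference_titles, cutoff=0.6):
    """Map a free-text title to its closest reference title, or None.

    Matching is case-insensitive and character-based.
    """
    lookup = {title.casefold(): title for title in reference_titles}
    matches = difflib.get_close_matches(
        raw_title.casefold(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


reference = ["Data Scientist", "Business Analyst", "Chef"]
print(standardize_title("data scientst", reference))  # Data Scientist
print(standardize_title("zzzz", reference))           # None
```

An embedding-based lookup would handle synonyms better, at the cost of loading a model.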
## Limitations
- **Language**: English only (though source data includes multilingual options)
- **Geographic bias**: Stronger coverage of European, U.S., and Australian job markets
- **Temporal**: Reflects job titles as of 2025; emerging roles may not be included
- **Granularity**: Some highly specific or niche job titles may have been merged during deduplication
## License
The compiled dataset is released under CC BY 4.0, as declared in the metadata above. It combines data from multiple sources, each with its own license:
- ESCO: European Union Public License (EUPL)
- O*NET: Public domain (U.S. Government work)
- OSCA: Creative Commons Attribution 3.0 Australia
Review the original source licenses before any commercial use.
## Citation
If you use this dataset in your research or applications, please cite:
```bibtex
@dataset{jobs_dataset_2025,
  author = {Greg Priday},
  title = {Comprehensive Job Titles Dataset},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/gpriday/jobs}
}
```
## Acknowledgments
This dataset builds upon the excellent work of:
- European Commission (ESCO)
- U.S. Department of Labor (O*NET)
- Australian Government (OSCA)
## Contact
For questions, suggestions, or contributions, please open an issue on the dataset repository.
Provider:
gefeb28424



