
gefeb28424/job-titles

Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/gefeb28424/job-titles
Resource description:
---
license: cc-by-4.0
language:
- en
size_categories:
- 10K<n<100K
task_categories:
- text-classification
- feature-extraction
pretty_name: Comprehensive Job Titles Dataset
tags:
- jobs
- occupations
- employment
- career
- human-resources
---

# Comprehensive Job Titles Dataset

A deduplicated dataset of 65,248 unique job titles compiled from three authoritative sources: ESCO (European Skills, Competences, Qualifications and Occupations), O*NET (Occupational Information Network), and OSCA (Occupation Standard Classification for Australia).

## Dataset Description

This dataset is a collection of job titles from which exact duplicates and near-duplicates have been removed using semantic similarity matching. It is intended as a resource for:

- **Job matching and recommendation systems**
- **Resume parsing and analysis**
- **Labor market research**
- **Career counseling applications**
- **HR technology development**
- **Natural language processing tasks related to employment**

## Dataset Structure

The dataset is provided in Parquet format with a single column:

- `job_title` (string): The standardized job title

### Example entries:

```
.NET Developer
2D Animation Artist
Accounting Clerk
Administrative Assistant
Agricultural Engineer
AI Research Scientist
Business Analyst
Chef
Data Scientist
```

## Sources

The dataset combines job titles from three major occupational classification systems:

1. **ESCO v1.2.0** (European Commission)
   - ~33,000 occupations with multilingual support
   - Includes preferred labels, alternative labels, and hidden labels
   - Structured according to the ISCO-08 classification
2. **O*NET Database v29.3** (U.S. Department of Labor)
   - ~1,000 detailed occupational descriptions
   - Comprehensive taxonomy of U.S. occupations
   - Includes detailed job characteristics and requirements
3. **OSCA** (Australian Government)
   - Australian occupational classifications
   - Principal titles, alternative titles, and specializations

## Processing Pipeline

### 1. Extraction

Job titles were extracted from multiple source files:

- ESCO: Preferred labels and alternative labels from `occupations_en.csv`
- O*NET: Occupation titles from `Occupation Data.txt`
- OSCA: Principal titles and alternative titles from Excel files

### 2. Deduplication

Near-duplicate titles were removed by comparing sentence embeddings:

- **Embedding Model**: `sentence-transformers/all-mpnet-base-v2`
- **Similarity Threshold**: 0.85 (cosine similarity)
- **Strategy**: Length-based blocking for efficiency
- **Preference**: Shorter titles retained (typically more general/common)

The deduplication process identified semantically similar job titles such as:

- "Software Developer" and "Software Engineer"
- "Administrative Assistant" and "Admin Assistant"
- "Customer Service Representative" and "Customer Service Rep"

### 3. Quality Control

- Removed exact duplicates (case-insensitive)
- Filtered out malformed entries
- Standardized formatting and capitalization
- Preserved diversity while eliminating redundancy

## Statistics

- **Total unique job titles**: 65,248
- **Original titles before deduplication**: ~100,000+
- **Reduction rate**: ~35% (semantic duplicates removed)
- **File size**: 756.7 KB (Parquet format with Snappy compression)

## Use Cases

### 1. Job Search and Matching

```python
import pandas as pd

# Load the dataset
df = pd.read_parquet('jobs.parquet')

# Search for data-related jobs (case-insensitive substring match)
data_jobs = df[df['job_title'].str.contains('Data', case=False)]
```

### 2. Building Job Title Embeddings

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# Same model family that was used for deduplication
model = SentenceTransformer('all-mpnet-base-v2')
job_titles = pd.read_parquet('jobs.parquet')['job_title'].tolist()
embeddings = model.encode(job_titles)
```

### 3. Job Title Standardization

Use this dataset as a reference for standardizing job titles in your organization or application.
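
One way to use the dataset as a standardization reference is to map each raw title to its closest entry. The sketch below is a minimal illustration and not part of the dataset's own tooling. It relies on character-level similarity from Python's standard `difflib` module rather than the embedding model used for deduplication. The `standardize` helper, the `cutoff` value, and the short `reference` list are assumptions made for this example. In practice `reference` would be the full `job_title` column.

```python
from difflib import get_close_matches


def standardize(raw_title, reference_titles, cutoff=0.6):
    """Return the closest reference title, or None if nothing clears `cutoff`.

    Hypothetical helper: matching is case-insensitive and uses difflib's
    character-level similarity ratio, not semantic embeddings.
    """
    # Map lowercased titles back to their canonical spelling.
    canonical = {title.lower(): title for title in reference_titles}
    matches = get_close_matches(
        raw_title.strip().lower(), list(canonical), n=1, cutoff=cutoff
    )
    return canonical[matches[0]] if matches else None


# Stand-in for df['job_title'].tolist(); entries come from the example rows.
reference = ["Data Scientist", "Business Analyst", "Administrative Assistant", "Chef"]

print(standardize("data scientst", reference))  # misspelled input
print(standardize("Astronaut", reference))      # no close reference title
```

String similarity catches typos and casing differences but not synonyms such as "Software Developer" and "Software Engineer". For those, comparing embeddings with a cosine threshold is the closer analogue of how the dataset itself was deduplicated.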

## Limitations

- **Language**: English only (though source data includes multilingual options)
- **Geographic bias**: Stronger coverage of European, U.S., and Australian job markets
- **Temporal**: Reflects job titles as of 2025; emerging roles may not be included
- **Granularity**: Some highly specific or niche job titles may have been merged during deduplication

## License

This dataset combines data from multiple sources, each with their own licensing:

- ESCO: European Union Public License (EUPL)
- O*NET: Public domain (U.S. Government work)
- OSCA: Creative Commons Attribution 3.0 Australia

Please review the original source licenses for commercial use.

## Citation

If you use this dataset in your research or applications, please cite:

```bibtex
@dataset{jobs_dataset_2025,
  author = {Greg Priday},
  title = {Comprehensive Job Titles Dataset},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/gpriday/jobs}
}
```

## Acknowledgments

This dataset builds upon the excellent work of:

- European Commission (ESCO)
- U.S. Department of Labor (O*NET)
- Australian Government (OSCA)

## Contact

For questions, suggestions, or contributions, please open an issue on the dataset repository.
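
As a supplementary note on the deduplication step described under Processing Pipeline: the stated rules (cosine similarity threshold of 0.85, shorter titles preferred) can be sketched as a greedy pass that visits titles from shortest to longest and keeps a title only if it is not too similar to one already kept. This is an illustrative reconstruction under stated assumptions, not the author's actual pipeline. The `cosine` and `dedupe` helpers are hypothetical, length-based blocking is omitted, and the hand-written 2-D vectors stand in for `all-mpnet-base-v2` embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dedupe(titles, embed, threshold=0.85):
    """Greedy near-duplicate removal that prefers shorter titles.

    `embed` is any callable mapping a title to a vector (an assumption;
    the dataset card names a sentence-transformers model for this role).
    """
    kept, kept_vecs = [], []
    # Shortest first, so the shorter member of a near-duplicate pair survives.
    for title in sorted(titles, key=len):
        vec = embed(title)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(title)
            kept_vecs.append(vec)
    return kept


# Toy vectors standing in for real sentence embeddings.
toy_vectors = {
    "Customer Service Rep": (1.0, 0.1),
    "Customer Service Representative": (0.98, 0.12),
    "Chef": (0.0, 1.0),
}

print(dedupe(list(toy_vectors), toy_vectors.get))
```

This version compares every candidate against every kept title, which is quadratic in the number of titles. A blocking strategy such as the length-based one the card mentions would reduce the number of comparisons by restricting which pairs are checked.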

Provider:
gefeb28424