
CareerCorpus: A Comprehensive Dataset of Annotated Resumes

Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/wzzwn37gmd
Description:
CareerCorpus is a comprehensive dataset of 302 annotated resumes spanning six occupational categories, designed for natural language processing research in automated recruitment and human resource analytics.

DATASET COMPOSITION:
- Total resumes: 302
- Categories: Teacher (50), Finance (50), Apparel (50), Accountant (51), Banking (50), Research Assistant (51)
- Format: Single Excel file (.xlsx) containing all six categories
- Annotation: Dual expert annotations preserved for all resumes

DATA SOURCES:
Resumes collected from (1) Kaggle dataset (LiveCareer.com professionally crafted resumes) for five categories, and (2) LinkedIn public profiles for Research Assistant category. HTML-formatted resumes processed via ChatGPT (GPT-5) for text extraction and standardization.

EXPERT ANNOTATION:
Each resume independently annotated by two domain experts:
- Financial categories (Finance, Accountant, Banking): Certified accountants with 5+ years experience and ICMAB certifications
- Apparel: Textile/fashion industry practitioners
- Academic categories (Teacher, Research Assistant): University lecturers with teaching and research experience

Dual annotations preserved to support soft-label training, annotation confidence modeling, and disagreement-aware evaluation metrics.

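One way the preserved dual annotations could feed soft-label training is sketched below. This is an illustration under stated assumptions, not the authors' method: the description does not state the score scale, so the sketch assumes both expert scores lie in [0, 1], and the names `SoftLabel`, `soft_label`, and `confidence_weight` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftLabel:
    target: float        # mean of the two expert scores
    disagreement: float  # absolute gap between the two scores


def soft_label(score_a: float, score_b: float) -> SoftLabel:
    """Combine two expert scores into one soft training target."""
    return SoftLabel(
        target=(score_a + score_b) / 2.0,
        disagreement=abs(score_a - score_b),
    )


def confidence_weight(label: SoftLabel, max_gap: float = 1.0) -> float:
    """Map disagreement to a weight in [0, 1]; full agreement gives 1.0.

    max_gap is the largest possible gap on the score scale, which is
    ASSUMED here to be 1.0 (scores in [0, 1]).
    """
    if max_gap <= 0:
        raise ValueError("max_gap must be positive")
    return 1.0 - min(label.disagreement, max_gap) / max_gap


# Hypothetical scores for one resume; not taken from the dataset.
example = soft_label(0.8, 0.6)
print(example.target, example.disagreement, confidence_weight(example))
```

The weight could be used to down-weight the loss on resumes where the two experts disagree strongly, which is one reading of "annotation confidence modeling".
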
DATA PREPROCESSING:
- HTML-to-text conversion via AI-assisted summarization
- PII removal and anonymization (names, emails, phone numbers replaced with placeholders)
- Text normalization and standardization
- Duplicate elimination
- Format standardization across all categories

FILE STRUCTURE:
Single Excel workbook containing:
- All 302 resumes across six occupational categories
- Anonymized resume text
- Dual annotation scores from independent experts
- Category labels
- Resume metadata
- Organized in tabular format for easy access and analysis

INTER-ANNOTATOR AGREEMENT:
Pearson correlations range from 0.35-0.89 across categories (Finance: 0.68, Banking: 0.38, Accountant: 0.35, Apparel: 0.89, Teacher: 0.56, Research Assistant: 0.67). Overall mean correlation: 0.59, mean MAE: 0.106, indicating moderate agreement with low scoring error.

RESEARCH APPLICATIONS:
- Resume classification and categorization models
- Automated recruitment system development
- Skill extraction algorithms
- Job-candidate matching systems
- NLP benchmark evaluation
- Recruitment bias and fairness research
- Annotation quality and human-AI collaboration studies

ASSOCIATED PUBLICATION:
This dataset supports the Data in Brief article "CareerCorpus: A Comprehensive Dataset of Annotated Resumes" by Md Sagor Chowdhury, Adiba Fairooz Chowdhury, Ayesha Banu, and Riad Hossain (2025).

LICENSE:
Released under CC-BY-4.0 for open research use with appropriate citation.

CONTACT:
For questions: riad.h@eastdelta.edu.bd
Institution: Department of Computer Science and Engineering, East Delta University, Chattogram, Bangladesh
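The inter-annotator agreement figures above (Pearson correlation and MAE per category) could be recomputed along the following lines. This is a minimal sketch, not the authors' code: it assumes the workbook rows have already been loaded into a list of dicts (for example with `pandas.read_excel`), and the column names `category`, `score_a`, and `score_b` are hypothetical, since the description does not list the actual headers.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one annotator gives constant scores
    return cov / math.sqrt(var_x * var_y)


def mae(xs, ys):
    """Mean absolute difference between the two annotators' scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def agreement_by_category(rows, cat_key="category",
                          a_key="score_a", b_key="score_b"):
    """Group rows by category and report n, Pearson r, and MAE.

    The key names are ASSUMED; adjust them to the real column headers.
    """
    grouped = defaultdict(lambda: ([], []))
    for row in rows:
        scores_a, scores_b = grouped[row[cat_key]]
        scores_a.append(float(row[a_key]))
        scores_b.append(float(row[b_key]))
    return {
        cat: {"n": len(a), "pearson": pearson(a, b), "mae": mae(a, b)}
        for cat, (a, b) in grouped.items()
    }


# Toy rows for illustration only; these are not values from the dataset.
toy_rows = [
    {"category": "Demo", "score_a": 0.1, "score_b": 0.2},
    {"category": "Demo", "score_a": 0.2, "score_b": 0.4},
    {"category": "Demo", "score_a": 0.3, "score_b": 0.6},
]
toy_result = agreement_by_category(toy_rows)

# Per-category correlations as reported in the description above.
reported = {"Finance": 0.68, "Banking": 0.38, "Accountant": 0.35,
            "Apparel": 0.89, "Teacher": 0.56, "Research Assistant": 0.67}
mean_reported = sum(reported.values()) / len(reported)
print(toy_result, round(mean_reported, 2))
```

The unweighted mean of the six reported correlations rounds to 0.59, which matches the stated overall figure; whether the authors averaged per category or over all resumes is not stated, so treating it as an unweighted mean is an assumption.
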
Created: 2025-12-05