CareerCorpus: A Comprehensive Dataset of Annotated Resumes
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/wzzwn37gmd
Description:
CareerCorpus is a comprehensive dataset of 302 annotated resumes spanning six occupational categories designed for natural language processing research in automated recruitment and human resource analytics.
DATASET COMPOSITION:
- Total resumes: 302
- Categories: Teacher (50), Finance (50), Apparel (50), Accountant (51), Banking (50), Research Assistant (51)
- Format: Single Excel file (.xlsx) containing all six categories
- Annotation: Dual expert annotations preserved for all resumes
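With only about 50 resumes per category, a held-out split is usually drawn per category so that no class is under-represented. The following is a minimal sketch of such a stratified split, using only the category counts listed above. The function name `stratified_split`, the 20% test fraction, and the seed are illustrative assumptions, not part of the dataset or its documentation.

```python
import random
from collections import defaultdict

# Category counts as stated in the dataset description.
CATEGORY_COUNTS = {
    "Teacher": 50,
    "Finance": 50,
    "Apparel": 50,
    "Accountant": 51,
    "Banking": 50,
    "Research Assistant": 51,
}


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), sampling each category separately.

    Hypothetical helper: the split ratio and seed are assumptions.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for index, label in enumerate(labels):
        by_category[label].append(index)

    train, test = [], []
    for label in sorted(by_category):
        indices = by_category[label][:]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Stand-in label column built from the counts; real labels would come from the workbook.
labels = [cat for cat, n in CATEGORY_COUNTS.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels)
print(len(labels), len(train_idx), len(test_idx))
```

The label list here is synthetic; with the real workbook, the category column would be passed in its place.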
DATA SOURCES:
Resumes were collected from (1) a Kaggle dataset of professionally crafted LiveCareer.com resumes, for five categories, and (2) public LinkedIn profiles, for the Research Assistant category. HTML-formatted resumes were processed with ChatGPT (GPT-5) for text extraction and standardization.
EXPERT ANNOTATION:
Each resume was independently annotated by two domain experts:
- Financial categories (Finance, Accountant, Banking): Certified accountants with 5+ years of experience and ICMAB certification
- Apparel: Textile/fashion industry practitioners
- Academic categories (Teacher, Research Assistant): University lecturers with teaching and research experience
Both annotations are preserved to support soft-label training, annotation confidence modeling, and disagreement-aware evaluation metrics.
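One simple way to use the two preserved annotations is to average them into a soft target and keep their absolute difference as a disagreement signal. The sketch below assumes scores lie on a 0 to 1 scale, which the description does not state explicitly; `soft_label` and the confidence weighting are hypothetical choices for illustration, not the authors' method.

```python
from typing import Tuple


def soft_label(score_a: float, score_b: float) -> Tuple[float, float, float]:
    """Combine two annotator scores into (target, disagreement, weight).

    Assumes both scores are in [0, 1] (an assumption, not documented).
    """
    target = (score_a + score_b) / 2       # soft training target
    disagreement = abs(score_a - score_b)  # 0 means full agreement
    weight = 1.0 - disagreement            # down-weight contested resumes
    return target, disagreement, weight


# Made-up example scores for a single resume.
target, disagreement, weight = soft_label(0.8, 0.6)
print(target, disagreement, weight)
```

The weight could be applied as a per-sample loss weight, or the disagreement value could be reported alongside model error in a disagreement-aware evaluation.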
DATA PREPROCESSING:
- HTML-to-text conversion via AI-assisted summarization
- PII removal and anonymization (names, emails, phone numbers replaced with placeholders)
- Text normalization and standardization
- Duplicate elimination
- Format standardization across all categories
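The placeholder-based anonymization step can be illustrated with a small regex sketch for emails and phone numbers. This is not the pipeline used to build the dataset: the placeholder tokens `[EMAIL]` and `[PHONE]`, the patterns, and the nine-digit threshold are all assumptions, and name removal (which typically needs named-entity recognition) is not covered.

```python
import re

# Hypothetical placeholder tokens; the dataset's actual placeholders are not stated.
EMAIL_TOKEN = "[EMAIL]"
PHONE_TOKEN = "[PHONE]"

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose candidate pattern: digits mixed with spaces, dots, dashes, parentheses.
PHONE_RE = re.compile(r"\+?\(?\d[\d\s().-]{6,}\d")


def _mask_phone(match):
    # Require at least 9 digits so year ranges such as "2015-2019" survive.
    digits = sum(ch.isdigit() for ch in match.group())
    return PHONE_TOKEN if digits >= 9 else match.group()


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub(EMAIL_TOKEN, text)
    return PHONE_RE.sub(_mask_phone, text)


print(mask_pii("Contact: jane.doe@example.com, +880 1711-123456."))
```

Patterns like these over- and under-match on real resumes, so a manual review pass would still be needed.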
FILE STRUCTURE:
Single Excel workbook containing:
- All 302 resumes across six occupational categories
- Anonymized resume text
- Dual annotation scores from independent experts
- Category labels
- Resume metadata
- Organized in tabular format for easy access and analysis
INTER-ANNOTATOR AGREEMENT:
Pearson correlations between the two annotators range from 0.35 to 0.89 across categories (Finance: 0.68, Banking: 0.38, Accountant: 0.35, Apparel: 0.89, Teacher: 0.56, Research Assistant: 0.67). The overall mean correlation is 0.59 and the mean MAE is 0.106, indicating moderate agreement overall with low scoring error, although agreement is noticeably weaker for Banking and Accountant.
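For readers who want to recompute these statistics from the two score columns, the sketch below shows plain-Python versions of the Pearson correlation and the mean absolute error, and checks that the per-category correlations reported above average to 0.59. The function names and the short example score lists are illustrative assumptions; the real inputs would be the two annotators' scores from the workbook.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mae(x, y):
    """Mean absolute difference between the two annotators' scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Per-category correlations as reported in the description.
REPORTED_R = {
    "Finance": 0.68,
    "Banking": 0.38,
    "Accountant": 0.35,
    "Apparel": 0.89,
    "Teacher": 0.56,
    "Research Assistant": 0.67,
}
mean_r = sum(REPORTED_R.values()) / len(REPORTED_R)
print(round(mean_r, 2))

# Made-up scores for illustration only, not taken from the dataset.
annotator_1 = [0.5, 0.7, 0.9, 0.4]
annotator_2 = [0.6, 0.5, 0.8, 0.4]
print(pearson(annotator_1, annotator_2), mae(annotator_1, annotator_2))
```

Reporting both measures is useful because they capture different things: correlation reflects whether the annotators rank resumes similarly, while MAE reflects how far apart their absolute scores are.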
RESEARCH APPLICATIONS:
- Resume classification and categorization models
- Automated recruitment system development
- Skill extraction algorithms
- Job-candidate matching systems
- NLP benchmark evaluation
- Recruitment bias and fairness research
- Annotation quality and human-AI collaboration studies
ASSOCIATED PUBLICATION:
This dataset supports the Data in Brief article "CareerCorpus: A Comprehensive Dataset of Annotated Resumes" by Md Sagor Chowdhury, Adiba Fairooz Chowdhury, Ayesha Banu, and Riad Hossain (2025).
LICENSE:
Released under CC-BY-4.0 for open research use with appropriate citation.
CONTACT:
For questions: riad.h@eastdelta.edu.bd
Institution: Department of Computer Science and Engineering, East Delta University, Chattogram, Bangladesh
Created:
2025-12-05



