KSE-RESEARCH-Group/Work_UA_resumes

Source: Hugging Face. Updated 2026-03-15; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/KSE-RESEARCH-Group/Work_UA_resumes
Resource description:
---
license: cc-by-nc-4.0
language:
- uk
tags:
- resumes
- nlp
- ukrainian
- job-market
- information-extraction
- named-entity-recognition
- structured-data
size_categories:
- 100K<n<1M
---

# WorkUA Resumes Dataset

## Dataset Summary

This dataset contains **103,895 structured resume entries** collected from publicly available candidate profiles on [Work.ua](https://www.work.ua/resumes/), Ukraine's largest job platform. Resumes were scraped, parsed, cleaned, and deduplicated for research use.

**Scraping window**: July 9 – August 22, 2025.

**Intended use:**

- Resume parsing and information extraction
- Ukrainian-language NLP pipelines
- Vacancy–candidate matching
- Labor market and salary analysis
- Career recommendation systems
- Text classification and semantic search

> **Privacy**: All direct personal identifiers have been removed: candidate full name, profile URL, and contact details are not present. Only the numeric resume `id` is retained for cross-referencing purposes.

---

## Processing Pipeline

Resumes were scraped as HTML, converted to Markdown, then structured fields were extracted in three batches:

| Batch | Resumes | Method | Description |
|---|---|---|---|
| Standard | 84,245 | Regex | Well-structured resumes with a consistent HTML layout. Fields like `work_experiences`, `educations`, `skills`, `languages`, etc. were extracted directly via regular expressions from the page DOM. |
| File-based | 14,397 | Gemini 2.5 Flash | Resumes uploaded by candidates as files (PDF, DOC, etc.). Work.ua renders a degraded text preview of these; the DOM-based regex approach was not viable. The full Markdown content was sent to Gemini with a strict JSON schema prompt (`temperature=0`) to extract all structured fields. |
| Extended | 5,253 | Gemini 2.5 Flash | Standard resumes where structured fields (`educations`, `work_experiences`, `skills`, `languages`, `driver_license`, etc.) were absent from the DOM but present inside the free-text `additional_info` blob. Gemini was used to parse these out and populate the same schema. |

After concatenation and deduplication: **103,895 unique resumes**.

---

## File

**`resumes.ndjson`**: newline-delimited JSON, 103,895 rows, 18 columns.

### Schema

```
id                        String        - Numeric resume ID from Work.ua
title                     String        - Job title the candidate is seeking
age                       Int64         - Candidate age (27% null)
city                      String        - Candidate's city
desired_salary            Int64         - Monthly salary expectation in UAH (15% null)
employment_type           String        - e.g. "повна", "неповна" (47% null, inconsistently filled)
work_location_preference  String        - e.g. "Дистанційно", "Офіс" (71% null, inconsistently filled)
driver_license            Boolean       - Whether a driver's license is mentioned
creation_date             String        - Date the resume was scraped (YYYY-MM-DD)
other_resumes             List[Struct]  - Other resumes of the same candidate: {title, resume_id}
veteran                   Boolean       - Self-reported veteran status
disability                String        - Disability group: "Перша/Друга/Третя група" or null
work_experiences          List[Struct]  - {position, start_date, end_date, company, city, industry, responsibilities}
languages                 List[Struct]  - {language, level}
skills                    List[String]  - Free-text skill tags
educations                List[Struct]  - {institution, faculty, city, level, start_year, end_year}
additional_educations     List[Struct]  - Courses and certifications: {institution, start_year, end_year}
additional_info           String        - Free-text remainder that did not fit structured fields (61% null)
```

---

## Data Example

```json
{
  "id": "14656003",
  "title": "Розробник WordPress",
  "age": 24,
  "city": "Київ",
  "desired_salary": 37000,
  "employment_type": null,
  "work_location_preference": "Дистанційно",
  "driver_license": false,
  "creation_date": "2025-08-17",
  "other_resumes": [],
  "veteran": false,
  "disability": null,
  "work_experiences": [
    {
      "position": "Middle WordPress / Full-Stack Developer",
      "start_date": "2022-01-01",
      "end_date": "2025-01-01",
      "company": "CullyCully Studio",
      "city": "Швейцарія",
      "industry": null,
      "responsibilities": "Розробка комерційних сайтів та корпоративних сторінок..."
    }
  ],
  "languages": [
    {"language": "Українська", "level": "рідна"},
    {"language": "Англійська", "level": "середній"}
  ],
  "skills": ["WordPress", "WooCommerce", "PHP", "JavaScript"],
  "educations": [],
  "additional_educations": [],
  "additional_info": null
}
```

---

## Known Limitations

| Issue | Detail |
|---|---|
| **High null rates in preference fields** | `work_location_preference` (71% null) and `employment_type` (47% null) were inconsistently filled by candidates; do not treat null as "not specified" |
| **Self-reported salary** | Extreme values (e.g. 1 UAH, 400,000 UAH) are likely data entry errors; filter before use |
| **Age anomalies** | Max reported age of 125; filter to a reasonable range before use |
| **Inconsistent date formats** | `work_experiences.start_date` / `end_date` vary: `"2022-01"`, `"2022-01-01"`, `"Present"`, `null` |
| **Non-standardized free-text fields** | `employment_type`, `languages.level`, and `skills` are raw free-text with no normalization |
| **Extraction quality split** | ~81% regex (deterministic), ~14% Gemini for file resumes, ~5% Gemini for additional info extraction; the LLM-extracted portion may have subtle inconsistencies |
| **`skills` noise** | Mix of technical skills, soft skills, and personal traits with no taxonomy applied |
| **Wartime context** | 43.7% of candidates self-reported veteran status; the data reflects the Ukrainian labor market during the active war period (July–August 2025) |

---

## Ethical Considerations

- All resumes were scraped from **publicly accessible** pages on Work.ua
- **No contact information** (phone, email, address) is included
- **Candidate full names and profile URLs** were removed prior to publication
- **Recommenders' names** have been removed; the `recommendations` field is not included
- Do not attempt to re-identify individuals from this data
- Fields like `veteran` and `disability` reflect sensitive self-disclosed information and must be handled with care

---

## Language

Primarily **Ukrainian** (`uk`). A minority of entries contain mixed Ukrainian/English text. No language filtering has been applied.

---

## License

[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)

Free to use for non-commercial research with attribution.

---

## Acknowledgments

- Data source: [Work.ua](https://www.work.ua/)
- AI extraction: Google Gemini 2.5 Flash
- Processing: [Polars](https://pola.rs/)
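
---

## Cleaning Sketch

The Known Limitations table recommends filtering implausible ages and salaries and notes that `start_date` / `end_date` come in several formats. The following is a minimal sketch of that cleanup using only the Python standard library. The function names and the numeric thresholds (`MIN_AGE`, `MAX_AGE`, `MIN_SALARY`, `MAX_SALARY`) are illustrative assumptions, not values from the dataset authors; choose bounds that suit your analysis.

```python
import json
from datetime import date
from typing import Iterable, Iterator, Optional

# Hypothetical thresholds chosen for illustration only.
MIN_AGE, MAX_AGE = 16, 80
MIN_SALARY, MAX_SALARY = 1_000, 300_000  # UAH per month


def parse_partial_date(value: Optional[str]) -> Optional[date]:
    """Turn "YYYY-MM" or "YYYY-MM-DD" into a date (day defaults to 1).

    Returns None for null, "Present", or any other unrecognized value.
    """
    if not value:
        return None
    parts = value.split("-")
    if len(parts) not in (2, 3):
        return None
    try:
        year, month = int(parts[0]), int(parts[1])
        day = int(parts[2]) if len(parts) == 3 else 1
        return date(year, month, day)
    except ValueError:
        return None


def is_plausible(record: dict) -> bool:
    """Reject out-of-range age or salary; null values are kept."""
    age = record.get("age")
    salary = record.get("desired_salary")
    if age is not None and not (MIN_AGE <= age <= MAX_AGE):
        return False
    if salary is not None and not (MIN_SALARY <= salary <= MAX_SALARY):
        return False
    return True


def read_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Tiny in-memory sample standing in for resumes.ndjson.
sample = [
    '{"id": "1", "age": 24, "desired_salary": 37000}',
    '{"id": "2", "age": 125, "desired_salary": 40000}',
    '{"id": "3", "age": null, "desired_salary": 1}',
]
kept = [r["id"] for r in read_ndjson(sample) if is_plausible(r)]
```

To run this on the real file, pass an open file handle for `resumes.ndjson` to `read_ndjson` in place of `sample`. Records with a null `age` or `desired_salary` are kept here because those fields are null in a large share of rows; drop them instead if your analysis needs complete values. Because `parse_partial_date` returns `None` for both `null` and `"Present"`, check the raw string first if you need to tell an ongoing job from a missing date.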
Provided by:
KSE-RESEARCH-Group