KSE-RESEARCH-Group/Work_UA_resumes
Source: Hugging Face. Updated: 2026-03-15. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/KSE-RESEARCH-Group/Work_UA_resumes
Dataset description:
---
license: cc-by-nc-4.0
language:
- uk
tags:
- resumes
- nlp
- ukrainian
- job-market
- information-extraction
- named-entity-recognition
- structured-data
size_categories:
- 100K<n<1M
---
# WorkUA Resumes Dataset
## Dataset Summary
This dataset contains **103,895 structured resume entries** collected from publicly available candidate profiles on [Work.ua](https://www.work.ua/resumes/), Ukraine's largest job platform. Resumes were scraped, parsed, cleaned, and deduplicated for research use.
**Scraping window**: July 9 – August 22, 2025.
**Intended use:**
- Resume parsing and information extraction
- Ukrainian-language NLP pipelines
- Vacancy–candidate matching
- Labor market and salary analysis
- Career recommendation systems
- Text classification and semantic search
> **Privacy**: All direct personal identifiers have been removed — candidate full name, profile URL, and contact details are not present. Only the numeric resume `id` is retained for cross-referencing purposes.
---
## Processing Pipeline
Resumes were scraped as HTML and converted to Markdown. Structured fields were then extracted in three batches, depending on how each resume was laid out:
| Batch | Resumes | Method | Description |
|---|---|---|---|
| Standard | 84,245 | Regex | Well-structured resumes with a consistent HTML layout. Fields like `work_experiences`, `educations`, `skills`, `languages`, etc. were extracted directly via regular expressions from the page DOM. |
| File-based | 14,397 | Gemini 2.5 Flash | Resumes uploaded by candidates as files (PDF, DOC, etc.). Work.ua renders a degraded text preview of these; the DOM-based regex approach was not viable. The full Markdown content was sent to Gemini with a strict JSON schema prompt (`temperature=0`) to extract all structured fields. |
| Extended | 5,253 | Gemini 2.5 Flash | Standard resumes where structured fields (`educations`, `work_experiences`, `skills`, `languages`, `driver_license`, etc.) were absent from the DOM but present inside the free-text `additional_info` blob. Gemini was used to parse these out and populate the same schema. |
After concatenation and deduplication: **103,895 unique resumes**.
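As a quick sanity check on the figures above, the batch sizes from the table can be summed and turned into shares. This is only an arithmetic sketch; the dictionary keys are the batch names from the table, and nothing here reflects the authors' actual pipeline code.

```python
# Batch sizes as reported in the Processing Pipeline table.
batches = {"Standard": 84_245, "File-based": 14_397, "Extended": 5_253}

# The three batches should add up to the published row count.
total = sum(batches.values())

# Share of each batch in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in batches.items()}

print(total)
print(shares)
```

The total matches the 103,895 rows reported for `resumes.ndjson`, which suggests the per-batch counts are already post-deduplication.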
---
## File
**`resumes.ndjson`** — Newline-delimited JSON, 103,895 rows, 18 columns.
### Schema
```
id String — Numeric resume ID from Work.ua
title String — Job title the candidate is seeking
age Int64 — Candidate age (27% null)
city String — Candidate's city
desired_salary Int64 — Monthly salary expectation in UAH (15% null)
employment_type String — e.g. "повна", "неповна" (47% null, inconsistently filled)
work_location_preference String — e.g. "Дистанційно", "Офіс" (71% null, inconsistently filled)
driver_license Boolean — Whether a driver's license is mentioned
creation_date String — Date the resume was scraped (YYYY-MM-DD)
other_resumes List[Struct] — Other resumes of the same candidate: {title, resume_id}
veteran Boolean — Self-reported veteran status
disability String — Disability group: "Перша/Друга/Третя група" or null
work_experiences List[Struct] — {position, start_date, end_date, company, city, industry, responsibilities}
languages List[Struct] — {language, level}
skills List[String] — Free-text skill tags
educations List[Struct] — {institution, faculty, city, level, start_year, end_year}
additional_educations List[Struct] — Courses and certifications: {institution, start_year, end_year}
additional_info String — Free-text remainder that did not fit structured fields (61% null)
```
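A minimal sketch of reading the file and checking null rates per column, using only the Python standard library. The helper names `read_ndjson` and `null_rates` are hypothetical, and the two sample rows are made up so the snippet runs without the real file.

```python
import io
import json
from collections import Counter


def read_ndjson(stream):
    """Yield one dict per non-empty line of a newline-delimited JSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def null_rates(records, columns):
    """Return the fraction of records in which each column is missing or null."""
    nulls = Counter()
    total = 0
    for rec in records:
        total += 1
        for col in columns:
            if rec.get(col) is None:
                nulls[col] += 1
    return {col: (nulls[col] / total if total else 0.0) for col in columns}


# Tiny in-memory stand-in for resumes.ndjson (values are invented).
sample = io.StringIO(
    '{"id": "1", "age": 24, "desired_salary": 37000}\n'
    '{"id": "2", "age": null, "desired_salary": 20000}\n'
)
rates = null_rates(read_ndjson(sample), ["age", "desired_salary"])
print(rates)
```

With the real data, replacing `sample` with `open("resumes.ndjson", encoding="utf-8")` should work the same way, since each line is expected to be one JSON object.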
---
## Data Example
```json
{
"id": "14656003",
"title": "Розробник WordPress",
"age": 24,
"city": "Київ",
"desired_salary": 37000,
"employment_type": null,
"work_location_preference": "Дистанційно",
"driver_license": false,
"creation_date": "2025-08-17",
"other_resumes": [],
"veteran": false,
"disability": null,
"work_experiences": [
{
"position": "Middle WordPress / Full-Stack Developer",
"start_date": "2022-01-01",
"end_date": "2025-01-01",
"company": "CullyCully Studio",
"city": "Швейцарія",
"industry": null,
"responsibilities": "Розробка комерційних сайтів та корпоративних сторінок..."
}
],
"languages": [
{"language": "Українська", "level": "рідна"},
{"language": "Англійська", "level": "середній"}
],
"skills": ["WordPress", "WooCommerce", "PHP", "JavaScript"],
"educations": [],
"additional_educations": [],
"additional_info": null
}
```
---
## Known Limitations
| Issue | Detail |
|---|---|
| **High null rates in preference fields** | `work_location_preference` (71% null) and `employment_type` (47% null) were inconsistently filled by candidates; do not read a null as a deliberate "no preference" answer |
| **Self-reported salary** | Extreme values (e.g. 1 UAH, 400,000 UAH) are likely data entry errors; filter before use |
| **Age anomalies** | Max reported age of 125 — filter to a reasonable range before use |
| **Inconsistent date formats** | `work_experiences.start_date` / `end_date` vary: `"2022-01"`, `"2022-01-01"`, `"Present"`, `null` |
| **Non-standardized free-text fields** | `employment_type`, `languages.level`, and `skills` are raw free-text with no normalization |
| **Extraction quality split** | ~81% regex (deterministic), ~14% Gemini for file resumes, ~5% Gemini for additional info extraction — the LLM-extracted portion may have subtle inconsistencies |
| **`skills` noise** | Mix of technical skills, soft skills, and personal traits with no taxonomy applied |
| **Wartime context** | 43.7% of candidates self-reported veteran status — data reflects the Ukrainian labor market during the active war period (July–August 2025) |
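The filtering and date issues listed above could be handled along these lines. This is a sketch under stated assumptions: the age and salary bounds are illustrative choices, not values recommended by the dataset authors, and the helper names (`in_range`, `clean_numeric`, `normalize_date`) are hypothetical.

```python
import re

# Assumed bounds; adjust to the analysis at hand.
AGE_RANGE = (16, 80)              # plausible working ages
SALARY_RANGE = (5_000, 300_000)   # plausible monthly salary in UAH

# Matches "YYYY-MM" or "YYYY-MM-DD".
_DATE_RE = re.compile(r"(\d{4})-(\d{2})(?:-(\d{2}))?")


def in_range(value, bounds):
    """True if value is non-null and within the inclusive bounds."""
    low, high = bounds
    return value is not None and low <= value <= high


def clean_numeric(record):
    """Return a copy of record with out-of-range age/salary set to None."""
    cleaned = dict(record)
    if not in_range(cleaned.get("age"), AGE_RANGE):
        cleaned["age"] = None
    if not in_range(cleaned.get("desired_salary"), SALARY_RANGE):
        cleaned["desired_salary"] = None
    return cleaned


def normalize_date(raw):
    """Map "YYYY-MM" or "YYYY-MM-DD" to "YYYY-MM-DD"; anything else to None."""
    if raw is None:
        return None
    match = _DATE_RE.fullmatch(raw.strip())
    if match is None:
        return None  # e.g. "Present" or an unrecognized format
    year, month, day = match.groups()
    return f"{year}-{month}-{day or '01'}"


print(clean_numeric({"id": "1", "age": 125, "desired_salary": 1}))
print(normalize_date("2022-01"))
print(normalize_date("Present"))
```

Note that `normalize_date` collapses `"Present"` and null into the same `None`; if ongoing positions matter for the analysis, it may be worth recording a separate flag before normalizing.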
---
## Ethical Considerations
- All resumes were scraped from **publicly accessible** pages on Work.ua
- **No contact information** (phone, email, address) is included
- **Candidate full names and profile URLs** were removed prior to publication
- **Recommenders' names** have been removed; the `recommendations` field is not included
- Do not attempt to re-identify individuals from this data
- Fields like `veteran` and `disability` reflect sensitive self-disclosed information and must be handled with care
---
## Language
Primarily **Ukrainian** (`uk`). A minority of entries contain mixed Ukrainian/English text. No language filtering has been applied.
---
## License
[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)
Free to use for non-commercial research with attribution.
---
## Acknowledgments
- Data source: [Work.ua](https://www.work.ua/)
- AI extraction: Google Gemini 2.5 Flash
- Processing: [Polars](https://pola.rs/)
Provider:
KSE-RESEARCH-Group



