KSE-RESEARCH-Group/Work_UA_vacancies
Source: Hugging Face | Updated: 2026-03-10 | Indexed: 2026-04-05
Download link:
https://hf-mirror.com/datasets/KSE-RESEARCH-Group/Work_UA_vacancies
Resource description:
---
license: cc-by-nc-4.0
language:
- uk
tags:
- vacancies
- job-postings
- nlp
- ukrainian
- job-market
- information-extraction
- structured-data
size_categories:
- 10K<n<100K
---
# WorkUA Job Vacancies Dataset
## Dataset Summary
This dataset contains **98,702 job vacancy entries** scraped from publicly available postings on [Work.ua](https://www.work.ua/), Ukraine's largest job portal. The data covers a snapshot of the Ukrainian labor market and includes job titles, full descriptions, salary information, required skills, company attributes, and employment conditions.

**Scraping window**: July–August 2025.

**Intended use:**
- Ukrainian-language NLP and text classification
- Job market and salary analysis
- Vacancy–resume matching
- Skills demand analysis
- Labor market research
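
As an illustration of the skills demand use case listed above, the sketch below tallies how many postings mention each skill tag. It is a minimal example under stated assumptions, not part of the dataset's tooling: the helper name `top_skills` and the sample rows are hypothetical, and a posting with a null or empty `skills` value is treated as having no tags.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_skills(records: Iterable[dict], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent skill tags with their posting counts."""
    counts: Counter = Counter()
    for record in records:
        # set() so a tag repeated inside one posting is counted once;
        # `or []` covers postings where `skills` is null or absent (assumption).
        counts.update(set(record.get("skills") or []))
    return counts.most_common(n)


# Illustrative rows only; real rows come from vacancies.ndjson.
sample = [
    {"id": "1", "skills": ["CRM", "Переговори"]},
    {"id": "2", "skills": ["CRM"]},
    {"id": "3", "skills": []},
    {"id": "4", "skills": None},
]
ranking = top_skills(sample)
```

Counting postings rather than raw tag occurrences keeps a single listing from inflating a skill's rank.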
> **Privacy**: Postings are employer-authored public listings. No candidate personal data is present. The `url` field has been removed. `company_id` is a derived integer identifier (see Schema notes).
---
## Processing Pipeline
| Stage | Description |
|---|---|
| **Scraping** | Vacancy pages scraped from Work.ua (publicly accessible employer listings) |
| **HTML → text** | Raw HTML parsed; `description_text` extracted as plain text from the job description body |
| **Skills** | Extracted directly from the structured skills widget on the page (platform-normalized tags) |

After cleaning: **98,702 unique vacancies**.
---
## File
**`vacancies.ndjson`** — Newline-delimited JSON, 98,702 rows, 15 columns.
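
Because the file is newline-delimited JSON, it can be streamed one record at a time with the standard library alone. The sketch below is one possible reader, not the authors' own loader: the helper name `iter_vacancies` and the in-memory sample rows are hypothetical and only mirror a few schema columns.

```python
import json
from typing import Iterable, Iterator


def iter_vacancies(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


# Tiny in-memory sample standing in for the real file (values are illustrative).
sample_lines = [
    '{"id": "1", "title": "Менеджер з продажу", "is_remote": false, "skills": ["CRM"]}',
    "",
    '{"id": "2", "title": null, "is_remote": true, "skills": []}',
]
records = list(iter_vacancies(sample_lines))

# Share of sample rows flagged as remote.
remote_share = sum(r["is_remote"] for r in records) / len(records)
```

With the real file, the same generator could be fed an open handle, for example `open("vacancies.ndjson", encoding="utf-8")`, which avoids holding all 98,702 rows in memory at once.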
### Schema
```
Column                Type       Description
id                    String     Numeric vacancy ID from Work.ua
title                 String     Job position title (0.3% null)
salary                String     Salary as displayed on the page, e.g. "25 000 грн" (28% null)
salary_details        String     Additional salary text (bonuses, commission info, etc.) (51% null)
company_size          String     Employer size category, e.g. "10–50 співробітників" (9% null)
location_details      String     Location as displayed on the page (6% null)
is_remote             Boolean    Whether remote work is indicated (5.5% true)
employment_type       String     Raw employment type text, e.g. "Повна зайнятість" (non-null)
veteran_preference    Boolean    Whether posting indicates veteran hiring preference (5.9% true)
is_hot                Boolean    Work.ua "hot vacancy" flag (1.2% true)
salary_above_average  Boolean    Work.ua flag: salary marked as above market average (28.5% true)
skills                List[str]  Skill tags from the platform's structured skills widget
description_text      String     Full job description as raw text (always present)
language              String     Language requirements as scraped from the page (87% null)
company_id            Int64      Integer ID derived from company name via categorical encoding;
                                 null if company name was missing (~0.3% null)
```
> **Note on `salary`**: The field is a raw string as shown on the page, and formats vary ("від 20 000 грн", "20 000–30 000 грн", "за домовленістю", etc.). Parse before use.
>
> **Note on `company_id`**: This is **not** a Work.ua platform ID. It is derived by mapping the company name to a categorical integer; the name itself is not among the published columns. The same company name always maps to the same ID, but the mapping is not persisted in this file, so do not treat it as a stable external identifier.
>
> **Note on `language`**: This field is scraped from a language requirements widget that is rarely filled (87% null). Values are raw, non-normalized text such as `"Англійська — вище середнього"`.
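
Since `salary` has to be parsed before use, the following sketch shows one way to turn the formats quoted above into numeric bounds. It is a rough starting point rather than a complete parser. The helper name `parse_salary` is hypothetical. The handling of a "до" ("up to") prefix, of non-breaking or thin spaces between digit groups, and of amounts as whole hryvnias are all assumptions not confirmed by the dataset card. Strings without digits, such as "за домовленістю" (negotiable), yield no bounds.

```python
import re
from typing import Optional, Tuple

# Digit groups separated by ordinary, non-breaking, or thin spaces, e.g. "25 000".
_NUMBER = re.compile(r"\d+(?:[ \u00a0\u2009\u202f]\d{3})*")


def parse_salary(raw: Optional[str]) -> Tuple[Optional[int], Optional[int]]:
    """Return (lower, upper) salary bounds; None where a bound is not stated."""
    if not raw:
        return (None, None)
    numbers = [int(re.sub(r"\D", "", match)) for match in _NUMBER.findall(raw)]
    if not numbers:  # no figure at all, e.g. "за домовленістю"
        return (None, None)
    if len(numbers) >= 2:  # range such as "20 000–30 000 грн"
        return (numbers[0], numbers[1])
    text = raw.strip().lower()
    if text.startswith("від "):  # "from": lower bound only
        return (numbers[0], None)
    if text.startswith("до "):  # "up to": upper bound only (assumed format)
        return (None, numbers[0])
    return (numbers[0], numbers[0])  # single fixed figure


examples = ["25 000 грн", "від 20 000 грн", "20 000–30 000 грн", "за домовленістю"]
parsed = {text: parse_salary(text) for text in examples}
```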
---
## Data Example
```json
{
"id": "14821813",
"title": "Менеджер з продажу",
"salary": "25 000–40 000 грн",
"salary_details": null,
"company_size": "10–50 співробітників",
"location_details": "Київ",
"is_remote": false,
"employment_type": "Повна зайнятість",
"veteran_preference": false,
"is_hot": false,
"salary_above_average": true,
"skills": ["Переговори", "CRM", "Робота з клієнтами"],
"description_text": "Ми шукаємо активного менеджера з продажу...",
"language": null,
"company_id": 4821
}
```
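
A record like the one above can be checked against the documented schema before analysis. The sketch below is a hypothetical validator, not something shipped with the dataset: the name `schema_problems` is made up for illustration, nullability is taken from the null percentages in the schema, and treating the boolean columns, `id`, and `skills` as never null is an assumption.

```python
from typing import Any, Dict, List

# Column -> (expected Python type, nullable?). Non-nullable booleans, `id`,
# and `skills` are assumptions; the rest follows the schema's null rates.
SCHEMA = {
    "id": (str, False),
    "title": (str, True),
    "salary": (str, True),
    "salary_details": (str, True),
    "company_size": (str, True),
    "location_details": (str, True),
    "is_remote": (bool, False),
    "employment_type": (str, False),
    "veteran_preference": (bool, False),
    "is_hot": (bool, False),
    "salary_above_average": (bool, False),
    "skills": (list, False),
    "description_text": (str, False),
    "language": (str, True),
    "company_id": (int, True),
}


def _type_ok(value: Any, expected: type) -> bool:
    # bool is a subclass of int, so reject True/False in integer columns.
    if expected is int and isinstance(value, bool):
        return False
    return isinstance(value, expected)


def schema_problems(record: Dict[str, Any]) -> List[str]:
    """List human-readable schema violations; an empty list means the record passes."""
    problems = []
    for name, (expected, nullable) in SCHEMA.items():
        if name not in record:
            problems.append(f"{name}: missing")
        elif record[name] is None:
            if not nullable:
                problems.append(f"{name}: unexpected null")
        elif not _type_ok(record[name], expected):
            got = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {got}")
    for name in sorted(record.keys() - SCHEMA.keys()):
        problems.append(f"{name}: unexpected column")
    return problems


example = {
    "id": "14821813", "title": "Менеджер з продажу", "salary": "25 000–40 000 грн",
    "salary_details": None, "company_size": "10–50 співробітників",
    "location_details": "Київ", "is_remote": False,
    "employment_type": "Повна зайнятість", "veteran_preference": False,
    "is_hot": False, "salary_above_average": True,
    "skills": ["Переговори", "CRM", "Робота з клієнтами"],
    "description_text": "Ми шукаємо активного менеджера з продажу...",
    "language": None, "company_id": 4821,
}
issues = schema_problems(example)
```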
---
## Known Limitations
| Issue | Detail |
|---|---|
| **Salary is unparsed free text** | Formats vary widely; requires regex/parsing before any quantitative analysis |
| **`salary_above_average` is platform-defined** | Work.ua determines this flag by its own internal market comparison — methodology unknown |
| **`employment_type` is non-standardized** | Raw text scraped from the page; includes whitespace artifacts and inconsistent phrasing |
| **`location_details` is non-standardized** | Can contain a city, full address, or multiple locations concatenated |
| **`language` is 87% null** | Field is optional and rarely filled by employers; absence does not mean no language requirements |
| **`company_id` is not stable** | Derived via categorical encoding from `company_name`; cannot be joined to any external source |
| **Point-in-time snapshot** | Reflects vacancies active during July–August 2025; does not track posting lifecycle or closures |
| **`description_text` noise** | Raw text from the DOM; may contain HTML entities, excessive whitespace, or emoji |
| **`skills` quality** | Tags are platform-normalized but selected by employers; coverage reflects employer choices, not a curated taxonomy |
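
Two of the limitations above, whitespace artifacts in `employment_type` and HTML entities or excess whitespace in `description_text`, can be reduced with a small normalization pass. The sketch below is one possible approach and the helper name `clean_text` is hypothetical. Note that it collapses line breaks as well, so paragraph structure in descriptions is lost, and it leaves emoji untouched.

```python
import html
import re
from typing import Optional

_WHITESPACE = re.compile(r"\s+")


def clean_text(raw: Optional[str]) -> Optional[str]:
    """Decode HTML entities and collapse whitespace runs; pass None through."""
    if raw is None:
        return None
    text = html.unescape(raw)          # "&amp;" becomes "&", "&nbsp;" becomes "\xa0"
    text = _WHITESPACE.sub(" ", text)  # \s also matches "\xa0", tabs, and newlines
    return text.strip()


employment = clean_text("  Повна&nbsp;зайнятість \n")
department = clean_text("R&amp;D\t\tвідділ")
```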
---
## Ethical Considerations
- All entries are **publicly posted employer listings**
- No candidate or applicant data is present
- Vacancy URLs have been removed from the published dataset
- Data reflects the Ukrainian labor market during the active war period (July–August 2025); the `veteran_preference` field and geographic distribution of listings are directly shaped by this context
---
## Language
Primarily **Ukrainian** (`uk`). Job descriptions are written by employers in Ukrainian; some postings contain English terms or phrases, particularly in IT-sector vacancies.
---
## License
[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)
Free to use for non-commercial purposes, including research, with attribution.
---
## Acknowledgments
- Data source: [Work.ua](https://www.work.ua/)
- Processing: [Polars](https://pola.rs/)
Provider:
KSE-RESEARCH-Group



