
KSE-RESEARCH-Group/Work_UA_vacancies

Hugging Face · Updated 2026-03-10 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/KSE-RESEARCH-Group/Work_UA_vacancies
Resource description:

---
license: cc-by-nc-4.0
language:
- uk
tags:
- vacancies
- job-postings
- nlp
- ukrainian
- job-market
- information-extraction
- structured-data
size_categories:
- 10K<n<100K
---

# WorkUA Job Vacancies Dataset

## Dataset Summary

This dataset contains **98,702 job vacancy entries** scraped from publicly available postings on [Work.ua](https://www.work.ua/), Ukraine's largest job portal. The data covers a snapshot of the Ukrainian labor market and includes job titles, full descriptions, salary information, required skills, company attributes, and employment conditions.

**Scraping window**: July – August 2025.

**Intended use:**

- Ukrainian-language NLP and text classification
- Job market and salary analysis
- Vacancy–resume matching
- Skills demand analysis
- Labor market research

> **Privacy**: Postings are employer-authored public listings. No candidate personal data is present. The `url` field has been removed. `company_id` is a derived integer identifier (see Schema notes).

---

## Processing Pipeline

| Stage | Description |
|---|---|
| **Scraping** | Vacancy pages scraped from Work.ua (publicly accessible employer listings) |
| **HTML → text** | Raw HTML parsed; `description_text` extracted as clean text from the job description body |
| **Skills** | Extracted directly from the structured skills widget on the page (platform-normalized tags) |

After cleaning: **98,702 unique vacancies**.

---

## File

**`vacancies.ndjson`**: newline-delimited JSON, 98,702 rows, 15 columns.

### Schema

```
id                    String     Numeric vacancy ID from Work.ua
title                 String     Job position title (0.3% null)
salary                String     Salary as displayed on the page, e.g. "25 000 грн" (28% null)
salary_details        String     Additional salary text (bonuses, commission info, etc.) (51% null)
company_size          String     Employer size category, e.g. "10–50 співробітників" (9% null)
location_details      String     Location as displayed on the page (6% null)
is_remote             Boolean    Whether remote work is indicated (5.5% true)
employment_type       String     Raw employment type text, e.g. "Повна зайнятість" (non-null)
veteran_preference    Boolean    Whether posting indicates veteran hiring preference (5.9% true)
is_hot                Boolean    Work.ua "hot vacancy" flag (1.2% true)
salary_above_average  Boolean    Work.ua flag: salary marked as above market average (28.5% true)
skills                List[str]  Skill tags from the platform's structured skills widget
description_text      String     Full job description as raw text (always present)
language              String     Language requirements as scraped from the page (87% null)
company_id            Int64      Integer ID derived from company name via categorical encoding; null if company name was missing (~0.3% null)
```

> **Note on `salary`**: The field is a raw string as shown on the page; formats vary ("від 20 000 грн", "20 000–30 000 грн", "за домовленістю", etc.). Parse before use.

> **Note on `company_id`**: This is **not** a Work.ua platform ID. It is derived by mapping `company_name` to a categorical integer. The same company name always maps to the same ID, but the mapping is not persisted in this file, so do not treat it as a stable external identifier.

> **Note on `language`**: This field is scraped from a language requirements widget that is rarely filled (87% null). Values are raw text like `"Англійська — вище середнього"`. Not normalized.

---

## Data Example

```json
{
  "id": "14821813",
  "title": "Менеджер з продажу",
  "salary": "25 000–40 000 грн",
  "salary_details": null,
  "company_size": "10–50 співробітників",
  "location_details": "Київ",
  "is_remote": false,
  "employment_type": "Повна зайнятість",
  "veteran_preference": false,
  "is_hot": false,
  "salary_above_average": true,
  "skills": ["Переговори", "CRM", "Робота з клієнтами"],
  "description_text": "Ми шукаємо активного менеджера з продажу...",
  "language": null,
  "company_id": 4821
}
```

---

## Known Limitations

| Issue | Detail |
|---|---|
| **Salary is unparsed free text** | Formats vary widely; requires regex/parsing before any quantitative analysis |
| **`salary_above_average` is platform-defined** | Work.ua determines this flag by its own internal market comparison; methodology unknown |
| **`employment_type` is non-standardized** | Raw text scraped from the page; includes whitespace artifacts and inconsistent phrasing |
| **`location_details` is non-standardized** | Can contain a city, full address, or multiple locations concatenated |
| **`language` is 87% null** | Field is optional and rarely filled by employers; absence does not mean no language requirements |
| **`company_id` is not stable** | Derived via categorical encoding from `company_name`; cannot be joined to any external source |
| **Point-in-time snapshot** | Reflects vacancies active during July–August 2025; does not track posting lifecycle or closures |
| **`description_text` noise** | Raw text from the DOM; may contain HTML entities, excessive whitespace, or emoji |
| **`skills` quality** | Platform-normalized but employer-chosen; reflects what employers wrote, not a curated taxonomy |

---

## Ethical Considerations

- All entries are **publicly posted employer listings**
- No candidate or applicant data is present
- Vacancy URLs have been removed from the published dataset
- Data reflects the Ukrainian labor market during the active war period (July–August 2025); the `veteran_preference` field and geographic distribution of listings are directly shaped by this context

---

## Language

Primarily **Ukrainian** (`uk`). Job descriptions are written by employers in Ukrainian; some postings contain English terms or phrases, particularly in IT-sector vacancies.

---

## License

[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)

Free to use for non-commercial research with attribution.

---

## Acknowledgments

- Data source: [Work.ua](https://www.work.ua/)
- Processing: [Polars](https://pola.rs/)
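
The whitespace artifacts and HTML entities listed under Known Limitations can be cleaned while reading the file. The following is a minimal sketch using only the Python standard library. The helper names `normalize_text` and `read_vacancies` are hypothetical, and collapsing all whitespace runs to a single space is an assumption that also discards paragraph breaks in `description_text`.

```python
import html
import io
import json
import re
from typing import Iterator, Optional, TextIO


def normalize_text(value: Optional[str]) -> Optional[str]:
    """Decode HTML entities and collapse whitespace runs; pass None through."""
    if value is None:
        return None
    return re.sub(r"\s+", " ", html.unescape(value)).strip()


def read_vacancies(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty NDJSON line, with two text fields cleaned."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        for field in ("employment_type", "description_text"):
            row[field] = normalize_text(row.get(field))
        yield row


# In-memory stand-in for the real file, so the sketch runs without a download.
sample = io.StringIO(
    '{"id": "1", "employment_type": "  Повна   зайнятість ", '
    '"description_text": "A &amp; B"}\n'
)
rows = list(read_vacancies(sample))
```

With a local copy of the dataset, the same generator can be fed `open("vacancies.ndjson", encoding="utf-8")` in place of `sample`.
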
Provider:
KSE-RESEARCH-Group