KSE-RESEARCH-Group/Work_UA

Name: KSE-RESEARCH-Group/Work_UA
Creator: KSE-RESEARCH-Group
Published: 2025-11-29 12:47:51
License: No description available

Source: Hugging Face | Updated: 2025-11-29 | Indexed: 2026-01-03

Download link:

https://hf-mirror.com/datasets/KSE-RESEARCH-Group/Work_UA


Resource description:

---
language:
- uk
size_categories:
- 10K<n<100K
---

# WorkUA Resumes Dataset

## Dataset Summary

This dataset consists of **104,966 resume entries** collected from publicly available pages on [Work.ua](https://www.work.ua/resumes/). Each entry represents structured information extracted from a candidate's resume, including education, work experience, skills, languages, disability status, veteran status, driver license presence, and additional profile metadata.

The dataset is designed for research and development of:

- Resume parsing models
- Information extraction systems
- Vacancy-candidate matching algorithms
- NLP pipelines for Ukrainian-language documents
- Data engineering and ML training workflows
- Career recommendation systems
- Applicant ranking models

All personally identifying information has been removed or anonymized.

## Dataset Structure

The dataset is split into three files based on the extraction method and data complexity:

### File Descriptions

| File | Rows | Description | Extraction Method |
|------|------|-------------|-------------------|
| `resumes_regex.ndjson` | 84,316 | Standard resumes with structured information | Regular expressions |
| `resumes_files_gemini.ndjson` | 14,397 | Resumes uploaded as files (PDF, DOC, etc.) | Gemini 2.5 Flash |
| `resumes_extended_gemini.ndjson` | 5,253 | Resumes with complex additional information | Gemini 2.5 Flash |

**Total**: 104,966 resumes

### Data Processing Pipeline

1. **Source**: Resumes were scraped from the Work.ua website, where candidates had publicly posted their information
2. **Format conversion**: The original HTML was converted to Markdown for easier processing
3. **Extraction methods**:
   - **Regex extraction**: Used for standard, well-structured resumes with consistent formatting
   - **Gemini 2.5 Flash**: Used for file-based uploads and for resumes with complex, unstructured additional information that required AI-powered parsing

### Schema Overview

All three files share the same schema with **21 fields**:

```python
Schema([
    ('id', String),
    ('url', String),
    ('title', String),
    ('candidate_name', String),
    ('age', Int64),
    ('city', String),
    ('desired_salary', Int64),
    ('employment_type', String),
    ('work_location_preference', String),
    ('driver_license', Boolean),
    ('creation_date', Datetime(time_unit='us', time_zone=None)),
    ('other_resumes', List(Struct({'title': String, 'url': String, 'resume_id': String, 'description': String}))),
    ('veteran', Boolean),
    ('disability', String),
    ('work_experiences', List(Struct({'position': String, 'start_date': String, 'end_date': String, 'company': String, 'city': String, 'industry': String, 'responsibilities': String}))),
    ('recommendations', List(Struct({'name': String, 'position': String}))),
    ('languages', List(Struct({'language': String, 'level': String}))),
    ('skills', List(String)),
    ('educations', List(Struct({'institution': String, 'faculty': String, 'city': String, 'level': String, 'start_year': Int64, 'end_year': Int64}))),
    ('additional_educations', List(Struct({'institution': String, 'start_year': Int64, 'end_year': Int64}))),
    ('additional_info', String)
])
```

### Nested Structure Details

#### `work_experiences`

- `position`: Job title
- `start_date`: Start date
- `end_date`: End date
- `company`: Company name
- `city`: Work location
- `industry`: Industry sector
- `responsibilities`: Job responsibilities description

#### `educations`

- `institution`: Educational institution name
- `faculty`: Faculty/department name
- `city`: Institution location
- `level`: Education level (e.g., "Вища", "Середня спеціальна")
- `start_year`: Year started
- `end_year`: Year graduated

#### `additional_educations`

- `institution`: Training institution/course name
- `start_year`: Year started
- `end_year`: Year completed

#### `languages`

- `language`: Language name (e.g., "Українська", "English")
- `level`: Proficiency level (e.g., "вільно", "базовий", "intermediate")

#### `recommendations`

- `name`: Recommender's name
- `position`: Recommender's position/title

#### `other_resumes`

- `title`: Title of another resume by the same candidate
- `url`: URL of the other resume
- `resume_id`: ID of the other resume
- `description`: Brief description

## Data Example

```json
{
  "id": "123456",
  "url": "https://www.work.ua/resumes/123456/",
  "title": "Будівельник",
  "candidate_name": "Іван",
  "age": 32,
  "city": "Київ",
  "desired_salary": 25000,
  "employment_type": "повна",
  "work_location_preference": "офіс",
  "driver_license": true,
  "creation_date": "2025-03-10T12:30:00",
  "veteran": false,
  "disability": null,
  "skills": ["Штукатурка", "Монтаж гіпсокартону"],
  "languages": [{"language": "Українська", "level": "вільно"}],
  "work_experiences": [
    {
      "position": "Будівельник",
      "start_date": "2020-01",
      "end_date": "present",
      "company": "БудКомпанія",
      "city": "Київ",
      "industry": "Будівництво",
      "responsibilities": "Виконання будівельних робіт"
    }
  ],
  "educations": [
    {
      "institution": "КНУБА",
      "faculty": "Промислове та цивільне будівництво",
      "city": "Київ",
      "level": "Вища",
      "start_year": 2012,
      "end_year": 2016
    }
  ],
  "additional_educations": [],
  "recommendations": [],
  "other_resumes": [],
  "additional_info": "Готовий до відряджень."
}
```

## Loading the Dataset

### Using Polars

```python
import polars as pl

# Load individual files
resumes_regex = pl.read_ndjson("resumes_regex.ndjson")
resumes_files = pl.read_ndjson("resumes_files_gemini.ndjson")
resumes_extended = pl.read_ndjson("resumes_extended_gemini.ndjson")

# Combine all resumes
all_resumes = pl.concat([resumes_regex, resumes_files, resumes_extended])
print(f"Total resumes: {len(all_resumes)}")
```

### Using Pandas

```python
import pandas as pd

# Load individual files
resumes_regex = pd.read_json("resumes_regex.ndjson", lines=True)
resumes_files = pd.read_json("resumes_files_gemini.ndjson", lines=True)
resumes_extended = pd.read_json("resumes_extended_gemini.ndjson", lines=True)

# Combine all resumes
all_resumes = pd.concat([resumes_regex, resumes_files, resumes_extended], ignore_index=True)
print(f"Total resumes: {len(all_resumes)}")
```

## Intended Use

- **Resume parsing**: Train and evaluate resume parsing models
- **Information extraction**: Develop IE systems for structured data extraction
- **Semantic search**: Build search engines for candidate discovery
- **Text classification**: Classify resumes by industry, job type, or skill level
- **Matching algorithms**: Develop vacancy-candidate matching systems
- **NLP research**: Study Ukrainian-language document processing
- **Career analytics**: Analyze job market trends and skill requirements
- **Recommendation systems**: Build career path and job recommendation engines

## Limitations

- Some fields may be incomplete due to original document variability
- Date formats may vary across entries (especially in `work_experiences`)
- The quality of extraction may vary between regex-based and Gemini-based processing
- Some `additional_info` fields may contain unstructured or noisy data
- Language proficiency levels use non-standardized terms
- Salary information may be missing for many entries

## Data Quality Notes

### Extraction Method Comparison

- **Regex extraction** (`resumes_regex.ndjson`):
  - Higher consistency for well-structured fields
  - May miss information in non-standard formats
- **Gemini extraction** (`resumes_files_gemini.ndjson`, `resumes_extended_gemini.ndjson`):
  - Better handling of complex, unstructured information
  - May introduce parsing variations
  - Can extract information from diverse file formats

## Ethical Considerations

- All resumes were collected from publicly available pages on Work.ua
- Personally identifying information has been removed or anonymized
- Contact information (phone numbers, emails) is not included
- The dataset should be used for research and development purposes only
- Users should respect privacy and not attempt to re-identify individuals

## Language

The dataset is primarily in **Ukrainian**, with some entries containing information in Russian or other languages.

## Acknowledgments

- Data source: [Work.ua](https://www.work.ua/)
- Extraction tools: Google Gemini 2.5 Flash for AI-powered parsing
- Processing framework: Polars for efficient data handling
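
## Example: Experience Duration from `work_experiences`

The Limitations section notes that date formats may vary, especially in `work_experiences`. The following is a minimal sketch of computing the length of one work experience in months. It assumes the `"YYYY-MM"` format and the `"present"` marker shown in the Data Example; any other format is not documented in this card, so the sketch returns `None` for it instead of guessing. The helper names `parse_year_month` and `experience_months` are hypothetical and not part of the dataset tooling.

```python
from datetime import date
from typing import Optional


def parse_year_month(value: Optional[str], today: date) -> Optional[date]:
    """Parse 'YYYY-MM' into the first day of that month.

    'present' (as in the Data Example) maps to the first day of today's month.
    Any other format returns None, because it is not documented in the card.
    """
    if value is None:
        return None
    text = value.strip().lower()
    if text == "present":
        return date(today.year, today.month, 1)
    parts = text.split("-")
    if len(parts) != 2:
        return None
    try:
        return date(int(parts[0]), int(parts[1]), 1)
    except ValueError:
        # Non-numeric parts, month outside 1..12, or year out of range.
        return None


def experience_months(entry: dict, today: date) -> Optional[int]:
    """Return whole months between start_date and end_date, or None if unknown."""
    start = parse_year_month(entry.get("start_date"), today)
    end = parse_year_month(entry.get("end_date"), today)
    if start is None or end is None or end < start:
        return None
    return (end.year - start.year) * 12 + (end.month - start.month)


# Values taken from the Data Example; the reference date is an arbitrary choice.
example = {"start_date": "2020-01", "end_date": "present"}
months = experience_months(example, today=date(2025, 3, 10))
```

Passing `today` explicitly keeps the result reproducible. Counting how many entries return `None` gives a rough measure of how often the assumed format does not hold in a given file.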

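
## Example: Checking Records Against the Schema

The Schema Overview states that all three files share the same 21 top-level fields, while the Data Quality Notes mention that the Gemini-based extraction may introduce parsing variations. The sketch below, which uses only the Python standard library, reports NDJSON records whose top-level keys differ from that field list. The function name `check_ndjson_lines` is hypothetical, and the check covers field names only, not types or nested structures.

```python
import json
from typing import Iterable, Iterator, List, Tuple

# The 21 top-level field names listed in the Schema Overview.
EXPECTED_FIELDS = frozenset({
    "id", "url", "title", "candidate_name", "age", "city", "desired_salary",
    "employment_type", "work_location_preference", "driver_license",
    "creation_date", "other_resumes", "veteran", "disability",
    "work_experiences", "recommendations", "languages", "skills",
    "educations", "additional_educations", "additional_info",
})


def check_ndjson_lines(
    lines: Iterable[str],
) -> Iterator[Tuple[int, List[str], List[str]]]:
    """Yield (line_number, missing_fields, unexpected_fields) for mismatched records."""
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Skip blank lines.
        record = json.loads(line)
        # A record that is not a JSON object has none of the expected fields.
        keys = set(record) if isinstance(record, dict) else set()
        missing = sorted(EXPECTED_FIELDS - keys)
        unexpected = sorted(keys - EXPECTED_FIELDS)
        if missing or unexpected:
            yield line_number, missing, unexpected
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `open("resumes_files_gemini.ndjson", encoding="utf-8")`, and records are checked one line at a time without loading the whole file.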
