herlambangharyoputro/indonesian-job-market-tokenized-2024

Source: Hugging Face. Updated 2025-12-17; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/herlambangharyoputro/indonesian-job-market-tokenized-2024
Resource overview:
This is the preprocessed and tokenized version of the Indonesian Job Market Dataset. Building on the raw dataset, it adds: text cleaning and normalization (Indonesian slang → formal language); tokenization of titles, skills, descriptions, responsibilities, and qualifications; skill categorization (programming, frontend, backend, database, cloud, data science, soft skills); location parsing with city extraction and remote-work detection; job-level extraction from titles (intern/entry/mid/senior/manager/director/executive); and ready-to-use features for machine learning. The dataset is in Indonesian (Bahasa Indonesia) and contains approximately 10,612 job postings across 38 columns (20 original + 18 processed). It was processed in December 2024 using 7 specialized Indonesian NLP tokenizers.
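
To make the job-level extraction and remote-work detection steps described above more concrete, here is a minimal sketch of how such rules could look. It is not the dataset author's actual pipeline: the keyword lists, the function names `extract_job_level` and `is_remote`, and the fallback to "mid" are all assumptions made for illustration.

```python
import re

# Hypothetical keyword rules (assumption): the dataset's real rules are not
# documented in this listing. Order matters: the first matching level wins.
LEVEL_KEYWORDS = [
    ("executive", ["chief", "ceo", "cto", "cfo", "vp", "vice president"]),
    ("director", ["director", "direktur"]),
    ("manager", ["manager", "manajer", "head of"]),
    ("senior", ["senior", "sr", "lead"]),
    ("intern", ["intern", "internship", "magang"]),
    ("entry", ["junior", "jr", "fresh graduate", "entry level"]),
]

# Hypothetical remote-work markers (assumption), in English and Indonesian.
REMOTE_KEYWORDS = ["remote", "wfh", "work from home", "kerja dari rumah"]


def _normalize(text):
    """Lowercase and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower())


def _contains(text, phrase):
    """Whole-word match, so 'intern' does not match 'internal'."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text) is not None


def extract_job_level(title):
    """Return a job level for a title; fall back to 'mid' if nothing matches."""
    text = _normalize(title)
    for level, keywords in LEVEL_KEYWORDS:
        if any(_contains(text, kw) for kw in keywords):
            return level
    return "mid"


def is_remote(location):
    """Return True if the location string mentions remote work."""
    text = _normalize(location)
    return any(_contains(text, kw) for kw in REMOTE_KEYWORDS)


if __name__ == "__main__":
    for title in ["Senior Backend Engineer", "Magang Data Analyst", "Internal Auditor"]:
        print(title, "->", extract_job_level(title))
    print(is_remote("Jakarta (Remote)"))
```

Whole-word matching is used here so that a title such as "Internal Auditor" is not mislabeled as an internship. A production pipeline would likely need a broader vocabulary and handling for titles that mention several levels.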
Provider:
herlambangharyoputro