Mehyaar/Annotated_NER_PDF_Resumes
IT Skills Named Entity Recognition (NER) Dataset
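Description
This dataset contains 5,029 resume samples, each annotated with IT skills using **Named Entity Recognition (NER)**. The skills were labeled manually on text extracted from PDF resumes, and the data is provided in JSON format. The dataset is well suited to training and evaluating NER models, particularly for extracting IT skills from resumes.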
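Highlights
- 5,029 resume samples annotated with IT skills
- IT skills labeled manually using Named Entity Recognition (NER)
- Text extracted from PDFs and annotated with IT skills
- JSON format for easy integration with NLP tools such as spaCy
- A useful resource for training and evaluating NER models for IT skill extraction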
Dataset Details
- Total resumes: 5,029
- Data format: JSON files
- Annotations: IT skills labeled using Named Entity Recognition
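Data Description
Each JSON file contains the following fields:
| Field | Description |
|---|---|
| `text` | Text extracted from the resume PDF |
| `annotations` | List of IT skills annotated in the text. Each annotation includes `start`, the skill's start position in the text (zero-based index); `end`, the skill's end position in the text (zero-based index, exclusive); and `label`, the entity type (IT skill) |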
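Because `end` is exclusive, the offsets map directly onto a Python slice. The following is a minimal sketch of reading one record, assuming each annotation is a three-element list `[start, end, label]` as the card's example JSON suggests. The function name `extract_skills` and the toy record are hypothetical and are not part of the dataset.

```python
import json


def extract_skills(record):
    """Return (surface_text, label) pairs for every annotation in a record."""
    text = record["text"]
    skills = []
    for start, end, label in record["annotations"]:
        # `start` is zero-based and `end` is exclusive, so a plain slice
        # recovers the annotated span.
        skills.append((text[start:end], label))
    return skills


# Hypothetical toy record in the same shape as the dataset's JSON files.
raw = (
    '{"text": "Skills: Python, PySpark", '
    '"annotations": [[8, 14, "SKILL: Python"], [16, 23, "SKILL: PySpark"]]}'
)
record = json.loads(raw)

for surface, label in extract_skills(record):
    print(f"{label!r} -> {surface!r}")
```

Comparing each sliced span against its label is a quick sanity check that the offsets still line up with the text after any preprocessing.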
Example JSON File
The following example shows the JSON structure used in the dataset:
```json
{
  "text": "One97 Communications Limited Data Scientist Jan 2019 to Till Date Detect important information from images and redact required fields. YOLO CNN Object-detection, OCR Insights, find anomaly or performance drop in all possible sub-space. Predict the Insurance claim probability. Estimate the premium amount to be charged B.Tech(Computer Science) from SGBAU university in 2017. M.Tech (Computer Science Engineering) from Indian Institute of Technology (IIT), Kanpur in 2019WORK EXPERIENCE EDUCATIONMACY WILLIAMS DATA SCIENTIST Data Scientist working on problems related to market research and customer analysis. I want to expand my arsenal of application building and work on different kinds of problems. Looking for a role where I can work with a coordinative team and exchange knowledge during the process. Java, C++, Python, Machine Learning, Algorithms, Natural Language Processing, Deep Learning, Computer Vision, Pattern Recognition, Data Science, Data Analysis, Software Engineer, Data Analyst, C, PySpark, Kubeflow.ABOUT SKILLS Customer browsing patterns. Predict potential RTO(Return To Origin) orders for e- commerce. Object Detection.PROJECTS ACTIVITES",
  "annotations": [
    [657, 665, "SKILL: Building"],
    [822, 828, "SKILL: python"],
    [811, 815, "SKILL: java"],
    [781, 790, "SKILL: Knowledge"],
    [877, 887, "SKILL: Processing"],
    [194, 205, "SKILL: performance"],
    [442, 452, "SKILL: Technology"],
    [1007, 1014, "SKILL: PySpark"],
    [30, 44, "SKILL: Data Scientist"],
    ...
  ]
}
```
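To feed records like this into spaCy, one common shape is the `(text, {"entities": [(start, end, label)]})` tuple. The sketch below builds that tuple without importing spaCy. It rests on two assumptions that the card does not state: every label is collapsed to a single `SKILL` type, and any overlapping spans are resolved by keeping the earliest, longest one, since spaCy's `doc.ents` does not accept overlapping entities. The helper name `to_spacy_format` is hypothetical.

```python
def to_spacy_format(record, label="SKILL"):
    """Convert one record to (text, {"entities": [...]}) with non-overlapping spans."""
    # Sort by start offset; when two spans start together, put the longer first.
    spans = sorted(record["annotations"], key=lambda a: (a[0], -(a[1] - a[0])))
    entities = []
    last_end = 0
    for start, end, _original_label in spans:
        # Keep a span only if it begins at or after the end of the last kept span.
        if start >= last_end:
            entities.append((start, end, label))
            last_end = end
    return record["text"], {"entities": entities}


# Hypothetical toy record with one overlapping pair of annotations.
toy = {
    "text": "Deep Learning and Python",
    "annotations": [
        [5, 13, "SKILL: Learning"],
        [0, 13, "SKILL: Deep Learning"],
        [18, 24, "SKILL: Python"],
    ],
}
text, ann = to_spacy_format(toy)
print(ann["entities"])
```

Whether collapsing labels and dropping overlaps is appropriate depends on the task; a span-categorization setup could keep every annotation instead.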
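Uses
This dataset can be used to:
- Train Named Entity Recognition (NER) models to identify IT skills in text.
- Evaluate how well NER models extract IT skills from resumes.
- Develop new NLP applications for skill extraction and job matching.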
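For the evaluation use case, a simple starting point is exact-match scoring over spans. The sketch below computes precision, recall, and F1 by comparing predicted and gold `(start, end)` pairs for a single resume. The function name `span_prf` and the sample spans are hypothetical, and exact matching is only one possible criterion; partial-overlap scoring would need a different comparison.

```python
def span_prf(gold, pred):
    """Exact-match precision, recall, and F1 over (start, end) spans."""
    gold_set = {tuple(span) for span in gold}
    pred_set = {tuple(span) for span in pred}
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted spans for one resume.
gold_spans = [(0, 6), (8, 15)]
pred_spans = [(0, 6), (20, 25)]
precision, recall, f1 = span_prf(gold_spans, pred_spans)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

To score a whole test set, accumulate the true-positive, predicted, and gold counts across resumes before dividing, rather than averaging per-resume scores.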




