
jimnoneill/paper-to-field-training

Hugging Face | Updated 2026-03-03 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/jimnoneill/paper-to-field-training
Description:
---
language: en
license: mit
tags:
- openalex
- scientific-papers
- topic-classification
- taxonomy
- deepseek
size_categories:
- 100K<n<1M
---

# Paper-to-Field Training Dataset (v3, DeepSeek-annotated)

~200K domain-balanced scientific paper abstracts with DeepSeek-verified field labels, sourced from [OpenAlex bulk data](https://docs.openalex.org/download-all-data/openalex-snapshot).

## Dataset Details

- **Size**: 199,998 records
- **Balance**: ~50,000 per domain (Life Sciences, Social Sciences, Physical Sciences, Health Sciences)
- **Label source**: DeepSeek LLM re-annotation of OpenAlex field labels (original OpenAlex labels ~50% noisy)
- **Confidence filter**: DeepSeek confidence >= 0.7 recommended for training
- **Format**: JSONL

## Fields

| Field | Description |
|-------|-------------|
| `title` | Paper title |
| `abstract` | Full abstract text |
| `field_id` | DeepSeek-assigned field ID (26 fields) |
| `field_name` | DeepSeek-assigned field name |
| `domain_id` | Domain ID (4 domains) |
| `domain_name` | Domain name |
| `confidence` | DeepSeek classification confidence (0-1) |
| `openalex_field_id` | Original OpenAlex field ID |
| `openalex_field_name` | Original OpenAlex field name |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("jimnoneill/paper-to-field-training")
```
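The recommended confidence filter (confidence >= 0.7) can be sketched in plain Python over records shaped like the fields table. This is a minimal illustration, not the author's pipeline: the helper names `filter_confident` and `label_agreement` and the sample rows are hypothetical. It also assumes that `field_id` and `openalex_field_id` share one ID scheme and can be compared directly, which the dataset card does not state.

```python
from typing import Dict, Iterable, Iterator, List

MIN_CONFIDENCE = 0.7  # threshold recommended in the dataset card


def filter_confident(
    records: Iterable[Dict], threshold: float = MIN_CONFIDENCE
) -> Iterator[Dict]:
    """Yield only records whose DeepSeek confidence meets the threshold."""
    for rec in records:
        # Records without a confidence value are treated as 0.0 (an assumption).
        if float(rec.get("confidence", 0.0)) >= threshold:
            yield rec


def label_agreement(records: Iterable[Dict]) -> float:
    """Fraction of records where the DeepSeek and OpenAlex field IDs match.

    Assumes both IDs use the same scheme; returns 0.0 for an empty input.
    """
    rows: List[Dict] = list(records)
    if not rows:
        return 0.0
    same = sum(1 for r in rows if r["field_id"] == r["openalex_field_id"])
    return same / len(rows)


# Hypothetical rows for illustration only; the values are not from the dataset.
sample = [
    {"title": "A", "field_id": 13, "openalex_field_id": 13, "confidence": 0.92},
    {"title": "B", "field_id": 27, "openalex_field_id": 13, "confidence": 0.55},
    {"title": "C", "field_id": 31, "openalex_field_id": 22, "confidence": 0.70},
]

kept = list(filter_confident(sample))
print([r["title"] for r in kept])
print(label_agreement(kept))
```

With the `datasets` library, the same predicate could be passed to `ds.filter(lambda r: r["confidence"] >= 0.7)`, assuming `confidence` is stored as a numeric column.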
Provider:
jimnoneill