jimnoneill/paper-to-field-training
Source: Hugging Face. Updated: 2026-03-03. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jimnoneill/paper-to-field-training
Resource description:
---
language: en
license: mit
tags:
- openalex
- scientific-papers
- topic-classification
- taxonomy
- deepseek
size_categories:
- 100K<n<1M
---
# Paper-to-Field Training Dataset (v3 — DeepSeek-annotated)
~200K domain-balanced scientific paper abstracts with DeepSeek-verified field labels,
sourced from [OpenAlex bulk data](https://docs.openalex.org/download-all-data/openalex-snapshot).
## Dataset Details
- **Size**: 199,998 records
- **Balance**: ~50,000 per domain (Life Sciences, Social Sciences, Physical Sciences, Health Sciences)
- **Label source**: DeepSeek LLM re-annotation of OpenAlex field labels (the original OpenAlex labels are roughly 50% noisy)
- **Confidence filter**: keeping only records with DeepSeek confidence >= 0.7 is recommended for training
- **Format**: JSONL
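The recommended confidence filter can be applied directly to the JSONL records. The following is a minimal sketch, assuming each line is one JSON object with a numeric `confidence` key as described in the Fields table; the sample records, `parse_jsonl`, and `filter_by_confidence` are hypothetical and only for illustration.

```python
import json

# Hypothetical JSONL sample: real records carry more fields and real abstracts.
SAMPLE_JSONL = """\
{"title": "Paper A", "confidence": 0.95}
{"title": "Paper B", "confidence": 0.55}
{"title": "Paper C", "confidence": 0.70}
"""


def parse_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def filter_by_confidence(records, threshold=0.7):
    """Keep records whose `confidence` is at or above the threshold."""
    return [r for r in records if r["confidence"] >= threshold]


records = parse_jsonl(SAMPLE_JSONL)
kept = filter_by_confidence(records)
print([r["title"] for r in kept])
```

The threshold is a parameter so that a stricter or looser cutoff can be tried without changing the function.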
## Fields
| Field | Description |
|-------|-------------|
| `title` | Paper title |
| `abstract` | Full abstract text |
| `field_id` | DeepSeek-assigned field ID (26 fields) |
| `field_name` | DeepSeek-assigned field name |
| `domain_id` | Domain ID (4 domains) |
| `domain_name` | Domain name |
| `confidence` | DeepSeek classification confidence (0-1) |
| `openalex_field_id` | Original OpenAlex field ID |
| `openalex_field_name` | Original OpenAlex field name |
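Because each record keeps both the DeepSeek label and the original OpenAlex label, the two can be compared to estimate how often they disagree. This is a rough sketch, assuming `field_id` and `openalex_field_id` use the same identifier scheme (the card does not state this); the helper `label_agreement` and the sample IDs are hypothetical placeholders.

```python
def label_agreement(records):
    """Fraction of records where the DeepSeek and OpenAlex field IDs match.

    IDs are compared as strings so the check works whether they are stored
    as integers or as text (an assumption, since the card does not say).
    """
    if not records:
        return 0.0
    matches = sum(
        1 for r in records
        if str(r["field_id"]) == str(r["openalex_field_id"])
    )
    return matches / len(records)


# Hypothetical records: the IDs below are placeholders, not real field IDs.
sample = [
    {"field_id": 1, "openalex_field_id": 1},
    {"field_id": 2, "openalex_field_id": 3},
    {"field_id": 4, "openalex_field_id": 4},
    {"field_id": 5, "openalex_field_id": 6},
]
agreement = label_agreement(sample)
print(f"agreement: {agreement:.2f}")
```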
## Usage
```python
from datasets import load_dataset

# Without a `split` argument this returns a DatasetDict keyed by split name.
ds = load_dataset("jimnoneill/paper-to-field-training")
```
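After loading, the stated domain balance can be sanity-checked by counting records per `domain_name`. The sketch below works on any iterable of dict-like rows, which is what iterating over a `datasets` split yields; `domain_counts` and the sample rows are hypothetical, and the split name `"train"` mentioned afterwards is an assumption.

```python
from collections import Counter


def domain_counts(rows):
    """Count records per domain from an iterable of dict-like rows."""
    return Counter(row["domain_name"] for row in rows)


# Hypothetical rows using the four domain names listed in the card.
sample_rows = [
    {"domain_name": "Life Sciences"},
    {"domain_name": "Social Sciences"},
    {"domain_name": "Physical Sciences"},
    {"domain_name": "Health Sciences"},
    {"domain_name": "Life Sciences"},
]
counts = domain_counts(sample_rows)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

With the real data this would be called as `domain_counts(ds["train"])`, assuming the dataset exposes a `train` split.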
Provided by:
jimnoneill



