LabID-base/OpenAlex-Afillation
Hugging Face · Updated 2026-03-28 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/LabID-base/OpenAlex-Afillation
Dataset description:
---
license: cc-by-4.0
language:
- en
- zh
- de
- fr
- es
- it
- pt
- ja
- ko
- ar
- ru
- nl
- pl
- tr
tags:
- affiliations
- nlp
- bibliometrics
- openalex
- ner
- institution-disambiguation
- academic
- text
pretty_name: OpenAlex Affiliation Dataset
size_categories:
- 1M<n<10M
task_categories:
- token-classification
- text-classification
configs:
- config_name: "2025-12"
  data_files: "data/2025-12/*.csv"
---
# OpenAlex Affiliation Dataset
This dataset provides raw, deduplicated affiliation strings from scholarly works published in December 2025. An affiliation string is the institutional description an author writes in a paper (e.g., "Department of Computer Science, MIT, Cambridge, MA, USA"), captured before any normalization or entity resolution.
## What are raw affiliation strings?
Affiliation strings are the institutional descriptions authors include in their papers, before any normalization or entity resolution:
```
Department of Computer Science, Stanford University, Stanford, CA 94305, USA
Institut fur Physik, Humboldt-Universitat zu Berlin, 12489 Berlin, Germany
Faculdade de Medicina, Universidade de Sao Paulo, Sao Paulo, Brasil
```
## Use cases
- **Institution disambiguation / NER** — parse and normalize to known entities (ROR, GRID, Wikidata)
- **NLP training data** — multilingual academic text for span detection, entity linking
- **Bibliometrics** — institutional analytics, collaboration networks
- **Affiliation normalization** — training data for models like AffilGood, S2AFF
## Data source & provenance
**Source:** [OpenAlex](https://openalex.org) — fully open index of scholarly works by OurResearch. CC BY 4.0.
**Pipeline:** [labid-base/openalex-pipeline](https://github.com/labid-base/openalex-pipeline)
Each chunk is deduplicated independently, so a string appears at most once per chunk but may recur in other chunks. `work_id` identifies the first work within the chunk in which the string appeared.
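Because deduplication is per chunk, users who need a globally unique list may want to merge chunks themselves. The following is a minimal sketch, assuming exact string equality is the right notion of "duplicate" and that rows arrive as `(work_id, raw_affiliation_string)` pairs; the function name `dedupe_across_chunks` and the toy rows are hypothetical and not part of the published pipeline.

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, str]  # (work_id, raw_affiliation_string)


def dedupe_across_chunks(chunks: Iterable[Iterable[Row]]) -> Iterator[Row]:
    """Yield each affiliation string once, keeping the first work_id seen."""
    seen = set()
    for chunk in chunks:
        for work_id, affiliation in chunk:
            if affiliation not in seen:
                seen.add(affiliation)
                yield work_id, affiliation


# Two toy chunks, invented for illustration only.
chunk_a = [("https://openalex.org/W1", "Dept. of Physics, Example University")]
chunk_b = [
    ("https://openalex.org/W2", "Dept. of Physics, Example University"),
    ("https://openalex.org/W3", "Example Institute of Technology"),
]

unique_rows = list(dedupe_across_chunks([chunk_a, chunk_b]))
```

Chunks are processed in the order given, so the retained `work_id` depends on chunk order, mirroring the "first work within the chunk" rule at a global level.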
## Quick start
```python
from datasets import load_dataset

# The config name selects the monthly snapshot; the CSV files load as a single "train" split.
ds = load_dataset("LabID-base/OpenAlex-Afillation", "2025-12")
print(ds["train"][0])
# {"work_id": "https://openalex.org/W...", "raw_affiliation_string": "Department of..."}
```
## Dataset statistics
| Month | Collection date | Works | Total entries | Unique strings | Chunks |
|-------|----------------|-------|---------------|----------------|--------|
| 2025-12 | 2026-03-27 | 704,702 | 3,595,056 | **1,557,802** | 71 |
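The figures in the table can be related to one another with a little arithmetic. The sketch below only recomputes ratios from the numbers shown above; the variable names are illustrative, and since deduplication is per chunk, the "Unique strings" column may be a sum of per-chunk counts rather than a global count.

```python
# Values copied from the 2025-12 row of the statistics table.
works = 704_702
total_entries = 3_595_056
unique_strings = 1_557_802
chunks = 71

entries_per_work = total_entries / works       # affiliation entries per work
unique_share = unique_strings / total_entries  # fraction kept after deduplication
strings_per_chunk = unique_strings / chunks    # average chunk size

print(f"{entries_per_work:.2f} entries per work")                   # 5.10
print(f"{unique_share:.1%} of entries kept as unique strings")      # 43.3%
print(f"{strings_per_chunk:,.0f} strings per chunk on average")     # 21,941
```

The per-chunk average of roughly 21,941 is consistent with the "~22K strings each" noted in the directory listing.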
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `work_id` | string | OpenAlex work ID (e.g. `https://openalex.org/W2741809807`) |
| `raw_affiliation_string` | string | Raw affiliation text as written by the author |
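Since `work_id` is stored as a full URL, it can be convenient to reduce it to the trailing key when joining against other OpenAlex data. This is a minimal sketch, assuming every `work_id` uses the `https://openalex.org/` prefix shown in the example; the helper `short_work_id` is hypothetical.

```python
OPENALEX_PREFIX = "https://openalex.org/"


def short_work_id(work_id: str) -> str:
    """Return the trailing key (e.g. 'W2741809807') from a full work URL."""
    if not work_id.startswith(OPENALEX_PREFIX):
        # Fail loudly rather than silently passing through an unexpected format.
        raise ValueError(f"unexpected work_id format: {work_id!r}")
    return work_id[len(OPENALEX_PREFIX):]


key = short_work_id("https://openalex.org/W2741809807")
print(key)  # W2741809807
```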
## Directory structure
```
data/
  2025-12/
    works_2025_12_chunk_0001.csv
    ...
    works_2025_12_chunk_0071.csv   (71 chunks, ~22K strings each)
```
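The naming pattern above can be reproduced programmatically, for example to fetch or iterate over individual chunk files. This is a sketch that assumes the four-digit, one-based chunk numbering shown for 2025-12 holds for every file in the folder; `chunk_path` is a hypothetical helper, and later months are only assumed to follow the same scheme.

```python
def chunk_path(year: int, month: int, index: int) -> str:
    """Build the relative path of one chunk file from the pattern in the listing."""
    return (
        f"data/{year:04d}-{month:02d}/"
        f"works_{year:04d}_{month:02d}_chunk_{index:04d}.csv"
    )


# The 2025-12 snapshot has 71 chunks, numbered 0001 through 0071.
paths = [chunk_path(2025, 12, i) for i in range(1, 72)]
print(paths[0])   # data/2025-12/works_2025_12_chunk_0001.csv
print(paths[-1])  # data/2025-12/works_2025_12_chunk_0071.csv
```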
## Update schedule
Updated **monthly**. Each update adds a new `data/{YYYY}-{MM}/` folder.
| Release | Period | Status |
|---------|--------|--------|
| v1 | 2025-12 | Available |
| v2 | 2026-01 | Planned |
## Citation
```bibtex
@misc{priem2022openalex,
title={OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts},
author={Priem, Jason and Piwowar, Heather and Orr, Richard},
year={2022},
eprint={2205.01833},
archivePrefix={arXiv}
}
```
Provider:
LabID-base



