cometadata/datacite-affiliations-matched-ror-ids-datacite-enrichment-format
Hugging Face | Updated 2026-04-20 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/cometadata/datacite-affiliations-matched-ror-ids-datacite-enrichment-format
Resource description:
---
license: cc0-1.0
language:
- en
tags:
- ror
- datacite
- affiliations
- scholarly-metadata
- research-organizations
pretty_name: DataCite Author Affiliations Matched to ROR IDs (DataCite Enrichment Format)
size_categories:
- 10M<n<100M
---
# DataCite Author Affiliations Matched to ROR IDs - DataCite Enrichment Format
This dataset contains 20,512,320 enrichment records mapping author affiliation strings from DataCite metadata to [Research Organization Registry (ROR)](https://ror.org) identifiers. It covers 5,812,774 unique DOIs from the [DataCite Public Data File](https://support.datacite.org/docs/datacite-public-data-file).
Each record is formatted as a DataCite enrichment input record, designed for use with the [DataCite enrichment pipeline](https://github.com/cometadata/datacite-enrichment). Records use the `updateChild` action on the `creators` field, providing a per-creator `originalValue`/`enrichedValue` pair that adds ROR identifiers to existing affiliation strings.
## Record Structure
| Field | Description |
|-------|-------------|
| `doi` | The DOI of the DataCite record |
| `contributors` | Provenance metadata identifying the producer of the enrichment |
| `resources` | Related resources (source dataset) |
| `field` | `"creators"` - the DataCite metadata field being enriched |
| `action` | `"updateChild"` - the enrichment action type |
| `originalValue` | The creator object in its pre-enrichment state, including name fields (`name`, `nameType`, `givenName`, `familyName`) and the original affiliation array |
| `enrichedValue` | The creator object with ROR identifiers added to matched affiliations (`affiliationIdentifier`, `affiliationIdentifierScheme`, `schemeUri`) |
## Example Record
```json
{
  "doi": "10.5281/zenodo.8099012",
  "contributors": [
    {
      "name": "COMET",
      "nameType": "Organizational",
      "contributorType": "Producer"
    }
  ],
  "resources": [
    {
      "relatedIdentifier": "https://huggingface.co/datasets/cometadata/datacite-affiliations-matched-ror-ids-datacite-enrichment-format",
      "relatedIdentifierType": "URL",
      "relationType": "IsDerivedFrom",
      "resourceTypeGeneral": "Dataset"
    }
  ],
  "field": "creators",
  "action": "updateChild",
  "originalValue": {
    "affiliation": [
      {
        "name": "Purple Mountain Observatory, Chinese Academy of Sciences"
      }
    ],
    "familyName": "Hsu",
    "givenName": "Weibiao",
    "name": "Hsu, Weibiao"
  },
  "enrichedValue": {
    "affiliation": [
      {
        "affiliationIdentifier": "https://ror.org/034t30j35",
        "affiliationIdentifierScheme": "ROR",
        "name": "Purple Mountain Observatory, Chinese Academy of Sciences",
        "schemeUri": "https://ror.org"
      }
    ],
    "familyName": "Hsu",
    "givenName": "Weibiao",
    "name": "Hsu, Weibiao"
  }
}
```
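Because `originalValue` and `enrichedValue` differ only in the identifier fields added to matched affiliations, the ROR IDs contributed by a record can be recovered by comparing the two. The following is a minimal sketch of that comparison; the function name `added_ror_ids` is hypothetical and is not part of the enrichment pipeline, and the sample record is a trimmed copy of the example above.

```python
def added_ror_ids(record: dict) -> list:
    """Return (affiliation name, ROR ID) pairs found in enrichedValue
    that were not already present in originalValue."""
    # Identifier pairs that existed before enrichment
    original = {
        (aff.get("name"), aff.get("affiliationIdentifier"))
        for aff in record["originalValue"].get("affiliation", [])
    }
    added = []
    for aff in record["enrichedValue"].get("affiliation", []):
        ror_id = aff.get("affiliationIdentifier")
        is_ror = aff.get("affiliationIdentifierScheme") == "ROR"
        if ror_id and is_ror and (aff.get("name"), ror_id) not in original:
            added.append((aff.get("name"), ror_id))
    return added


# Trimmed copy of the example record (contributors and resources omitted)
record = {
    "doi": "10.5281/zenodo.8099012",
    "field": "creators",
    "action": "updateChild",
    "originalValue": {
        "affiliation": [
            {"name": "Purple Mountain Observatory, Chinese Academy of Sciences"}
        ],
        "name": "Hsu, Weibiao",
    },
    "enrichedValue": {
        "affiliation": [
            {
                "affiliationIdentifier": "https://ror.org/034t30j35",
                "affiliationIdentifierScheme": "ROR",
                "name": "Purple Mountain Observatory, Chinese Academy of Sciences",
                "schemeUri": "https://ror.org",
            }
        ],
        "name": "Hsu, Weibiao",
    },
}

print(added_ror_ids(record))
```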
## Methodology
[match-datacite-affiliations-to-ror-ids](https://github.com/cometadata/match-datacite-affiliations-to-ror-ids) was used to extract affiliations from the DataCite Public Data File, match them to ROR IDs, and reconcile the results into enrichment records.
1. Unique affiliation strings were extracted from the DataCite Public Data File
2. Each affiliation string was matched to a ROR ID using the [single search matching strategy](https://ror.readme.io/v2/docs/api-affiliation) in the ROR API, with [v2.3 of the ROR data](https://doi.org/10.5281/zenodo.18761279)
3. Matched ROR IDs were reconciled back to individual DOI/author/affiliation records
4. Only creators with at least one successful ROR match are included
5. Affiliations that already had ROR IDs assigned in the source DataCite metadata were excluded from matching
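Steps 3 to 5 can be illustrated with a short sketch. This is not the code from match-datacite-affiliations-to-ror-ids; the function `build_enrichment` and the `matches` lookup (affiliation string to ROR ID) are hypothetical names, and the `contributors` and `resources` provenance blocks are left out for brevity. Skipping affiliations that carry any existing identifier, rather than only ROR IDs, is also an assumption.

```python
import copy
from typing import Optional

ROR_SCHEME_URI = "https://ror.org"


def build_enrichment(doi: str, creator: dict, matches: dict) -> Optional[dict]:
    """Build an updateChild record for one creator, or return None
    when none of the creator's affiliations has a match."""
    enriched = copy.deepcopy(creator)  # leave the original creator untouched
    matched_any = False
    for aff in enriched.get("affiliation", []):
        # Step 5: affiliations that already carry an identifier are skipped
        # (the source only mentions ROR IDs; other schemes are an assumption).
        if aff.get("affiliationIdentifier"):
            continue
        ror_id = matches.get(aff.get("name", ""))
        if ror_id is None:
            continue
        aff["affiliationIdentifier"] = ror_id
        aff["affiliationIdentifierScheme"] = "ROR"
        aff["schemeUri"] = ROR_SCHEME_URI
        matched_any = True
    if not matched_any:
        return None  # Step 4: creators without any match are dropped
    return {
        "doi": doi,
        "field": "creators",
        "action": "updateChild",
        "originalValue": creator,
        "enrichedValue": enriched,
    }


creator = {
    "name": "Hsu, Weibiao",
    "affiliation": [
        {"name": "Purple Mountain Observatory, Chinese Academy of Sciences"}
    ],
}
matches = {
    "Purple Mountain Observatory, Chinese Academy of Sciences": "https://ror.org/034t30j35"
}
enrichment = build_enrichment("10.5281/zenodo.8099012", creator, matches)
```

The deep copy keeps `originalValue` in its pre-enrichment state, which is what the record format requires of that field.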
## Statistics
| Metric | Value |
|--------|-------|
| Total enrichment records | 20,512,320 |
| Unique DOIs | 5,812,774 |
| Unique affiliation strings extracted | 1,444,079 |
| Affiliation strings matched to ROR IDs | 763,766 |
| ROR data release | v2.3 (2026-02-24) |
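Two ratios follow directly from the table: the share of unique affiliation strings that received a ROR match (roughly 53%) and the average number of enrichment records per DOI (about 3.5). The snippet below only reproduces that arithmetic from the published figures. Note that the match rate is computed over unique strings, not over records.

```python
# Figures copied from the statistics table
extracted_strings = 1_444_079
matched_strings = 763_766
total_records = 20_512_320
unique_dois = 5_812_774

match_rate = matched_strings / extracted_strings
unmatched_strings = extracted_strings - matched_strings
records_per_doi = total_records / unique_dois

print(f"String-level match rate: {match_rate:.1%}")
print(f"Unmatched unique strings: {unmatched_strings:,}")
print(f"Enrichment records per DOI: {records_per_doi:.2f}")
```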
## File Format
JSONL (JSON Lines), one enrichment record per line.
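With more than 20 million records, the file is best read as a stream rather than loaded whole. A minimal sketch using only the Python standard library is shown below; the helper names and the file name in the trailing comment are hypothetical, and the two sample lines stand in for real file content.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed enrichment record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_records_per_doi(lines: Iterable[str]) -> Counter:
    """Count how many enrichment records each DOI has."""
    return Counter(rec["doi"] for rec in iter_records(lines))


# Two in-memory sample lines; a real run would pass an open file handle.
sample_lines = [
    '{"doi": "10.5281/zenodo.8099012", "field": "creators", "action": "updateChild"}\n',
    '{"doi": "10.5281/zenodo.8099012", "field": "creators", "action": "updateChild"}\n',
]
counts = count_records_per_doi(sample_lines)
print(counts.most_common(1))

# With a local copy of the data (the file name is an assumption):
# with open("enrichments.jsonl", encoding="utf-8") as fh:
#     counts = count_records_per_doi(fh)
```

Passing the open file handle straight to `iter_records` keeps memory use flat, since each line is parsed and discarded before the next is read.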
## Source Code
- [match-datacite-affiliations-to-ror-ids](https://github.com/cometadata/match-datacite-affiliations-to-ror-ids) - Extract affiliations, match to ROR IDs, and reconcile into enrichment records
- [datacite-enrichment](https://github.com/cometadata/datacite-enrichment) - DataCite enrichment pipeline
## Related Datasets
- [datacite-affiliations-matched-ror-ids](https://huggingface.co/datasets/cometadata/datacite-affiliations-matched-ror-ids) - Source matching data in tabular format
Provider:
cometadata



