cometadata/datacite-titles-descriptions-related-identifiers
Source: Hugging Face. Updated 2026-03-11; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/cometadata/datacite-titles-descriptions-related-identifiers
Resource description:
---
license: cc0-1.0
task_categories:
- text-classification
- feature-extraction
language:
- multilingual
tags:
- datacite
- scholarly-metadata
- datasets
- research-data
size_categories:
- 10M<n<100M
---
# DataCite Dataset Titles, Descriptions, and Related Identifiers
Structured extraction of titles, descriptions, and related identifiers for all records with `resourceTypeGeneral: Dataset` in the DataCite metadata corpus.
## Source
Parsed from the **2026-03 DataCite Monthly Data File**.
## Contents
- **61,096,014 records** (one row per DOI)
- Filtered to `resourceTypeGeneral: Dataset` only
- Single Parquet file with nested columns
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `doi` | `string` | The DOI identifier |
| `provider_id` | `string` | DataCite provider ID |
| `client_id` | `string` | DataCite client ID |
| `titles` | `list<struct>` | Array of `{title, titleType, lang}` |
| `descriptions` | `list<struct>` | Array of `{description, descriptionType, lang}` |
| `relatedIdentifiers` | `list<struct>` | Array of `{relatedIdentifier, relationType, relatedIdentifierType, resourceTypeGeneral}` |
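
The nested columns arrive as lists of small structs, so a typical first step is picking one value out of the list. The sketch below works on a plain Python dict shaped like one row of the table above. The helper name `main_title` and the sample record are hypothetical, and the code assumes that an absent `titleType` is stored as `None` or an empty string, which the card does not confirm.

```python
def main_title(record):
    """Return the first title without a titleType, else the first title, else None."""
    titles = record.get("titles") or []
    for entry in titles:
        # Assumption: a main title carries no titleType (None or "").
        if not entry.get("titleType"):
            return entry.get("title")
    # Fall back to whatever title comes first if every entry is typed.
    return titles[0].get("title") if titles else None


# Hypothetical record, shaped like one row of the schema.
sample = {
    "doi": "10.1234/example",
    "titles": [
        {"title": "A subtitle", "titleType": "Subtitle", "lang": "en"},
        {"title": "Example dataset", "titleType": None, "lang": "en"},
    ],
}
print(main_title(sample))
```

Falling back to the first entry keeps the helper usable on records whose titles are all typed, at the cost of sometimes returning a subtitle or translated title.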
## Usage
```python
import pyarrow.parquet as pq

# Loads the entire file into memory; with about 61 million rows this needs substantial RAM.
table = pq.read_table("data/datasets_output.parquet")
print(table.schema)
print(f"{table.num_rows:,} records")
```
### Streaming with Hugging Face Datasets
```python
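from datasets import load_dataset

# streaming=True iterates over records without downloading the full file first.
ds = load_dataset("cometadata/datacite-titles-descriptions-related-identifiers", streaming=True)

# Print the DOI and titles of the first record, then stop.
for record in ds["train"]:
    print(record["doi"], record["titles"])
    break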
```
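
Streamed records are ordinary Python dicts, so the `relatedIdentifiers` column can be filtered with a small generator. The sketch below is one possible approach, shown on hypothetical in-memory records so that it runs without a download. The name `related_links` and the sample DOIs are made up for illustration; `IsSupplementTo` is one of the DataCite relation types.

```python
def related_links(records, relation_type):
    """Yield (doi, relatedIdentifier) pairs whose relationType matches."""
    for record in records:
        # The column may be empty or missing for records without links.
        for rel in record.get("relatedIdentifiers") or []:
            if rel.get("relationType") == relation_type:
                yield record["doi"], rel.get("relatedIdentifier")


# Hypothetical records, shaped like rows of the schema.
records = [
    {
        "doi": "10.1234/data-1",
        "relatedIdentifiers": [
            {
                "relatedIdentifier": "10.1234/paper-1",
                "relationType": "IsSupplementTo",
                "relatedIdentifierType": "DOI",
                "resourceTypeGeneral": None,
            }
        ],
    },
    {"doi": "10.1234/data-2", "relatedIdentifiers": []},
]

for doi, target in related_links(records, "IsSupplementTo"):
    print(doi, "->", target)
```

The same generator accepts a streamed split, for example `itertools.islice(related_links(ds["train"], "IsSupplementTo"), 10)` to stop after ten matches.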
## License
CC0 1.0 Universal. The metadata comes from DataCite's open metadata corpus.
Provider:
cometadata



