cometadata/comet-datacite-enrichment-layer-ingest-format

Hugging Face | Updated 2026-02-06 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/cometadata/comet-datacite-enrichment-layer-ingest-format
Resource description:
---
license: cc0-1.0
language:
- en
tags:
- metadata
- scholarly-communication
- datacite
- doi
- enrichment
- comet
pretty_name: COMET DataCite Enrichment Layer (Ingest Format)
size_categories:
- 10M<n<100M
configs:
- config_name: author_affiliations
  data_files:
  - split: train
    path: author_affiliations_enrichments_datacite_format.jsonl
  default: true
- config_name: preprint_matching
  data_files:
  - split: train
    path: preprint_matching_enrichments_datacite_format.jsonl
- config_name: resource_type
  data_files:
  - split: train
    path: resource_type_enrichments_datacite_format.jsonl
---

# COMET Enrichment Files - DataCite Enrichment Layer Ingest Format

This dataset contains metadata enrichment records in the [DataCite](https://datacite.org/) enrichment ingest format, produced by the [COMET](https://cometadata.org/) (Collaborative Metadata) initiative. Each record represents a single metadata enrichment to be applied to an existing DataCite record, identified by DOI.

The enrichments were generated against the February 2026 version of the [DataCite monthly data file](https://support.datacite.org/docs/datacite-monthly-data-file) using the [datacite-enrichment](https://github.com/cometadata/datacite-enrichment) tool.
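Since each data file is JSON Lines with one enrichment record per line, a file can be inspected as a stream rather than loaded whole. The following is a minimal sketch using only the Python standard library. The helper names `iter_records` and `summarize` are hypothetical and not part of the dataset or the datacite-enrichment tool, and the two abbreviated sample lines stand in for a downloaded file. The `doi`, `action`, and `field` keys it reads are the ones listed in the Record Schema section.

```python
import io
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed enrichment record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(lines: Iterable[str]) -> Counter:
    """Count records per (action, field) pair while streaming."""
    counts: Counter = Counter()
    for record in iter_records(lines):
        counts[(record["action"], record["field"])] += 1
    return counts


# Abbreviated sample lines (illustrative only). With a downloaded file,
# pass open(path, encoding="utf-8") instead of the StringIO object.
sample = io.StringIO(
    '{"doi": "10.48550/arxiv.2302.11570", "action": "insert", "field": "relatedIdentifiers"}\n'
    "\n"
    '{"doi": "10.5281/zenodo.15414", "action": "update", "field": "types"}\n'
)
summary = summarize(sample)
print(summary)
```

Streaming line by line keeps memory use flat regardless of file size, which matters for the larger configs.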
## Dataset Configs

This dataset contains three configs (subsets), one for each enrichment type:

| Config | Records | Size | Description |
|--------|---------|------|-------------|
| `author_affiliations` | 8,963,119 | 8.8 GB | Institutional affiliation enrichments for creator records |
| `preprint_matching` | 812,958 | 526 MB | Links between arXiv preprints and their published versions |
| `resource_type` | 3,338,250 | 2.1 GB | Reclassifications of generic resource types to specific types |

## Record Schema

Each record is a JSON object with the following fields:

| Field | Type | Description |
|-------|------|-------------|
| `doi` | string | The DOI of the DataCite record to be enriched |
| `contributors` | array | Provenance: who produced this enrichment (COMET as Producer) |
| `resources` | array | Provenance: related project DOI and source dataset reference |
| `action` | string | The enrichment action: `insert`, `updateChild`, or `update` |
| `field` | string | The DataCite metadata field being enriched |
| `originalValue` | object | The original value being modified (present for `updateChild` and `update` actions) |
| `enrichedValue` | object | The new or updated value to apply |

## Config Details

### author_affiliations

Adds institutional affiliations to creator records on arXiv preprints registered in DataCite.
- **Action**: `updateChild` (finds and updates a specific creator within the `creators` array)
- **Field**: `creators`
- **Source data**: [cometadata/arxiv-author-affiliations-matched-ror-ids](https://huggingface.co/datasets/cometadata/arxiv-author-affiliations-matched-ror-ids)
- **Project DOI**: [10.82461/160e-8q92](https://doi.org/10.82461/160e-8q92)

Example record:

```json
{
  "doi": "10.48550/arxiv.1109.4119",
  "contributors": [
    {"name": "COMET", "contributorType": "Producer", "nameType": "Organizational"}
  ],
  "resources": [
    {"relatedIdentifier": "http://doi.org/10.82461/160e-8q92", "relatedIdentifierType": "DOI", "relationType": "IsDocumentedBy", "resourceTypeGeneral": "Project"},
    {"relatedIdentifier": "https://huggingface.co/datasets/cometadata/arxiv-author-affiliations-matched-ror-ids", "relatedIdentifierType": "URL", "relationType": "IsDerivedFrom", "resourceTypeGeneral": "Dataset"}
  ],
  "action": "updateChild",
  "field": "creators",
  "originalValue": {
    "nameType": "Personal",
    "givenName": "Philip D.",
    "familyName": "Mannheim",
    "name": "Mannheim, Philip D."
  },
  "enrichedValue": {
    "nameType": "Personal",
    "givenName": "Philip D.",
    "familyName": "Mannheim",
    "name": "Mannheim, Philip D.",
    "affiliation": [
      {
        "name": "Department of Physics, University of Connecticut, Storrs, CT 06269, USA",
        "affiliationIdentifier": "https://ror.org/02der9h97",
        "affiliationIdentifierScheme": "ROR",
        "schemeUri": "https://ror.org"
      }
    ]
  }
}
```

### preprint_matching

Links arXiv preprints to their published versions by inserting `relatedIdentifiers` entries with the `IsVersionOf` relation type.
- **Action**: `insert` (appends a new entry to the `relatedIdentifiers` array)
- **Field**: `relatedIdentifiers`
- **Source data**: [cometadata/arxiv-preprint-matching-results](https://huggingface.co/datasets/cometadata/arxiv-preprint-matching-results)
- **Project DOI**: [10.82461/m8a8-m211](https://doi.org/10.82461/m8a8-m211)

Example record:

```json
{
  "doi": "10.48550/arxiv.2302.11570",
  "contributors": [
    {"name": "COMET", "contributorType": "Producer", "nameType": "Organizational"}
  ],
  "resources": [
    {"relatedIdentifier": "10.82461/m8a8-m211", "relatedIdentifierType": "DOI", "relationType": "IsDocumentedBy", "resourceTypeGeneral": "Project"},
    {"relatedIdentifier": "https://huggingface.co/datasets/cometadata/arxiv-preprint-matching-results", "relatedIdentifierType": "DOI", "relationType": "IsDerivedFrom", "resourceTypeGeneral": "Dataset"}
  ],
  "action": "insert",
  "field": "relatedIdentifiers",
  "enrichedValue": {
    "relatedIdentifier": "10.58530/2024/0038",
    "relatedIdentifierType": "DOI",
    "relationType": "IsVersionOf"
  }
}
```

### resource_type

Reclassifies DataCite records that have a generic `resourceTypeGeneral` of "Text" to more specific types such as `JournalArticle`, `Preprint`, etc., based on procedural analysis.
- **Action**: `update` (replaces the entire `types` field)
- **Field**: `types`
- **Source data**: [cometadata/datacite-procedural-resource-type-general-reclassifications](https://huggingface.co/datasets/cometadata/datacite-procedural-resource-type-general-reclassifications)
- **Project DOI**: [10.82461/bpzr-jd55](https://doi.org/10.82461/bpzr-jd55)

Example record:

```json
{
  "doi": "10.5281/zenodo.15414",
  "contributors": [
    {"name": "COMET", "contributorType": "Producer", "nameType": "Organizational"}
  ],
  "resources": [
    {"relatedIdentifier": "10.82461/bpzr-jd55", "relatedIdentifierType": "DOI", "relationType": "IsDocumentedBy", "resourceTypeGeneral": "Project"},
    {"relatedIdentifier": "https://huggingface.co/datasets/cometadata/datacite-procedural-resource-type-general-reclassifications", "relatedIdentifierType": "URL", "relationType": "IsDerivedFrom", "resourceTypeGeneral": "Dataset"}
  ],
  "action": "update",
  "field": "types",
  "originalValue": {"resourceTypeGeneral": "Text"},
  "enrichedValue": {"resourceTypeGeneral": "JournalArticle"}
}
```

## How These Files Were Generated

These enrichment records were produced using the [datacite-enrichment](https://github.com/cometadata/datacite-enrichment) tool, which:

1. Takes source enrichment data (from the datasets linked above) and a YAML configuration file defining field mappings, filters, and provenance metadata
2. Converts it to a standardized enrichment format using domain-specific transformers (one per enrichment type)
3. Produces records that carry full provenance tracking: contributor info (COMET as Producer), references to the source dataset and project DOI, and original/enriched values

The conversion was run against all files in the February 2026 version of the [DataCite monthly data file](https://support.datacite.org/docs/datacite-monthly-data-file).

## License

This dataset is released under [CC0 1.0](https://creativecommons.org/publicdomain/zero/1.0/).
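To make the three actions from the Record Schema concrete, here is a minimal sketch of how a consumer might apply one enrichment record to a DataCite attributes dictionary. It is not the DataCite ingest implementation or part of the datacite-enrichment tool. The function name `apply_enrichment` is hypothetical, and the matching rules are assumptions: `updateChild` replaces the first array element exactly equal to `originalValue`, and `update` replaces the field only when the stored value still contains every key/value pair in `originalValue`.

```python
import copy


def apply_enrichment(attributes: dict, record: dict) -> dict:
    """Return a copy of `attributes` with one enrichment record applied."""
    result = copy.deepcopy(attributes)
    action, field = record["action"], record["field"]
    enriched = record["enrichedValue"]

    if action == "insert":
        # Append a new entry to the array-valued field.
        result.setdefault(field, []).append(enriched)
    elif action == "update":
        # Replace the entire field, guarded by an assumed staleness check:
        # every pair in originalValue must still be present.
        original = record["originalValue"]
        current = result.get(field, {})
        if all(current.get(k) == v for k, v in original.items()):
            result[field] = enriched
    elif action == "updateChild":
        # Replace the first child equal to originalValue (assumed rule).
        original = record["originalValue"]
        children = result.get(field, [])
        for i, child in enumerate(children):
            if child == original:
                children[i] = enriched
                break
    else:
        raise ValueError(f"unknown action: {action!r}")
    return result


# Usage with the values from the resource_type example record.
doc = {"types": {"resourceTypeGeneral": "Text"}}
rec = {
    "action": "update",
    "field": "types",
    "originalValue": {"resourceTypeGeneral": "Text"},
    "enrichedValue": {"resourceTypeGeneral": "JournalArticle"},
}
updated = apply_enrichment(doc, rec)
print(updated["types"])
```

Working on a deep copy leaves the input record untouched, so a failed or skipped enrichment never leaves a half-modified document. Real DataCite creators may carry extra keys that an exact-equality match would miss, so a production matcher would likely need a looser comparison.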
Provider: cometadata