Dataset metadata of known Dataverse installations, August 2024
Source: DataCite Commons. Updated 2025-01-02; indexed 2025-04-15.
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/2SA6SN
Description:
<p>This dataset contains the metadata of the datasets published in 101 Dataverse installations, information about the metadata blocks of 106 installations, and the lists of pre-defined licenses or dataset terms that depositors can apply to datasets in the 88 installations that were running versions of the Dataverse software that include the "multiple-license" feature.
<p>The data is useful for improving understanding of how certain Dataverse features and metadata fields are used and for learning about the quality of dataset-level and file-level metadata within and across Dataverse installations.
<p><strong>How the metadata was downloaded</strong>
<p>The dataset metadata and metadata block JSON files were downloaded from each installation between August 25 and August 30, 2024 using a "get_dataverse_installations_metadata" function in a collection of Python functions at <a href="https://github.com/jggautier/dataverse-scripts/blob/main/dataverse_repository_curation_assistant/dataverse_repository_curation_assistant_functions.py">https://github.com/jggautier/dataverse-scripts/blob/main/dataverse_repository_curation_assistant/dataverse_repository_curation_assistant_functions.py</a>.
<p>To get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL for which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects this CSV file and uses the listed API tokens to get metadata and other information from installations that require API tokens for certain API endpoints.
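<p>As a rough illustration of how such a token file could be consumed, here is a minimal sketch. It is not the code from the linked repository: the function names <code>load_api_tokens</code> and <code>auth_headers</code> are hypothetical, and only the "hostname" and "apikey" column names come from the description above. The <code>X-Dataverse-key</code> header is the one the Dataverse native API accepts for API tokens.

```python
import csv


def load_api_tokens(csv_path):
    """Return a {hostname: apikey} dict from the two-column CSV file."""
    tokens = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Normalize trailing slashes so lookups are consistent.
            hostname = (row["hostname"] or "").strip().rstrip("/")
            apikey = (row["apikey"] or "").strip()
            if hostname and apikey:
                tokens[hostname] = apikey
    return tokens


def auth_headers(installation_url, tokens):
    """Build request headers, adding a token only when one is listed."""
    apikey = tokens.get(installation_url.strip().rstrip("/"))
    # Installations without a listed token are queried anonymously.
    return {"X-Dataverse-key": apikey} if apikey else {}
```

<p>The returned dictionary can be passed as the headers of whatever HTTP client is used; installations missing from the CSV file simply get no token header.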
<p><strong>How the files are organized</strong>
<pre>
├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author_2024.08.25-2024.08.30.csv
│   ├── contributor_2024.08.25-2024.08.30.csv
│   ├── data_source_2024.08.25-2024.08.30.csv
│   ├── ...
│   └── topic_classification_2024.08.25-2024.08.30.csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2024.08.26_15.52.42.zip
│   │   ├── dataset_pids_Abacus_2024.08.26_15.52.42.csv
│   │   ├── Dataverse_JSON_metadata_2024.08.26_15.52.42
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
│   │   │   └── ...
│   │   └── metadatablocks_v5.9
│   │       ├── astrophysics_v5.9.json
│   │       ├── biomedical_v5.9.json
│   │       ├── citation_v5.9.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2024.08.26_00.02.51.zip
│   ├── ...
│   └── Yale_Dataverse_2024.08.25_03.52.57.zip
├── dataverse_installations_summary_2024.08.30.csv
├── dataset_pids_from_most_known_dataverse_installations_2024.08.csv
├── license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv
└── metadatablocks_from_most_known_dataverse_installations_2024.08.30.csv
</pre>
<p>This dataset contains two directories and four CSV files not in a directory.
<p>One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the "Citation" metadata block and "Geospatial" metadata block of datasets in the 101 Dataverse installations. For example, author_2024.08.25-2024.08.30.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in 101 installations, with a column for each of the four child fields: author name, affiliation, identifier type, and identifier.
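<p>One simple way to explore these CSV files is to count how often each value, including a blank, appears in a column. The sketch below uses only the Python standard library. The helper name <code>count_column_values</code> is hypothetical, and the column name has to be taken from the header row of the file being read, since the exact headers are not listed in this description.

```python
import csv
from collections import Counter


def count_column_values(csv_path, column):
    """Count how often each value appears in one column; blanks count as ''."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"{column!r} is not a column in {csv_path}")
        for row in reader:
            # Short rows yield None for missing cells; treat them as blank.
            counts[(row[column] or "").strip()] += 1
    return counts
```

<p>Calling <code>counts.most_common(10)</code> on the result lists the ten most frequent values, and <code>counts[""]</code> gives the number of rows where the field was left empty.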
<p>The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 106 zip files, one zip file for each of the 106 Dataverse installations whose sites were functioning when I attempted to collect their metadata. Each zip file contains a directory with JSON files that have information about the installation's metadata fields, such as the field names and how they're organized. For installations that had published datasets and whose dataset metadata I was able to download using Dataverse APIs, the zip file also contains:
<ul>
<li>A CSV file listing information about the datasets published in the installation, including a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset.
<li>A directory of JSON files that contain the metadata of the installation's published, non-deaccessioned dataset versions in the Dataverse JSON metadata schema.
</ul>
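<p>The sketch below shows one way to read dataset JSON files directly out of an installation's zip file and pull a single field from them. It rests on assumptions that should be checked against the actual files: that dataset files sit under a directory whose name contains "Dataverse_JSON_metadata", and that each file holds a "metadataBlocks" object whose blocks list their fields with "typeName" and "value" keys, as in the Dataverse JSON layout. All function names are hypothetical.

```python
import json
import zipfile


def find_metadata_blocks(node):
    """Return the first 'metadataBlocks' dict found anywhere in parsed JSON."""
    if isinstance(node, dict):
        if isinstance(node.get("metadataBlocks"), dict):
            return node["metadataBlocks"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_metadata_blocks(child)
        if found is not None:
            return found
    return None


def field_value(dataset_json, block_name, type_name):
    """Return the value of one top-level field in one block, or None."""
    blocks = find_metadata_blocks(dataset_json) or {}
    for field in blocks.get(block_name, {}).get("fields", []):
        if field.get("typeName") == type_name:
            return field.get("value")
    return None


def iter_dataset_json(zip_path):
    """Yield (member_name, parsed_json) for dataset JSON files in one zip."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Assumption: dataset files live under a directory whose name
            # contains "Dataverse_JSON_metadata".
            if "Dataverse_JSON_metadata" in name and name.endswith(".json"):
                with zf.open(name) as f:
                    yield name, json.load(f)
```

<p>Searching for "metadataBlocks" recursively, instead of hard-coding a path to it, keeps the sketch usable whether or not the JSON is wrapped in an outer envelope.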
<p>The dataverse_installations_summary_2024.08.30.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata included and not included in this dataset.
<p>The dataset_pids_from_most_known_dataverse_installations_2024.08.csv file contains the dataset PIDs of published datasets in 101 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It is the union of the "dataset_pids_....csv" files from the 101 zip files in the dataverse_json_metadata_from_each_known_dataverse_installation directory.
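<p>A union like this could be rebuilt from the zip files along the following lines. This is a sketch under assumptions, not the script used to produce the file: the function name <code>union_dataset_pid_rows</code> and the added "source_zip" column are hypothetical, and the per-installation files are located only by their "dataset_pids_" filename prefix.

```python
import csv
import io
import zipfile
from pathlib import Path


def union_dataset_pid_rows(zip_dir):
    """Collect rows from every dataset_pids_*.csv inside every zip in zip_dir."""
    rows = []
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            for name in zf.namelist():
                base = name.rsplit("/", 1)[-1]
                if base.startswith("dataset_pids_") and base.endswith(".csv"):
                    with zf.open(name) as raw:
                        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                        for row in csv.DictReader(text):
                            # Hypothetical extra column recording the source zip.
                            row["source_zip"] = zip_path.name
                            rows.append(row)
    return rows
```

<p>Zip files without a "dataset_pids_" CSV file contribute no rows, which matches the description of installations whose dataset metadata could not be collected.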
<p>The license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv file contains information about the licenses and data use agreements that some installations let depositors choose when creating datasets. When I collected this data, 88 of the available 106 installations were running versions of the Dataverse software that allow depositors to choose a "predefined license or data use agreement" from a dropdown menu in the dataset deposit form. For more information about this Dataverse feature, see <a href="https://guides.dataverse.org/en/5.14/user/dataset-management.html#choosing-a-license">https://guides.dataverse.org/en/5.14/user/dataset-management.html#choosing-a-license</a>.
<p>The metadatablocks_from_most_known_dataverse_installations_2024.08.30.csv file contains the metadata block names, field names, child field names (if the field is a compound field), display names, descriptions/tooltip text, and watermarks of fields in the 106 Dataverse installations' metadata blocks. This file is useful for learning about the metadata fields and field structures used in each installation. The CSV file was created using a Python script at <a href="https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_csv_file_with_metadata_block_fields_of_all_installations.py">https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_csv_file_with_metadata_block_fields_of_all_installations.py</a>, which finds each installation's metadata block JSON files and extracts from them information about each field.
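<p>The extraction step can be pictured roughly as follows. This sketch is not the linked script. It assumes each metadata block JSON file has a "fields" object keyed by field name, that compound fields nest their children under "childFields", and that each field carries "name", "displayName", "description", and "watermark" keys. The function name and the output column names are hypothetical.

```python
def flatten_metadata_block(block_json):
    """Turn one metadata block JSON document into a list of flat field rows."""
    # Assumption: the block may be wrapped in an API envelope under "data".
    block = block_json.get("data", block_json)
    rows = []

    def add_rows(fields, parent_name):
        for field in fields.values():
            rows.append({
                "metadatablock_name": block.get("name", ""),
                "parentfield_name": parent_name,
                "field_name": field.get("name", ""),
                "display_name": field.get("displayName", ""),
                "description": field.get("description", ""),
                "watermark": field.get("watermark", ""),
            })
            # Compound fields carry their child fields under "childFields".
            add_rows(field.get("childFields", {}), field.get("name", ""))

    add_rows(block.get("fields", {}), "")
    return rows
```

<p>Each returned row can then be written out with <code>csv.DictWriter</code>, giving one line per field and per child field.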
<p><strong>Known errors</strong>
<p>The metadata of a few datasets from several known and functioning installations could not be downloaded.
<p>In some cases, this is because of download timeouts caused by the datasets' relatively large metadata exports, which contain information about the datasets' large number of versions and files.
<p>In other cases, datasets were publicly findable but in unpublished or deaccessioned states that prevented me from downloading their metadata export.
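<p>Failures like these can be recorded per dataset instead of stopping a run. The following is a minimal sketch of that pattern, not the author's implementation: <code>fetch</code> is a hypothetical caller-supplied function that downloads one dataset's metadata and raises on a timeout or an error response, and the status column names are illustrative.

```python
def download_with_status(pids, fetch, timeout_seconds=60):
    """Attempt each download once and record the outcome per PID."""
    metadata = {}
    status_rows = []
    for pid in pids:
        try:
            metadata[pid] = fetch(pid, timeout_seconds)
            status_rows.append({"dataset_pid": pid, "downloaded": True, "error": ""})
        except Exception as err:  # e.g. a timeout or an HTTP error raised by fetch
            status_rows.append(
                {"dataset_pid": pid, "downloaded": False, "error": type(err).__name__}
            )
    return metadata, status_rows
```

<p>Keeping the failed PIDs in the status rows makes it possible to retry them later with a longer timeout.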
<p><strong>About metadata blocks</strong>
<p>Read about the Dataverse software's metadata blocks system at <a href="http://guides.dataverse.org/en/6.3/admin/metadatacustomization.html">http://guides.dataverse.org/en/6.3/admin/metadatacustomization.html</a>.
Provider:
Harvard Dataverse
Created:
2024-08-27



