Replication package for the paper: "Datasets, Bias, Licenses, and Terms of Use: A Large and Longitudinal Study on the Documentation of Hugging Face Machine Learning Models"
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/15187256
Description:
This replication package contains datasets related to the paper: "Datasets, Bias, Licenses, and Terms of Use: A Large and Longitudinal Study on the Documentation of Hugging Face Machine Learning Models"
The package contains the new data used for the journal version of the manuscript:
All data for the second snapshot (downloaded in September 2024) to answer RQ1, RQ2, and RQ3
Data from both snapshots related to terms of use (RQ4)
Scripts and data from the first snapshot (April 2023) from the ICPC 2024 paper "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" are available at the following link: https://zenodo.org/records/10058142
Root directory
Dataset
Dataset/Dataset_HF_model_list.csv: list of HF models analyzed, with the following information: id,downloads,likes,tags,pipeline_tag,pipeline_category,License,license_model_permissivity
Dataset/Dataset_GitHub_prj_list_Transformers.txt: list of GitHub projects using the transformers library
Dataset/Dataset_GitHub_prj_list_Diffusers.txt: list of GitHub projects using the diffusers library
Dataset/Dataset_GitHub_prj_frompretrained_Transformers.txt: list of GitHub projects using "from_pretrained" from the transformers library
Dataset/Dataset_GitHub_prj_frompretrained_Diffusers.txt: list of GitHub projects using "from_pretrained" from the diffusers library
Dataset/Dataset_GitHub_prj_model_used_Transformers.csv: usage pairs (project, model) for the transformers library
Dataset/Dataset_GitHub_prj_model_used_Diffusers.csv: usage pairs (project, model) for the diffusers library
Dataset/Dataset_IntersectedModels.csv: models shared between the first and second snapshots, per category
Dataset/modelsReadme: model cards of the models in the sample
Dataset/projects_with_5_or_more_stars.csv: projects with numStars greater than 5
Dataset/projects_stars_summary.csv: total number of projects by numStars
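As a quick way to inspect the model list, the sketch below counts models per value of license_model_permissivity. The column names come from the description of Dataset/Dataset_HF_model_list.csv above. The sample rows and their values are hypothetical and only illustrate the shape of the data, and the sketch assumes the file is comma-delimited with a header row.

```python
import csv
import io
from collections import Counter

# Column names as listed for Dataset/Dataset_HF_model_list.csv.
COLUMNS = ["id", "downloads", "likes", "tags", "pipeline_tag",
           "pipeline_category", "License", "license_model_permissivity"]


def count_by_permissivity(csv_file):
    """Count models per value of the license_model_permissivity column."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return Counter(row["license_model_permissivity"] for row in reader)


# Hypothetical rows for illustration only; not taken from the real file.
SAMPLE = """id,downloads,likes,tags,pipeline_tag,pipeline_category,License,license_model_permissivity
org/model-a,100,3,"['pytorch']",text-generation,NLP,apache-2.0,permissive
org/model-b,50,1,"['diffusers']",text-to-image,Multimodal,openrail,restrictive
org/model-c,10,0,[],fill-mask,NLP,mit,permissive
"""

counts = count_by_permissivity(io.StringIO(SAMPLE))
for label, n in sorted(counts.items()):
    print(label, n)
```

To run it on the real file, pass an open file handle instead of the in-memory sample, for example open("Dataset/Dataset_HF_model_list.csv", newline="", encoding="utf-8").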
RQ1
RQ1/RQ1_dataset_list_HF.txt: list of HF datasets
RQ1/RQ1_datasetTags.txt: list of models declaring the dataset tag
RQ1/RQ1_modelDataset.csv: list of models declaring the dataset tag with their respective datasets
RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
RQ2
RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
RQ3
RQ3/RQ3_License_Models.csv: model license list, categorized by permissiveness, with the respective number of occurrences
RQ3/RQ3_License_prjTransformers.csv: transformers project license list, categorized by permissiveness, with the respective number of occurrences
RQ3/RQ3_License_prjDiffusers.csv: diffusers project license list, categorized by permissiveness, with the respective number of occurrences
RQ3/RQ3_prj_model_license_permissivity_Transformers_Diffusers.csv: total list of projects that reuse the models, with their respective licenses and permissiveness, related to the Transformers and Diffusers libraries
RQ3/RQ3_prj_model_license_permissivity_Transformers_Diffusers_Starmajor5.csv: total list of projects that reuse the models, with their respective licenses and permissiveness, related to the Transformers and Diffusers libraries, for projects with numStar > 5
RQ3/RQ3_Contingency_Matrix_permissivity_Transformers_Diffusers.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows) related to Transformers and Diffusers library in terms of permissiveness
RQ3/RQ3_Contingency_Matrix_licenses_Transformers_Diffusers.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows) related to Transformers and Diffusers library in terms of licenses
RQ3/RQ3_Contingency_Matrix_permissivity_Transformers_Diffusers_Starmajor5.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows) related to Transformers and Diffusers library in terms of permissiveness for projects with numStar > 5
RQ3/RQ3_Contingency_Matrix_licenses_Transformers_Diffusers_Starmajor5.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows) related to Transformers and Diffusers library in terms of licenses for projects with numStar > 5
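The contingency tables listed above follow a simple layout: model licenses as rows, project licenses as columns, and each cell counting how often a project under the column license reuses a model under the row license. The sketch below shows one way such a table could be derived from (model license, project license) pairs. It is an illustration under stated assumptions, not the authors' script: the function name contingency_table and the sample pairs are hypothetical.

```python
from collections import Counter


def contingency_table(pairs):
    """Build a table with model licenses as rows and project licenses as columns.

    `pairs` is an iterable of (model_license, project_license) tuples,
    one per observed reuse of a model by a project.
    """
    counts = Counter(pairs)
    rows = sorted({model for model, _ in counts})
    cols = sorted({project for _, project in counts})
    # Counter returns 0 for combinations that never occur.
    table = [[counts[(m, p)] for p in cols] for m in rows]
    return rows, cols, table


# Hypothetical usage pairs for illustration only.
PAIRS = [
    ("apache-2.0", "mit"),
    ("apache-2.0", "mit"),
    ("apache-2.0", "gpl-3.0"),
    ("openrail", "mit"),
]

rows, cols, table = contingency_table(PAIRS)
print("model license \\ project license:", cols)
for name, cells in zip(rows, table):
    print(name, cells)
```

The same function applies to the permissiveness variant by passing permissiveness categories instead of license names.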
RQ4
RQ4/RQ4_Terms_of_Use_Snapshot1.csv: results of the manual labeling related to terms of use for the first snapshot
RQ4/RQ4_Terms_of_Use_Snapshot2.csv: results of the manual labeling related to terms of use for the second snapshot
Created:
2025-04-10



