
librarian-bots/model_cards_with_metadata_with_embeddings

Source: Hugging Face · Updated: 2023-12-18 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/librarian-bots/model_cards_with_metadata_with_embeddings
Resource description:
---
size_categories:
- 100K<n<1M
task_categories:
- text-retrieval
pretty_name: Model Card
dataset_info:
  features:
  - name: modelId
    dtype: string
  - name: author
    dtype: string
  - name: last_modified
    dtype: timestamp[us, tz=UTC]
  - name: downloads
    dtype: int64
  - name: likes
    dtype: int64
  - name: library_name
    dtype: string
  - name: tags
    sequence: string
  - name: pipeline_tag
    dtype: string
  - name: createdAt
    dtype: timestamp[us, tz=UTC]
  - name: card
    dtype: string
  - name: embedding
    sequence: float32
  splits:
  - name: train
    num_bytes: 2104883666.6585019
    num_examples: 442651
  download_size: 1243809305
  dataset_size: 2104883666.6585019
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for Hugging Face Hub Model Cards with Embeddings

This dataset consists of [model cards](https://huggingface.co/docs/hub/model-cards) for models hosted on the Hugging Face Hub. The model cards are created by the community and provide information about the model, its performance, its intended uses, and more. This dataset is updated on a daily basis and includes publicly available models on the Hugging Face Hub.

This dataset is made available to help support users wanting to work with a large number of Model Cards from the Hub. We hope that this dataset will help support research in the area of Model Cards and their use but the format of this dataset may not be useful for all use cases. If there are other features that you would like to see included in this dataset, please open a new [discussion](https://huggingface.co/datasets/librarian-bots/model_cards_with_metadata/discussions/new).

This dataset is the same as the [Hugging Face Hub Model Cards](https://huggingface.co/datasets/librarian-bots/model_cards) dataset but with the addition of embeddings for each model card. The embeddings are generated using the [jinaai/jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en) model.
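
The per-card embeddings make similarity search over model cards possible. The following is a minimal sketch, assuming each row is a dict carrying the `modelId` and `embedding` fields from the schema above. The three-dimensional toy vectors, the `example/...` model IDs, and the helper names `cosine_similarity` and `top_k` are illustrative assumptions and are not part of the dataset.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, rows, k=2):
    """Return the modelId of the k rows most similar to the query embedding."""
    scored = [
        (cosine_similarity(query_embedding, row["embedding"]), row["modelId"])
        for row in rows
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [model_id for _, model_id in scored[:k]]


# Toy rows shaped like the dataset (hypothetical IDs, made-up 3-d vectors).
rows = [
    {"modelId": "example/sentiment-model", "embedding": [0.9, 0.1, 0.0]},
    {"modelId": "example/translation-model", "embedding": [0.0, 1.0, 0.2]},
    {"modelId": "example/review-classifier", "embedding": [0.8, 0.3, 0.1]},
]

query = [1.0, 0.2, 0.0]  # stands in for an embedded search query
results = top_k(query, rows, k=2)
print(results)
```

With real data, the query text would need to be embedded with the same model as the cards so that the vectors are comparable, and a vectorized or indexed search would likely be preferable to this linear scan at several hundred thousand rows.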
## Dataset Details

### Dataset Description

- **Curated by:** Daniel van Strien
- **Language(s) (NLP):** Model cards on the Hugging Face Hub are predominantly in English but may include other languages.

## Uses

There are a number of potential uses for this dataset including:

- text mining to find common themes in model cards
- analysis of the model card format/content
- topic modelling of model cards
- analysis of the model card metadata
- training language models on model cards
- building a recommender system for model cards
- building a search engine for model cards

### Out-of-Scope Use

[More Information Needed]

## Dataset Structure

This dataset has a single split.

## Dataset Creation

### Curation Rationale

<!-- Motivation for the creation of this dataset. -->

The dataset was created to assist people in working with model cards. In particular, it was created to support research in the area of model cards and their use. It is possible to use the Hugging Face Hub API or client library to download model cards, and this option may be preferable if you have a very specific use case or require a different format.

### Source Data

The source data is `README.md` files for models hosted on the Hugging Face Hub. We do not include any other supplementary files that may be included in the model card directory.

#### Data Collection and Processing

<!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. -->

The data is downloaded using a CRON job on a daily basis.

#### Who are the source data producers?

The source data producers are the creators of the model cards on the Hugging Face Hub. This includes a broad variety of people from the community ranging from large companies to individual researchers. We do not gather any information about who created the model card in this repository although this information can be gathered from the Hugging Face Hub API.
### Annotations [optional]

There are no additional annotations in this dataset beyond the model card content.

#### Annotation process

N/A

#### Who are the annotators?

<!-- This section describes the people or systems who created the annotations. -->

N/A

#### Personal and Sensitive Information

<!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. -->

We make no effort to anonymize the data. Whilst we don't expect the majority of model cards to contain personal or sensitive information, it is possible that some model cards may contain this information. Model cards may also link to websites or email addresses.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Model cards are created by the community and we do not have any control over the content of the model cards. We do not review the content of the model cards and we do not make any claims about the accuracy of the information in the model cards. Some model cards will themselves discuss bias and sometimes this is done by providing examples of bias in either the training data or the responses provided by the model. As a result this dataset may contain examples of bias. Whilst we do not directly download any images linked to the model cards, some model cards may include images. Some of these images may not be suitable for all audiences.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

## Citation

No formal citation is required for this dataset but if you use this dataset in your work, please include a link to this dataset page.
## Dataset Card Authors

[@davanstrien](https://huggingface.co/davanstrien)

## Dataset Card Contact

[@davanstrien](https://huggingface.co/davanstrien)
Provided by:
librarian-bots
Summary of Original Information

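Dataset Card Overview

Dataset Description

Dataset Details

  • Dataset name: Hugging Face Hub Model Cards with Embeddings
  • Size category: 100K<n<1M
  • Task category: text retrieval
  • Features:
    • modelId: string
    • author: string
    • last_modified: timestamp (microsecond precision, UTC)
    • downloads: 64-bit integer
    • likes: 64-bit integer
    • library_name: string
    • tags: sequence of strings
    • pipeline_tag: string
    • createdAt: timestamp (microsecond precision, UTC)
    • card: string
    • embedding: sequence of 32-bit floats
  • Splits:
    • train: 442651 examples, 2104883666.6585019 bytes in total (about 2.1 GB)
  • Download size: 1243809305 bytes (about 1.2 GB)
  • Dataset size: 2104883666.6585019 bytes (about 2.1 GB)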
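
The feature list above can be mirrored as a typed record when working with rows in Python. This is a minimal sketch only: the class name `ModelCardRow`, the sample values, and the choice to mark `author`, `library_name`, and `pipeline_tag` as `Optional` are assumptions (the listing does not say which fields may be missing).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ModelCardRow:
    """One row of the train split, following the listed feature names and types."""
    modelId: str
    author: Optional[str]          # assumed nullable
    last_modified: datetime        # timestamp[us, tz=UTC]
    downloads: int                 # int64
    likes: int                     # int64
    library_name: Optional[str]    # assumed nullable
    tags: List[str]                # sequence of strings
    pipeline_tag: Optional[str]    # assumed nullable
    createdAt: datetime            # timestamp[us, tz=UTC]
    card: str                      # README.md content of the model
    embedding: List[float]         # sequence of float32 values


# Hypothetical row with made-up values, only to show the shape.
row = ModelCardRow(
    modelId="example/demo-model",
    author="example",
    last_modified=datetime(2023, 12, 18, tzinfo=timezone.utc),
    downloads=10,
    likes=1,
    library_name="transformers",
    tags=["en"],
    pipeline_tag=None,
    createdAt=datetime(2023, 1, 1, tzinfo=timezone.utc),
    card="# Demo model\n\nA placeholder card.",
    embedding=[0.1, 0.2, 0.3],
)
print(row.modelId, len(row.embedding))
```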

Dataset Configuration

  • Config name: default
    • Data files:
      • split: train
      • path: data/train-*

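Dataset Creation

  • Data source: README.md files of model cards on the Hugging Face Hub
  • Data collection and processing: the data is downloaded daily by a CRON job
  • Data producers: the creators of the model cards, a broad range of community members from large companies to individual researchers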

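Dataset Uses

  • Potential uses:
    • Text mining to find common themes in model cards
    • Analyzing the format/content of model cards
    • Topic modelling of model cards
    • Analyzing model card metadata
    • Training language models on model cards
    • Building a recommender system for model cards
    • Building a search engine for model cards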
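
As an illustration of the metadata-analysis use listed above, the sketch below tallies `pipeline_tag` values and individual `tags` across rows. The rows are made-up examples, the helper names `count_field` and `count_tags` are hypothetical, and treating a missing `pipeline_tag` as `None` is an assumption about the data.

```python
from collections import Counter

# Made-up rows using field names from the dataset schema.
rows = [
    {"modelId": "example/a", "pipeline_tag": "text-classification", "tags": ["en", "bert"]},
    {"modelId": "example/b", "pipeline_tag": "text-classification", "tags": ["en"]},
    {"modelId": "example/c", "pipeline_tag": None, "tags": []},
]


def count_field(rows, field):
    """Count values of a scalar field, grouping None under '(missing)'."""
    return Counter(
        row[field] if row[field] is not None else "(missing)" for row in rows
    )


def count_tags(rows):
    """Count how many rows carry each individual tag."""
    return Counter(tag for row in rows for tag in row["tags"])


pipeline_counts = count_field(rows, "pipeline_tag")
tag_counts = count_tags(rows)
print(pipeline_counts.most_common())
print(tag_counts.most_common())
```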

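Dataset Limitations

  • Dataset content: model cards are created by the community and their content is not controlled; they may contain bias or images that are not suitable for all audiences
  • Personal and sensitive information: the data is not anonymized, and model cards may contain personal or sensitive information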

Recommendations

  • Usage advice: when using the dataset, take into account the bias and sensitive information it may contain

Dataset Authors and Contact

  • Author and contact: @davanstrien
