librarian-bots/model_cards_with_metadata_with_embeddings

Name: librarian-bots/model_cards_with_metadata_with_embeddings
Creator: librarian-bots
Published: 2023-12-18 10:15:58
License: No description available

Hugging Face | Updated 2023-12-18 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/librarian-bots/model_cards_with_metadata_with_embeddings


Resource description:

---
size_categories:
- 100K<n<1M
task_categories:
- text-retrieval
pretty_name: Model Card
dataset_info:
  features:
  - name: modelId
    dtype: string
  - name: author
    dtype: string
  - name: last_modified
    dtype: timestamp[us, tz=UTC]
  - name: downloads
    dtype: int64
  - name: likes
    dtype: int64
  - name: library_name
    dtype: string
  - name: tags
    sequence: string
  - name: pipeline_tag
    dtype: string
  - name: createdAt
    dtype: timestamp[us, tz=UTC]
  - name: card
    dtype: string
  - name: embedding
    sequence: float32
  splits:
  - name: train
    num_bytes: 2104883666.6585019
    num_examples: 442651
  download_size: 1243809305
  dataset_size: 2104883666.6585019
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for Hugging Face Hub Model Cards with Embeddings

This dataset consists of [model cards](https://huggingface.co/docs/hub/model-cards) for models hosted on the Hugging Face Hub. The model cards are created by the community and provide information about the model, its performance, its intended uses, and more. This dataset is updated on a daily basis and includes publicly available models on the Hugging Face Hub.

This dataset is made available to help support users wanting to work with a large number of Model Cards from the Hub. We hope that this dataset will help support research in the area of Model Cards and their use but the format of this dataset may not be useful for all use cases. If there are other features that you would like to see included in this dataset, please open a new [discussion](https://huggingface.co/datasets/librarian-bots/model_cards_with_metadata/discussions/new).

This dataset is the same as the [Hugging Face Hub Model Cards](https://huggingface.co/datasets/librarian-bots/model_cards) dataset but with the addition of embeddings for each model card. The embeddings are generated using the [jinaai/jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en) model.
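
Because each row pairs a model card with a precomputed `embedding`, one natural use is nearest-neighbour search over cards. The following is a minimal sketch under stated assumptions, not the maintainers' method: `load_cards`, `cosine_similarity`, and `top_k` are hypothetical helper names, and the loading step assumes the third-party `datasets` library is installed. The ranking itself uses only the standard library, so it can be tried on toy vectors first.

```python
import math
from typing import Iterable, List, Mapping, Sequence, Tuple


def load_cards(streaming: bool = True):
    """Load the train split (assumes the `datasets` library is installed).

    Streaming avoids downloading the full ~1.2 GB before iterating.
    """
    from datasets import load_dataset  # lazy import: third-party dependency

    return load_dataset(
        "librarian-bots/model_cards_with_metadata_with_embeddings",
        split="train",
        streaming=streaming,
    )


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], rows: Iterable[Mapping], k: int = 5
) -> List[Tuple[float, str]]:
    """Return the k (score, modelId) pairs most similar to `query`.

    Each row is expected to carry the `modelId` and `embedding` features
    listed in the dataset metadata; rows without an embedding are skipped.
    """
    scored = [
        (cosine_similarity(query, row["embedding"]), row["modelId"])
        for row in rows
        if row["embedding"]
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

For a meaningful search over the real data, the query vector would need to come from the same embedding model as the stored vectors (jinaai/jina-embeddings-v2-base-en). A call such as `top_k(query_vec, itertools.islice(load_cards(), 1000))` would then rank the first 1,000 streamed rows; a brute-force scan like this is only a starting point, and a vector index is likely a better fit for the full dataset.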
## Dataset Details

### Dataset Description

- **Curated by:** Daniel van Strien
- **Language(s) (NLP):** Model cards on the Hugging Face Hub are predominantly in English but may include other languages.

## Uses

There are a number of potential uses for this dataset including:

- text mining to find common themes in model cards
- analysis of the model card format/content
- topic modelling of model cards
- analysis of the model card metadata
- training language models on model cards
- building a recommender system for model cards
- building a search engine for model cards

### Out-of-Scope Use

[More Information Needed]

## Dataset Structure

This dataset has a single split.

## Dataset Creation

### Curation Rationale

The dataset was created to assist people in working with model cards. In particular, it was created to support research in the area of model cards and their use. It is possible to use the Hugging Face Hub API or client library to download model cards, and this option may be preferable if you have a very specific use case or require a different format.

### Source Data

The source data is `README.md` files for models hosted on the Hugging Face Hub. We do not include any other supplementary files that may be included in the model card directory.

#### Data Collection and Processing

The data is downloaded using a CRON job on a daily basis.

#### Who are the source data producers?

The source data producers are the creators of the model cards on the Hugging Face Hub. This includes a broad variety of people from the community, ranging from large companies to individual researchers. We do not gather any information about who created the model card in this repository, although this information can be gathered from the Hugging Face Hub API.

### Annotations [optional]

There are no additional annotations in this dataset beyond the model card content.

#### Annotation process

N/A

#### Who are the annotators?

N/A

#### Personal and Sensitive Information

We make no effort to anonymize the data. Whilst we don't expect the majority of model cards to contain personal or sensitive information, it is possible that some model cards may contain this information. Model cards may also link to websites or email addresses.

## Bias, Risks, and Limitations

Model cards are created by the community and we do not have any control over the content of the model cards. We do not review the content of the model cards and we do not make any claims about the accuracy of the information in the model cards. Some model cards will themselves discuss bias and sometimes this is done by providing examples of bias in either the training data or the responses provided by the model. As a result this dataset may contain examples of bias. Whilst we do not directly download any images linked to the model cards, some model cards may include images. Some of these images may not be suitable for all audiences.

### Recommendations

## Citation

No formal citation is required for this dataset but if you use this dataset in your work, please include a link to this dataset page.

## Dataset Card Authors

[@davanstrien](https://huggingface.co/davanstrien)

## Dataset Card Contact

[@davanstrien](https://huggingface.co/davanstrien)
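
The card notes that the source data is each model's `README.md` and that the Hub client library can be used to download cards directly. As a rough illustration of working with one card, the sketch below separates the leading YAML metadata block from the markdown body using only the standard library. `split_front_matter` and `fetch_card` are hypothetical helper names; `fetch_card` assumes the third-party `huggingface_hub` package is installed and uses its `ModelCard.load` call. Whether every `card` string in this dataset begins with a `---` block is an assumption, so the splitter falls back to returning the whole text as the body.

```python
from typing import Tuple


def split_front_matter(card: str) -> Tuple[str, str]:
    """Split a README.md string into (yaml_metadata, markdown_body).

    Returns an empty metadata string when the card has no leading,
    properly closed '---' block.
    """
    lines = card.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", card
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            metadata = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return metadata, body
    # Opening delimiter without a closing one: treat everything as body.
    return "", card


def fetch_card(repo_id: str) -> str:
    """Download one model card's full text from the Hub.

    Assumes `huggingface_hub` is installed and network access is available.
    """
    from huggingface_hub import ModelCard  # lazy import: third-party dependency

    return ModelCard.load(repo_id).content
```

The metadata string can then be passed to a YAML parser for the metadata analyses listed under Uses. If `huggingface_hub` is available, its `ModelCard` class also exposes the parsed metadata and body directly, which may be preferable to a hand-written splitter.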

Provider:

librarian-bots

Summary of Original Information

Dataset Card Overview

Dataset Description

Dataset Details

Dataset name: Hugging Face Hub Model Cards with Embeddings
Dataset size category: 100K<n<1M
Task category: text retrieval
Dataset features:
- modelId: string
- author: string
- last_modified: timestamp (microseconds, UTC)
- downloads: 64-bit integer
- likes: 64-bit integer
- library_name: string
- tags: sequence of strings
- pipeline_tag: string
- createdAt: timestamp (microseconds, UTC)
- card: string
- embedding: sequence of 32-bit floats
Dataset splits:
- train: 442651 examples, 2104883666.6585019 bytes in total (about 2.1 GB)
Download size: 1243809305 bytes (about 1.2 GB)
Dataset size: 2104883666.6585019 bytes

Dataset Configuration

Config name: default
- Data files:
  - split: train
  - path: data/train-*

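Dataset Creation

Data source: README.md files of model cards on the Hugging Face Hub
Data collection and processing: data is downloaded daily by a CRON job
Data producers: the creators of the model cards, a broad range of community members from large companies to individual researchers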

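Dataset Uses

Potential uses:
- Text mining to find common themes in model cards
- Analysis of the model card format/content
- Topic modelling of model cards
- Analysis of the model card metadata
- Training language models on model cards
- Building a recommender system for model cards
- Building a search engine for model cards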

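Dataset Limitations

Dataset content: model cards are created by the community and their content is not controlled; they may contain bias or images that are not suitable for all audiences
Personal and sensitive information: the data is not anonymized, and model cards may contain personal or sensitive information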

Dataset Authors and Contact

Dataset card author: @davanstrien
Dataset card contact: @davanstrien
