davanstrien/hub-card-prompts
收藏Hugging Face2026-04-15 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/davanstrien/hub-card-prompts
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: messages
list:
- name: content
dtype: string
- name: role
dtype: string
- name: prompt
dtype: string
- name: id
dtype: string
- name: kind
dtype: string
splits:
- name: train
num_examples: 5000
---
# Hub Card Prompts
Training prompts for distilling a Hugging Face card summarisation model (gemma-4-E2B-it student ← gemma-4-31B-it teacher).
## Source
Filtered subset of [librarian-bots/model_cards_with_metadata](https://huggingface.co/datasets/librarian-bots/model_cards_with_metadata) and [librarian-bots/dataset_cards_with_metadata](https://huggingface.co/datasets/librarian-bots/dataset_cards_with_metadata).
## Filter rules
1. **Minimum card length**: ≥ 300 characters (drops empty cards, template stubs, and near-empty entries)
2. **Maximum card length**: ≤ 15,000 characters (drops extremely long cards that would dominate token budget)
3. **Deduplication**: unique by (author, first 200 chars of card) — keeps the most-downloaded version when near-duplicates exist
4. **Auto-generated stub removal**: cards starting with `# Model Card for` are dropped
5. **Sampling**: 2,500 model cards + 2,500 dataset cards, randomly shuffled after sorting by downloads descending
## Prompt template
```
You are generating a TL;DR for a Hugging Face {kind} card. Write a single paragraph that covers:
- What the {kind} is and what it does
- Key technical details (architecture, size, training data if mentioned)
- How to use it (load code snippet if available)
{kind}: {id}
{card}
```
## Resulting dataset
- 5,000 rows total (2,500 model + 2,500 dataset cards)
- Each row has a `messages` column (single user-turn chat format) and a `prompt` column (plain text)
- Designed for pure on-policy distillation (`lmbda=1.0`) — no assistant completions included
提供机构:
davanstrien



