five

davanstrien/hub-card-prompts

收藏
Hugging Face2026-04-15 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/davanstrien/hub-card-prompts
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: messages list: - name: content dtype: string - name: role dtype: string - name: prompt dtype: string - name: id dtype: string - name: kind dtype: string splits: - name: train num_examples: 5000 --- # Hub Card Prompts Training prompts for distilling a Hugging Face card summarisation model (gemma-4-E2B-it student ← gemma-4-31B-it teacher). ## Source Filtered subset of [librarian-bots/model_cards_with_metadata](https://huggingface.co/datasets/librarian-bots/model_cards_with_metadata) and [librarian-bots/dataset_cards_with_metadata](https://huggingface.co/datasets/librarian-bots/dataset_cards_with_metadata). ## Filter rules 1. **Minimum card length**: ≥ 300 characters (drops empty cards, template stubs, and near-empty entries) 2. **Maximum card length**: ≤ 15,000 characters (drops extremely long cards that would dominate token budget) 3. **Deduplication**: unique by (author, first 200 chars of card) — keeps the most-downloaded version when near-duplicates exist 4. **Auto-generated stub removal**: cards starting with `# Model Card for` are dropped 5. **Sampling**: 2,500 model cards + 2,500 dataset cards, randomly shuffled after sorting by downloads descending ## Prompt template ``` You are generating a TL;DR for a Hugging Face {kind} card. Write a single paragraph that covers: - What the {kind} is and what it does - Key technical details (architecture, size, training data if mentioned) - How to use it (load code snippet if available) {kind}: {id} {card} ``` ## Resulting dataset - 5,000 rows total (2,500 model + 2,500 dataset cards) - Each row has a `messages` column (single user-turn chat format) and a `prompt` column (plain text) - Designed for pure on-policy distillation (`lmbda=1.0`) — no assistant completions included
提供机构:
davanstrien
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作