keenanpepper/fifty-thousand-things

Name: keenanpepper/fifty-thousand-things
Creator: keenanpepper
Published: 2025-12-10 00:00:36
License: No description available

Hugging Face · Updated 2025-12-10 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/keenanpepper/fifty-thousand-things


Description:

---
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- wikipedia
- topics
- labels
size_categories:
- 10K<n<100K
---

# The Fifty Thousand Things Dataset

This dataset contains 49,637 topics with associated prompts and labels, derived from Wikipedia's Vital Articles Level 5.

## Dataset Structure

Each record contains:

- `original_title`: The original Wikipedia article title
- `prompt`: A conversational prompt asking about the topic
- `labels`: A list of 6-20 alternative phrasings/descriptions of the topic (average: 17)
- `split`: Either "train" or "val" (90%/10% split)

### Example

```json
{
  "original_title": "William Wallace",
  "prompt": "Tell me about William Wallace.",
  "labels": [
    "William Wallace",
    "Scottish knight and independence leader",
    "William Wallace, the Scottish hero who led resistance against English rule in the 1290s",
    "Wallace, victor at the Battle of Stirling Bridge",
    "the Scottish patriot portrayed in Braveheart",
    "William Wallace, Guardian of Scotland during the Wars of Independence"
  ],
  "split": "train"
}
```

## Dataset Creation

This dataset was created through a rigorous multi-stage process:

1. **Source**: Wikipedia Vital Articles Level 5 (50,006 articles)
2. **Generation**: 4 independent runs using Claude Sonnet 4.5 via Anthropic Batch API
3. **Prompt Selection**: Best prompts chosen from multiple generations via LLM evaluation
4. **Label Merging**: Labels from all 4 runs merged and deduplicated
5. **Coherence Filtering**: Each entry scored for label coherence (0-10 scale); only entries scoring 9-10 retained
6. **Train/Val Split**: Random 90%/10% split

### Quality Assurance

- **High coherence**: All entries scored 9 or 10 out of 10 for label coherence
- **Rich labels**: 6-20 diverse descriptions per topic (vs. 5 in the Level 4 dataset)
- **Multiple generations**: Labels aggregated from 4 independent generation runs
- **Curated prompts**: Best prompt selected from multiple options using LLM evaluation

## Usage

### Using the datasets library

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("keenanpepper/fifty-thousand-things")

# Access train and validation splits
train_data = dataset['train']
val_data = dataset['validation']

# Iterate through the data
for item in train_data:
    print(f"Title: {item['original_title']}")
    print(f"Prompt: {item['prompt']}")
    print(f"Labels: {item['labels'][:3]}...")  # Show first 3 labels
    print(f"Total labels: {len(item['labels'])}")
    break
```

### Manual loading

```python
import json

# Load JSONL format
topics = []
with open('wikipedia_vital_articles_level5_dataset.jsonl', 'r') as f:
    for line in f:
        topics.append(json.loads(line))

# Filter by split
train_topics = [t for t in topics if t['split'] == 'train']
val_topics = [t for t in topics if t['split'] == 'val']

print(f"Train: {len(train_topics)} topics")
print(f"Val: {len(val_topics)} topics")
```

## Dataset Statistics

- **Total topics**: 49,637
- **Train split**: 44,673 (90%)
- **Validation split**: 4,964 (10%)
- **Labels per topic**: 6-20 (avg: 17, median: 17)
- **Source**: Wikipedia Vital Articles Level 5
- **Quality threshold**: Coherence score ≥ 9/10

## Comparison with Level 4 Dataset

This dataset is an expansion of [ten-thousand-things](https://huggingface.co/datasets/keenanpepper/ten-thousand-things):

| Feature | Level 4 (ten-thousand-things) | Level 5 (fifty-thousand-things) |
|---------|------------------------------|--------------------------------|
| Topics | 10,008 | 49,637 |
| Labels per topic | 5 | 6-20 (avg: 17) |
| Generation runs | 1 | 4 (merged) |
| Quality filtering | None | Coherence scoring ≥9/10 |
| Prompt selection | Single generation | LLM-evaluated best prompts |
| Train/val split | Single split | 90%/10% split |

## Use Cases

- **Contrastive learning**: Training activation vectors for topic steering
- **Topic modeling**: Multi-label topic classification
- **Semantic similarity**: Learning different phrasings of the same concept
- **Knowledge base construction**: Building topic ontologies
- **Language model evaluation**: Testing topic recognition capabilities
- **Few-shot learning**: Using rich label sets for prompt engineering

## Intended Use

This dataset is designed for training language models to recognize and generate diverse descriptions of the same topic. The high-quality, coherent labels make it particularly suitable for:

- Contrastive activation vector generation
- Topic steering and control in language models
- Semantic similarity and retrieval tasks
- Multi-label text classification

## Limitations

- Topics are limited to Wikipedia Vital Articles Level 5
- Labels are generated by Claude Sonnet 4.5 and may contain biases or inaccuracies
- English language only
- Some niche or technical topics may have less diverse label sets

## License

MIT License

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{fifty_thousand_things,
  author = {Keenan Pepper},
  title = {The Fifty Thousand Things: Wikipedia Vital Articles Level 5 with Multi-Label Descriptions},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/keenanpepper/fifty-thousand-things}
}
```

## Acknowledgments

- Source data: [Wikipedia Vital Articles](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/5)
- Generation: Claude Sonnet 4.5 via [Anthropic Batch API](https://www.anthropic.com/api)
- Quality evaluation: LLM-based coherence scoring and prompt selection
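
## Appendix: Sketch of the Merge, Filter, and Split Steps

The dataset card lists label merging, coherence filtering, and a random 90%/10% split as steps 4 to 6 of dataset creation, but does not include the scripts. The following is a minimal sketch of how those three steps could look, not the author's implementation. The helper names (`merge_labels`, `build_records`), the `coherence` and `runs` input fields, the case-insensitive deduplication rule, and the fixed seed are all assumptions made for illustration; only the output field names and the threshold of 9 come from the card.

```python
import random


def merge_labels(runs):
    """Merge label lists from several generation runs, dropping duplicates.

    Assumption: duplicates are detected case-insensitively after collapsing
    whitespace. The dataset card does not say how deduplication was done.
    """
    seen = set()
    merged = []
    for labels in runs:
        for label in labels:
            key = " ".join(label.split()).lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(label.strip())
    return merged


def build_records(entries, min_score=9, val_fraction=0.1, seed=0):
    """Keep entries with coherence >= min_score, then split train/val at random.

    `entries` is a hypothetical intermediate format: each item carries a
    precomputed `coherence` score (0-10) and one label list per run in `runs`.
    """
    kept = [e for e in entries if e["coherence"] >= min_score]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(kept)
    n_val = round(len(kept) * val_fraction)
    records = []
    for i, e in enumerate(kept):
        records.append({
            "original_title": e["original_title"],
            "prompt": e["prompt"],
            "labels": merge_labels(e["runs"]),
            "split": "val" if i < n_val else "train",
        })
    return records


# Toy input: 50 made-up topics, every fifth one scored below the threshold.
entries = [
    {
        "original_title": f"Topic {i}",
        "prompt": f"Tell me about Topic {i}.",
        "coherence": 10 if i % 5 else 7,
        "runs": [[f"Topic {i}", "a toy label"], [f"topic {i}", "another toy label"]],
    }
    for i in range(50)
]
records = build_records(entries)
print(len(records), sum(r["split"] == "val" for r in records))
```

The records produced this way have the same four fields as the published dataset, so the manual-loading filter on `split` shown under Usage would apply to them unchanged.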

Provider:

keenanpepper
