keenanpepper/fifty-thousand-things
Source: Hugging Face. Updated 2025-12-10; indexed 2025-12-20.
Download link: https://hf-mirror.com/datasets/keenanpepper/fifty-thousand-things
Description:
---
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- wikipedia
- topics
- labels
size_categories:
- 10K<n<100K
---
# The Fifty Thousand Things Dataset
This dataset contains 49,637 topics with associated prompts and labels, derived from Wikipedia's Vital Articles Level 5.
## Dataset Structure
Each record contains:
- `original_title`: The original Wikipedia article title
- `prompt`: A conversational prompt asking about the topic
- `labels`: A list of 6-20 alternative phrasings/descriptions of the topic (average: 17)
- `split`: Either "train" or "val" (90%/10% split)
### Example
```json
{
  "original_title": "William Wallace",
  "prompt": "Tell me about William Wallace.",
  "labels": [
    "William Wallace",
    "Scottish knight and independence leader",
    "William Wallace, the Scottish hero who led resistance against English rule in the 1290s",
    "Wallace, victor at the Battle of Stirling Bridge",
    "the Scottish patriot portrayed in Braveheart",
    "William Wallace, Guardian of Scotland during the Wars of Independence"
  ],
  "split": "train"
}
```
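The field list above implies a simple schema, which can be checked before using a record. The following is a minimal sketch, not part of the dataset's tooling: the helper name `validate_record` and the exact checks are assumptions derived only from the field descriptions in this card (four fields, 6-20 string labels, split of "train" or "val").

```python
from typing import Any

# Expected field names and types, taken from the field list in this card.
REQUIRED_FIELDS = {
    "original_title": str,
    "prompt": str,
    "labels": list,
    "split": str,
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")

    labels = record.get("labels")
    if isinstance(labels, list):
        # The card states 6-20 labels per topic.
        if not 6 <= len(labels) <= 20:
            problems.append(f"expected 6-20 labels, got {len(labels)}")
        if not all(isinstance(label, str) for label in labels):
            problems.append("labels should all be strings")

    # The card states the split is either "train" or "val".
    if record.get("split") not in {"train", "val"}:
        problems.append(f"unexpected split: {record.get('split')!r}")
    return problems


# The example record shown in this card.
example = {
    "original_title": "William Wallace",
    "prompt": "Tell me about William Wallace.",
    "labels": [
        "William Wallace",
        "Scottish knight and independence leader",
        "William Wallace, the Scottish hero who led resistance against English rule in the 1290s",
        "Wallace, victor at the Battle of Stirling Bridge",
        "the Scottish patriot portrayed in Braveheart",
        "William Wallace, Guardian of Scotland during the Wars of Independence",
    ],
    "split": "train",
}

print(validate_record(example))  # an empty list means no problems were found
```

A check like this is mainly useful when loading the raw JSONL file manually, where a truncated download would otherwise only surface as a later `KeyError`.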
## Dataset Creation
This dataset was created in six stages:
1. **Source**: Wikipedia Vital Articles Level 5 (50,006 articles)
2. **Generation**: 4 independent runs using Claude Sonnet 4.5 via Anthropic Batch API
3. **Prompt Selection**: Best prompts chosen from multiple generations via LLM evaluation
4. **Label Merging**: Labels from all 4 runs merged and deduplicated
5. **Coherence Filtering**: Each entry scored for label coherence (0-10 scale); only entries scoring 9-10 retained
6. **Train/Val Split**: Random 90%/10% split
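Steps 4 and 5 can be illustrated with a short sketch. This is not the author's pipeline: the card does not say how duplicates were detected, so the code assumes a case-insensitive, whitespace-normalized exact match, and the names `merge_labels` and `passes_coherence_filter` are hypothetical. The coherence score itself came from an LLM evaluation and is treated here as a given integer.

```python
def merge_labels(runs: list[list[str]]) -> list[str]:
    """Merge labels from several generation runs, keeping first-seen order.

    Assumption: two labels are duplicates if they match after collapsing
    whitespace and ignoring case. The real deduplication rule is not
    documented in the card.
    """
    seen = set()
    merged = []
    for labels in runs:
        for label in labels:
            key = " ".join(label.split()).casefold()
            if key and key not in seen:
                seen.add(key)
                merged.append(label.strip())
    return merged


def passes_coherence_filter(score: int, threshold: int = 9) -> bool:
    """Keep an entry only if its coherence score (0-10) is 9 or 10."""
    return score >= threshold


# Four toy runs for one topic, with overlapping labels.
runs = [
    ["William Wallace", "Scottish knight and independence leader"],
    ["william wallace", "Wallace, victor at the Battle of Stirling Bridge"],
    ["William  Wallace ", "the Scottish patriot portrayed in Braveheart"],
    ["Scottish knight and independence leader"],
]

merged = merge_labels(runs)
print(merged)
print(passes_coherence_filter(9), passes_coherence_filter(8))
```

Filtering at a threshold of 9 is consistent with the counts in this card: 50,006 source articles minus the 49,637 retained topics leaves 369 entries that do not appear in the final dataset.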
### Quality Assurance
- **High coherence**: All entries scored 9 or 10 out of 10 for label coherence
- **Rich labels**: 6-20 diverse descriptions per topic (vs. 5 in the Level 4 dataset)
- **Multiple generations**: Labels aggregated from 4 independent generation runs
- **Curated prompts**: Best prompt selected from multiple options using LLM evaluation
## Usage
### Using the datasets library
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("keenanpepper/fifty-thousand-things")

# Access train and validation splits
# (print(dataset) lists the split names that are actually available)
train_data = dataset['train']
val_data = dataset['validation']

# Iterate through the data
for item in train_data:
    print(f"Title: {item['original_title']}")
    print(f"Prompt: {item['prompt']}")
    print(f"Labels: {item['labels'][:3]}...")  # Show first 3 labels
    print(f"Total labels: {len(item['labels'])}")
    break
```
### Manual loading
```python
import json

# Load JSONL format (one JSON object per line)
topics = []
with open('wikipedia_vital_articles_level5_dataset.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        topics.append(json.loads(line))

# Filter by split
train_topics = [t for t in topics if t['split'] == 'train']
val_topics = [t for t in topics if t['split'] == 'val']

print(f"Train: {len(train_topics)} topics")
print(f"Val: {len(val_topics)} topics")
```
## Dataset Statistics
- **Total topics**: 49,637
- **Train split**: 44,673 (90%)
- **Validation split**: 4,964 (10%)
- **Labels per topic**: 6-20 (avg: 17, median: 17)
- **Source**: Wikipedia Vital Articles Level 5
- **Quality threshold**: Coherence score ≥ 9/10
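The figures above can be recomputed from the loaded records. The sketch below is an illustration under stated assumptions: `summarize` is a hypothetical helper, and the three toy records stand in for the real data, so the printed numbers are not the dataset's statistics.

```python
from collections import Counter
from statistics import mean, median


def summarize(records):
    """Compute split sizes and label-count statistics for a list of records."""
    label_counts = [len(r["labels"]) for r in records]
    split_counts = Counter(r["split"] for r in records)
    return {
        "total": len(records),
        "splits": dict(split_counts),
        "labels_min": min(label_counts),
        "labels_max": max(label_counts),
        "labels_mean": mean(label_counts),
        "labels_median": median(label_counts),
    }


# Toy records (placeholders, not real dataset entries).
toy_records = [
    {"labels": ["a"] * 6, "split": "train"},
    {"labels": ["b"] * 17, "split": "train"},
    {"labels": ["c"] * 20, "split": "val"},
]

summary = summarize(toy_records)
print(summary)

# Consistency check on the split sizes reported in this card.
assert 44_673 + 4_964 == 49_637
```

Run against the full list of records (for example the `topics` list from the manual loading snippet), the same function should reproduce the totals listed above if the file is complete.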
## Comparison with Level 4 Dataset
This dataset is an expansion of [ten-thousand-things](https://huggingface.co/datasets/keenanpepper/ten-thousand-things):
| Feature | Level 4 (ten-thousand-things) | Level 5 (fifty-thousand-things) |
|---------|------------------------------|--------------------------------|
| Topics | 10,008 | 49,637 |
| Labels per topic | 5 | 6-20 (avg: 17) |
| Generation runs | 1 | 4 (merged) |
| Quality filtering | None | Coherence scoring ≥9/10 |
| Prompt selection | Single generation | LLM-evaluated best prompts |
| Train/val split | Single split | 90%/10% split |
## Use Cases
- **Contrastive learning**: Training activation vectors for topic steering
- **Topic modeling**: Multi-label topic classification
- **Semantic similarity**: Learning different phrasings of the same concept
- **Knowledge base construction**: Building topic ontologies
- **Language model evaluation**: Testing topic recognition capabilities
- **Few-shot learning**: Using rich label sets for prompt engineering
## Intended Use
This dataset is designed for training language models to recognize and generate diverse descriptions of the same topic. Because every entry passed the coherence filter (score of 9 or 10), the labels are particularly suitable for:
- Contrastive activation vector generation
- Topic steering and control in language models
- Semantic similarity and retrieval tasks
- Multi-label text classification
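For the semantic similarity use case, one possible way to turn label sets into training examples is to treat two labels of the same topic as a positive pair and a label of another topic as a negative. The sketch below is an assumption about how the data could be used, not a method described by the author; `positive_pairs` and `sample_triplet` are hypothetical names, and the second record is a placeholder.

```python
import random
from itertools import combinations


def positive_pairs(record):
    """All unordered pairs of labels that describe the same topic."""
    return list(combinations(record["labels"], 2))


def sample_triplet(records, rng):
    """Sample (anchor, positive, negative) label strings.

    Anchor and positive come from one topic, the negative from a
    different topic. Requires at least two records.
    """
    anchor_rec, other_rec = rng.sample(records, 2)
    anchor, positive = rng.sample(anchor_rec["labels"], 2)
    negative = rng.choice(other_rec["labels"])
    return anchor, positive, negative


records = [
    {
        "original_title": "William Wallace",
        "labels": [
            "William Wallace",
            "Scottish knight and independence leader",
            "Wallace, victor at the Battle of Stirling Bridge",
        ],
    },
    {
        # Placeholder topic for illustration only.
        "original_title": "Topic B",
        "labels": ["topic B label 1", "topic B label 2"],
    },
]

rng = random.Random(0)  # fixed seed so runs are repeatable
print(len(positive_pairs(records[0])), "positive pairs for the first topic")
print(sample_triplet(records, rng))
```

With an average of 17 labels per topic, each topic yields roughly 136 positive pairs (17 choose 2), so sampling pairs is usually more practical than enumerating all of them.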
## Limitations
- Topics are limited to Wikipedia Vital Articles Level 5
- Labels are generated by Claude Sonnet 4.5 and may contain biases or inaccuracies
- English language only
- Some niche or technical topics may have less diverse label sets
## License
MIT License
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{fifty_thousand_things,
  author = {Keenan Pepper},
  title = {The Fifty Thousand Things: Wikipedia Vital Articles Level 5 with Multi-Label Descriptions},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/keenanpepper/fifty-thousand-things}
}
```
## Acknowledgments
- Source data: [Wikipedia Vital Articles](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/5)
- Generation: Claude Sonnet 4.5 via [Anthropic Batch API](https://www.anthropic.com/api)
- Quality evaluation: LLM-based coherence scoring and prompt selection
Provider: keenanpepper