robworks-software/jeopardy-clues
Source: Hugging Face. Updated: 2026-03-18. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/robworks-software/jeopardy-clues
Resource description:
---
license: other
license_name: fair-use-research
license_link: https://en.wikipedia.org/wiki/Fair_use
language:
- en
tags:
- jeopardy
- trivia
- question-answering
- quiz
- game-show
- nlp
- text-classification
- knowledge-base
- tv-show
- general-knowledge
- tournament-of-champions
- education
pretty_name: "Jeopardy! Clue Dataset"
size_categories:
- 100K<n<1M
task_categories:
- question-answering
- text-classification
- text-generation
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: validation
    path: data/validation-*.parquet
  - split: test
    path: data/test-*.parquet
dataset_info:
  features:
  - name: clue_id
    dtype: string
  - name: air_date
    dtype: string
  - name: season
    dtype: int64
  - name: episode_id
    dtype: int64
  - name: round
    dtype: string
  - name: category
    dtype: string
  - name: category_normalized
    dtype: string
  - name: value
    dtype: int64
  - name: daily_double
    dtype: bool
  - name: clue_text
    dtype: string
  - name: answer
    dtype: string
  - name: clue_order
    dtype: int64
  - name: category_frequency
    dtype: int64
  - name: is_repeat_clue
    dtype: bool
  - name: repeat_clue_ids
    sequence:
      dtype: string
  - name: topic_tags
    sequence:
      dtype: string
  - name: answer_word_count
    dtype: int64
  - name: clue_word_count
    dtype: int64
  - name: notes
    dtype: string
  - name: sources
    sequence:
      dtype: string
  - name: source_conflicts
    dtype: string
  splits:
  - name: train
    num_examples: 482857
  - name: validation
    num_examples: 42605
  - name: test
    num_examples: 42606
---
# Jeopardy! Clue Dataset
A comprehensive, deduplicated Jeopardy! clue dataset — **568,068 unique clues** spanning all 41 seasons (1984-2025), with **99.8% episode ID coverage** across regular season, Tournament of Champions, Teen/Kids Tournaments, Teachers Tournaments, Celebrity Jeopardy, College Championships, Invitationals, and all special events.
Enriched with category analytics, repeat detection, difficulty estimates, NLP topic classification, and tournament/event metadata. Built by reconciling three sources (jwolle1, HuggingFace, J-Archive) with multi-pass deduplication and full provenance tracking.
## Dataset Summary
| Metric | Value |
|--------|-------|
| Total clues | 568,068 |
| Seasons covered | 1-41 (1984-2025) |
| Unique episode dates | 9,146 |
| Episode ID coverage | 99.8% |
| Unique categories | 59,213 |
| Topic-tagged clues | 212,048 (37%) |
| Repeat clues detected | 11,630 |
| Daily Doubles | 25,712+ |
| Sources reconciled | 3 (jwolle1, HuggingFace, J-Archive) |
## Special Event & Tournament Coverage
| Event | Episodes |
|-------|----------|
| Tournament of Champions | 778 |
| College Championship | 497 |
| Teen Tournament | 463 |
| Celebrity Jeopardy! | 249 |
| Teachers Tournament | 210 |
| Jeopardy! Masters | 157 |
| Kids Week | 109 |
| Second Chance | 99 |
| Invitational | 97 |
| Battle of the Decades / Bay Area | 64 |
| Power Players Week | 49 |
| Million Dollar Celebrity | 49 |
| High School Reunion | 30 |
| Greatest of All Time | 26 |
| All-Star Games | 19 |
| Professors Tournament | 16 |
| Super Jeopardy! | 14 |
| IBM Watson Challenge | 2 |
| Unaired Trebek Pilots (1983-84) | 2 |
Tournament and event metadata is available in the `notes` field for each clue.
## Splits
The data is split at random in an 85/7.5/7.5 ratio, stratified by season, so each split has proportional representation from all 41 seasons and the full date range (1984-2025):
| Split | Examples | % |
|-------|----------|---|
| `train` | 482,857 | 85.0% |
| `validation` | 42,605 | 7.5% |
| `test` | 42,606 | 7.5% |
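A season-stratified random split like the one described above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the function name `stratified_split`, the seed, and the toy records are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, key="season", fractions=(0.85, 0.075, 0.075), seed=0):
    """Split rows into train/validation/test, shuffling each stratum separately."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[row[key]].append(row)

    splits = {"train": [], "validation": [], "test": []}
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        # Whatever remains in the stratum goes to the test split.
        splits["test"].extend(group[n_train + n_val:])
    return splits


# Toy data: three seasons with 200 hypothetical clues each.
toy = [{"season": s, "clue_id": f"{s}-{i}"} for s in (1, 2, 3) for i in range(200)]
parts = stratified_split(toy)
print({name: len(rows) for name, rows in parts.items()})
```

Because every season is shuffled and cut independently, each split keeps roughly the same share of every season, which is the property the table above relies on.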
## Features
| Field | Type | Description |
|-------|------|-------------|
| `clue_id` | `string` | Deterministic SHA-256 hash of composite key |
| `air_date` | `string` | Episode air date (`YYYY-MM-DD`) |
| `season` | `int` | Jeopardy! season number (1-41) |
| `episode_id` | `int` | Show number (99.8% coverage via J-Archive index) |
| `round` | `string` | `jeopardy`, `double_jeopardy`, `final_jeopardy`, or `tiebreaker` |
| `category` | `string` | Category name as displayed on the show |
| `category_normalized` | `string` | Lowercased, whitespace-normalized for grouping |
| `value` | `int` | Dollar value (null for Final Jeopardy / some Daily Doubles) |
| `daily_double` | `bool` | Whether this clue was a Daily Double |
| `clue_text` | `string` | The clue/prompt shown to contestants |
| `answer` | `string` | The correct response |
| `clue_order` | `int` | Position within category (1-5 for regular rounds, 1 for FJ) |
| `category_frequency` | `int` | How many episodes this category has appeared in |
| `is_repeat_clue` | `bool` | Whether a highly similar clue appeared in an earlier episode |
| `repeat_clue_ids` | `list[string]` | IDs of earlier similar clues (TF-IDF cosine >= 0.85) |
| `topic_tags` | `list[string]` | NLP-derived topic labels + `difficulty:N` (1-5 scale) |
| `answer_word_count` | `int` | Word count of answer |
| `clue_word_count` | `int` | Word count of clue text |
| `notes` | `string` | Tournament/event metadata (e.g. "2022 Tournament of Champions semifinal game 2") |
| `sources` | `list[string]` | Which source datasets provided this record |
| `source_conflicts` | `string` | Field-level disagreements between sources (JSON) |
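The `repeat_clue_ids` threshold (TF-IDF cosine >= 0.85) can be illustrated with a small pure-Python sketch. The tokenizer, the smoothed IDF formula, and the function names are assumptions made for illustration; the card does not document the exact weighting used to build the dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse {term: weight} vector per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed IDF: an assumption, not the dataset's documented formula.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_repeats(clues, threshold=0.85):
    """clues: list of (clue_id, clue_text) in air-date order.

    Returns {clue_id: [ids of earlier clues at or above the threshold]}.
    """
    vectors = tfidf_vectors([text for _, text in clues])
    repeats = {}
    for i, (clue_id, _) in enumerate(clues):
        earlier = [clues[j][0] for j in range(i)
                   if cosine(vectors[i], vectors[j]) >= threshold]
        if earlier:
            repeats[clue_id] = earlier
    return repeats


clues = [
    ("a", "This planet is known as the Red Planet"),
    ("b", "He wrote Moby-Dick"),
    ("c", "This planet is known as the red planet"),
]
repeats = find_repeats(clues)
print(repeats)
```

The all-pairs comparison is quadratic in the number of clues, so it only suits small samples; a run over the full dataset would need blocking or an approximate nearest-neighbour index.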
## Topic Tags
Clues are classified into 17 topic categories using keyword and regex matching on both category names and clue text:
`science` `biology` `space` `history` `war` `presidents` `geography` `us_states` `literature` `word_play` `movies` `television` `music` `sports` `food` `religion` `art`
Each clue also has a `difficulty:N` tag (1=easiest, 5=hardest) estimated from dollar value, round, and Daily Double status.
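As a rough sketch of how rule-based tagging and the `difficulty:N` convention fit together, the example below tags a clue with regex patterns and then separates topic labels from the difficulty tag. The patterns in `TOPIC_PATTERNS` are hypothetical and cover only three of the 17 topics; the dataset's real keyword lists are not reproduced here.

```python
import re

# Hypothetical keyword patterns, for illustration only.
TOPIC_PATTERNS = {
    "space": re.compile(r"\b(planet|galaxy|astronaut|nasa|orbit)\b", re.IGNORECASE),
    "presidents": re.compile(r"\b(president|white house|inaugurat\w*)\b", re.IGNORECASE),
    "music": re.compile(r"\b(song|album|composer|symphony|band)\b", re.IGNORECASE),
}


def tag_topics(category, clue_text):
    """Match patterns against both the category name and the clue text."""
    text = f"{category} {clue_text}"
    return [topic for topic, pattern in TOPIC_PATTERNS.items() if pattern.search(text)]


def split_tags(topic_tags):
    """Separate topic labels from the `difficulty:N` tag."""
    topics, difficulty = [], None
    for tag in topic_tags:
        if tag.startswith("difficulty:"):
            difficulty = int(tag.split(":", 1)[1])
        else:
            topics.append(tag)
    return topics, difficulty


print(tag_topics("THE SOLAR SYSTEM", "This planet has the Great Red Spot"))
print(split_tags(["science", "space", "difficulty:4"]))
```

`split_tags` is the more practical half when working with the published data, since `topic_tags` mixes both kinds of label in one list.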
## Top 10 Categories
| Category | Episode Appearances |
|----------|---------------------|
| science | 355 |
| american history | 337 |
| business & industry | 323 |
| literature | 320 |
| history | 317 |
| word origins | 305 |
| sports | 298 |
| world geography | 296 |
| potpourri | 283 |
| religion | 262 |
## Usage
```python
import collections

from datasets import load_dataset

dataset = load_dataset("robworks-software/jeopardy-clues")

# Browse training data
print(f"Training clues: {len(dataset['train'])}")
print(dataset["train"][0])

# Filter by topic
science_clues = dataset["train"].filter(lambda x: "science" in x["topic_tags"])

# Get Final Jeopardy clues
final_jeopardy = dataset["train"].filter(lambda x: x["round"] == "final_jeopardy")

# Find hardest clues (difficulty 5)
hardest = dataset["train"].filter(lambda x: "difficulty:5" in x["topic_tags"])

# Filter Tournament of Champions clues (notes may be null, hence the `or ""`)
toc = dataset["train"].filter(lambda x: "Tournament of Champions" in (x["notes"] or ""))
print(f"ToC clues: {len(toc)}")

# Category frequency analysis
cats = collections.Counter(dataset["train"]["category_normalized"])
print("Most common categories:", cats.most_common(10))
```
## Data Pipeline
1. **Ingest**: Three sources — jwolle1's GitHub release (529,939 regular + 21,592 kids/teen + 7,907 special events), openaccess-ai-collective/jeopardy from HuggingFace (216,930 clues), J-Archive scrape of incomplete episodes (5,745 clues)
2. **Reconcile**: Two-pass deduplication — composite-key matching across 3 sources, then text-level dedup to catch clues with identical text but different metadata across sources. 568,068 unique clues after removing 3,246 cross-source duplicates
3. **Episode ID Mapping**: J-Archive season index pages (41 seasons, 9,124 games) mapped show numbers to 99.8% of clues
4. **Enrich**: Category frequency stats (59,238 categories), TF-IDF repeat detection, value-based difficulty estimation, regex/keyword topic tagging (17 topics)
5. **Export**: Season-stratified train/validation/test splits as Parquet
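The two-pass reconciliation in step 2 can be sketched as follows. The card does not list which fields form the composite key or how the text-level pass is scoped, so the key fields, the `(air_date, clue_text)` text key, and all function names below are assumptions.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def composite_key(record):
    """SHA-256 over an assumed set of key fields."""
    parts = (record["air_date"], record["round"],
             normalize(record["category"]), normalize(record["clue_text"]))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def merge_sources(kept, duplicate):
    """Record provenance from the duplicate on the record that is kept."""
    for source in duplicate["sources"]:
        if source not in kept["sources"]:
            kept["sources"].append(source)


def reconcile(records):
    # Pass 1: exact composite-key matches across sources.
    by_key = {}
    for rec in records:
        key = composite_key(rec)
        if key in by_key:
            merge_sources(by_key[key], rec)
        else:
            by_key[key] = {**rec, "clue_id": key, "sources": list(rec["sources"])}

    # Pass 2: same air date and clue text but different metadata.
    # Scoping by air date (an assumption) keeps genuine repeats from later episodes.
    by_text = {}
    for rec in by_key.values():
        text_key = (rec["air_date"], normalize(rec["clue_text"]))
        if text_key in by_text:
            merge_sources(by_text[text_key], rec)
        else:
            by_text[text_key] = rec
    return list(by_text.values())


base = {"air_date": "2004-12-31", "round": "jeopardy", "category": "HISTORY",
        "clue_text": "The first U.S. president", "answer": "Washington"}
records = [
    {**base, "sources": ["jwolle1"]},
    {**base, "category": "History ", "sources": ["huggingface"]},      # merged in pass 1
    {**base, "round": "double_jeopardy", "sources": ["j-archive"]},    # merged in pass 2
    {**base, "air_date": "2010-01-01", "sources": ["jwolle1"]},        # later repeat, kept
]
unique = reconcile(records)
print(len(unique), unique[0]["sources"])
```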
## Known Gaps
- **8 episodes with <50 clues**: Mostly early-season episodes (1985-1989) and special formats (IBM Watson Challenge) where some clues weren't preserved
- **Season 36**: Has only ~190 episodes because of the COVID-19 production shutdown (March-June 2020); this is not a data gap
- **0.2% missing episode IDs**: ~1,200 clues (mainly from pre-season pilots and special events not in J-Archive's season index)
- **`clue_order` defaults to 1**: Applies to the many clues whose position within the category was not recorded in the source data
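One way to locate the sparse episodes listed above is to count clues per air date. The helper name `sparse_episodes` is hypothetical, and grouping by `air_date` rather than `episode_id` is a choice made here because some clues lack an episode ID. Since the published splits are random, the counts are only meaningful when all three splits are combined first.

```python
from collections import Counter


def sparse_episodes(rows, min_clues=50):
    """Return {air_date: clue_count} for dates with fewer than min_clues clues."""
    counts = Counter(row["air_date"] for row in rows)
    return {date: n for date, n in sorted(counts.items()) if n < min_clues}


# Toy rows: a full 61-clue game and a hypothetical 12-clue partial game.
rows = ([{"air_date": "1990-01-01"} for _ in range(61)]
        + [{"air_date": "1985-03-04"} for _ in range(12)])
sparse = sparse_episodes(rows)
print(sparse)
```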
## Use Cases
- **Question answering**: Train QA models on trivia across dozens of knowledge domains
- **Text classification**: Predict categories, difficulty, or topics from clue text
- **Information retrieval**: Build trivia search engines with category and difficulty filtering
- **NLP research**: Study question phrasing patterns, category evolution over 40+ years
- **Educational tools**: Generate quiz content across curriculum-aligned topics
- **Analytics**: Analyze trends in categories, difficulty, repeat patterns, and tournament structure over time
- **Game AI**: Build Jeopardy-playing agents with difficulty-aware strategy
## Limitations
- Clue text and answers are the intellectual property of Jeopardy Productions, Inc.
- Some clues may have minor formatting differences between sources (tracked in `source_conflicts`)
- Topic tags use rule-based classification; ~63% of clues have no topic tag beyond difficulty
- Difficulty estimates are proxy-based (value tier, round) and do not reflect actual contestant performance
- Daily Double identification depends on source data quality
## Legal Notice
**Jeopardy!** is a registered trademark of Jeopardy Productions, Inc. All question content is the property of Jeopardy Productions, Inc. and Sony Pictures Television.
This dataset is compiled from publicly available, community-maintained sources for **non-commercial research and educational purposes** under fair use ([17 U.S.C. 107](https://www.law.cornell.edu/uscode/text/17/107)). The compilation, enrichment, and statistical analysis represent original work.
### Sources & Attribution
- [jwolle1/jeopardy_clue_dataset](https://github.com/jwolle1/jeopardy_clue_dataset) - Primary source (seasons 1-41, all tournaments and special events)
- [openaccess-ai-collective/jeopardy](https://huggingface.co/datasets/openaccess-ai-collective/jeopardy) - Secondary source for cross-validation
- [J-Archive](https://j-archive.com) - Fan-maintained episode archive, used for episode ID mapping and gap-filling
This dataset should not be used for commercial purposes without appropriate licensing from Jeopardy Productions, Inc.
Provider:
robworks-software



