
robworks-software/jeopardy-clues

Source: Hugging Face. Updated 2026-03-18. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/robworks-software/jeopardy-clues
Resource description:
---
license: other
license_name: fair-use-research
license_link: https://en.wikipedia.org/wiki/Fair_use
language:
- en
tags:
- jeopardy
- trivia
- question-answering
- quiz
- game-show
- nlp
- text-classification
- knowledge-base
- tv-show
- general-knowledge
- tournament-of-champions
- education
pretty_name: "Jeopardy! Clue Dataset"
size_categories:
- 100K<n<1M
task_categories:
- question-answering
- text-classification
- text-generation
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: validation
    path: data/validation-*.parquet
  - split: test
    path: data/test-*.parquet
dataset_info:
  features:
  - name: clue_id
    dtype: string
  - name: air_date
    dtype: string
  - name: season
    dtype: int64
  - name: episode_id
    dtype: int64
  - name: round
    dtype: string
  - name: category
    dtype: string
  - name: category_normalized
    dtype: string
  - name: value
    dtype: int64
  - name: daily_double
    dtype: bool
  - name: clue_text
    dtype: string
  - name: answer
    dtype: string
  - name: clue_order
    dtype: int64
  - name: category_frequency
    dtype: int64
  - name: is_repeat_clue
    dtype: bool
  - name: repeat_clue_ids
    sequence:
      dtype: string
  - name: topic_tags
    sequence:
      dtype: string
  - name: answer_word_count
    dtype: int64
  - name: clue_word_count
    dtype: int64
  - name: notes
    dtype: string
  - name: sources
    sequence:
      dtype: string
  - name: source_conflicts
    dtype: string
  splits:
  - name: train
    num_examples: 482857
  - name: validation
    num_examples: 42605
  - name: test
    num_examples: 42606
---

# Jeopardy! Clue Dataset

A comprehensive, deduplicated Jeopardy! clue dataset: **568,068 unique clues** spanning all 41 seasons (1984-2025), with **99.8% episode ID coverage** across regular season, Tournament of Champions, Teen/Kids Tournaments, Teachers Tournaments, Celebrity Jeopardy, College Championships, Invitationals, and all special events. Enriched with category analytics, repeat detection, difficulty estimates, NLP topic classification, and tournament/event metadata.
Built by reconciling three sources (jwolle1, HuggingFace, J-Archive) with multi-pass deduplication and full provenance tracking.

## Dataset Summary

| Metric | Value |
|--------|-------|
| Total clues | 568,068 |
| Seasons covered | 1-41 (1984-2025) |
| Unique episode dates | 9,146 |
| Episode ID coverage | 99.8% |
| Unique categories | 59,213 |
| Topic-tagged clues | 212,048 (37%) |
| Repeat clues detected | 11,630 |
| Daily Doubles | 25,712+ |
| Sources reconciled | 3 (jwolle1, HuggingFace, J-Archive) |

## Special Event & Tournament Coverage

| Event | Episodes |
|-------|----------|
| Tournament of Champions | 778 |
| College Championship | 497 |
| Teen Tournament | 463 |
| Celebrity Jeopardy! | 249 |
| Teachers Tournament | 210 |
| Jeopardy! Masters | 157 |
| Kids Week | 109 |
| Second Chance | 99 |
| Invitational | 97 |
| Battle of the Decades / Bay Area | 64 |
| Power Players Week | 49 |
| Million Dollar Celebrity | 49 |
| High School Reunion | 30 |
| Greatest of All Time | 26 |
| All-Star Games | 19 |
| Professors Tournament | 16 |
| Super Jeopardy! | 14 |
| IBM Watson Challenge | 2 |
| Unaired Trebek Pilots (1983-84) | 2 |

Tournament and event metadata is available in the `notes` field for each clue.

## Splits

Stratified random split (85/7.5/7.5), stratified by season so each split has proportional representation from all 41 seasons and the full date range (1984-2025):

| Split | Examples | % |
|-------|----------|---|
| `train` | 482,857 | 85.0% |
| `validation` | 42,605 | 7.5% |
| `test` | 42,606 | 7.5% |

## Features

| Field | Type | Description |
|-------|------|-------------|
| `clue_id` | `string` | Deterministic SHA-256 hash of composite key |
| `air_date` | `string` | Episode air date (`YYYY-MM-DD`) |
| `season` | `int` | Jeopardy! season number (1-41) |
| `episode_id` | `int` | Show number (99.8% coverage via J-Archive index) |
| `round` | `string` | `jeopardy`, `double_jeopardy`, `final_jeopardy`, or `tiebreaker` |
| `category` | `string` | Category name as displayed on the show |
| `category_normalized` | `string` | Lowercased, whitespace-normalized for grouping |
| `value` | `int` | Dollar value (null for Final Jeopardy / some Daily Doubles) |
| `daily_double` | `bool` | Whether this clue was a Daily Double |
| `clue_text` | `string` | The clue/prompt shown to contestants |
| `answer` | `string` | The correct response |
| `clue_order` | `int` | Position within category (1-5 for regular rounds, 1 for FJ) |
| `category_frequency` | `int` | How many episodes this category has appeared in |
| `is_repeat_clue` | `bool` | Whether a highly similar clue appeared in an earlier episode |
| `repeat_clue_ids` | `list[string]` | IDs of earlier similar clues (TF-IDF cosine >= 0.85) |
| `topic_tags` | `list[string]` | NLP-derived topic labels + `difficulty:N` (1-5 scale) |
| `answer_word_count` | `int` | Word count of answer |
| `clue_word_count` | `int` | Word count of clue text |
| `notes` | `string` | Tournament/event metadata (e.g. "2022 Tournament of Champions semifinal game 2") |
| `sources` | `list[string]` | Which source datasets provided this record |
| `source_conflicts` | `string` | Field-level disagreements between sources (JSON) |

## Topic Tags

Clues are classified into 17 topic categories using keyword and regex matching on both category names and clue text:

`science` `biology` `space` `history` `war` `presidents` `geography` `us_states` `literature` `word_play` `movies` `television` `music` `sports` `food` `religion` `art`

Each clue also has a `difficulty:N` tag (1=easiest, 5=hardest) estimated from dollar value, round, and Daily Double status.

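
Because the `difficulty:N` label shares the `topic_tags` list with the topic labels, downstream code will often want to separate the two. The following is a minimal sketch, assuming tags are formatted exactly as described above; `split_tags` is a hypothetical helper name, not part of the dataset or of the `datasets` library.

```python
def split_tags(topic_tags):
    """Separate topic labels from the `difficulty:N` tag.

    Returns (topics, difficulty), where difficulty is an int
    or None when no well-formed difficulty tag is present.
    """
    topics = []
    difficulty = None
    for tag in topic_tags or []:  # tolerate a missing/None tag list
        if tag.startswith("difficulty:"):
            suffix = tag.split(":", 1)[1]
            if suffix.isdigit():
                difficulty = int(suffix)
        else:
            topics.append(tag)
    return topics, difficulty


topics, difficulty = split_tags(["science", "space", "difficulty:4"])
print(topics, difficulty)
```

A helper like this could be applied per row with `Dataset.map` to add a numeric difficulty column, which makes range filters (for example, difficulty 4 and above) simpler than string matching on tags.
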

## Top 10 Categories

| Category | Episode Appearances |
|----------|---------------------|
| science | 355 |
| american history | 337 |
| business & industry | 323 |
| literature | 320 |
| history | 317 |
| word origins | 305 |
| sports | 298 |
| world geography | 296 |
| potpourri | 283 |
| religion | 262 |

## Usage

```python
import collections

from datasets import load_dataset

dataset = load_dataset("robworks-software/jeopardy-clues")

# Browse training data
print(f"Training clues: {len(dataset['train'])}")
print(dataset["train"][0])

# Filter by topic
science_clues = dataset["train"].filter(lambda x: "science" in x["topic_tags"])

# Get Final Jeopardy clues
final_jeopardy = dataset["train"].filter(lambda x: x["round"] == "final_jeopardy")

# Find hardest clues (difficulty 5)
hardest = dataset["train"].filter(lambda x: "difficulty:5" in x["topic_tags"])

# Filter Tournament of Champions clues (`notes` may be null, hence the `or ""`)
toc = dataset["train"].filter(lambda x: "Tournament of Champions" in (x["notes"] or ""))
print(f"ToC clues: {len(toc)}")

# Category frequency analysis
cats = collections.Counter(dataset["train"]["category_normalized"])
print("Most common categories:", cats.most_common(10))
```

## Data Pipeline

1. **Ingest**: Three sources: jwolle1's GitHub release (529,939 regular + 21,592 kids/teen + 7,907 special events), openaccess-ai-collective/jeopardy from HuggingFace (216,930 clues), and a J-Archive scrape of incomplete episodes (5,745 clues)
2. **Reconcile**: Two-pass deduplication: composite-key matching across the 3 sources, then text-level dedup to catch clues with identical text but different metadata across sources. 568,068 unique clues remain after removing 3,246 cross-source duplicates
3. **Episode ID Mapping**: J-Archive season index pages (41 seasons, 9,124 games) mapped show numbers to 99.8% of clues
4. **Enrich**: Category frequency stats (59,238 categories), TF-IDF repeat detection, value-based difficulty estimation, regex/keyword topic tagging (17 topics)
5. **Export**: Season-stratified train/validation/test splits as Parquet

## Known Gaps

- **8 episodes with <50 clues**: Mostly early-season episodes (1985-1989) and special formats (IBM Watson Challenge) where some clues weren't preserved
- **Season 36**: ~190 episodes due to the COVID-19 production shutdown (March-June 2020); this is not a data gap
- **0.2% missing episode IDs**: ~1,200 clues (mainly from pre-season pilots and special events not in J-Archive's season index)
- **`clue_order` defaults to 1**: For many clues where position within category was not recorded in the source data

## Use Cases

- **Question answering**: Train QA models on trivia across dozens of knowledge domains
- **Text classification**: Predict categories, difficulty, or topics from clue text
- **Information retrieval**: Build trivia search engines with category and difficulty filtering
- **NLP research**: Study question phrasing patterns, category evolution over 40+ years
- **Educational tools**: Generate quiz content across curriculum-aligned topics
- **Analytics**: Analyze trends in categories, difficulty, repeat patterns, and tournament structure over time
- **Game AI**: Build Jeopardy-playing agents with difficulty-aware strategy

## Limitations

- Clue text and answers are the intellectual property of Jeopardy Productions, Inc.
- Some clues may have minor formatting differences between sources (tracked in `source_conflicts`)
- Topic tags use rule-based classification; ~63% of clues have no topic tag beyond difficulty
- Difficulty estimates are proxy-based (value tier, round) and do not reflect actual contestant performance
- Daily Double identification depends on source data quality

## Legal Notice

**Jeopardy!** is a registered trademark of Jeopardy Productions, Inc. All question content is the property of Jeopardy Productions, Inc. and Sony Pictures Television.
This dataset is compiled from publicly available, community-maintained sources for **non-commercial research and educational purposes** under fair use ([17 U.S.C. 107](https://www.law.cornell.edu/uscode/text/17/107)). The compilation, enrichment, and statistical analysis represent original work.

### Sources & Attribution

- [jwolle1/jeopardy_clue_dataset](https://github.com/jwolle1/jeopardy_clue_dataset) - Primary source (seasons 1-41, all tournaments and special events)
- [openaccess-ai-collective/jeopardy](https://huggingface.co/datasets/openaccess-ai-collective/jeopardy) - Secondary source for cross-validation
- [J-Archive](https://j-archive.com) - Fan-maintained episode archive, used for episode ID mapping and gap-filling

This dataset should not be used for commercial purposes without appropriate licensing from Jeopardy Productions, Inc.

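
As a supplementary note on the `clue_id` field: the Features table describes it as a deterministic SHA-256 hash of a composite key, but does not list the key's fields. The sketch below only illustrates the general technique. The choice of fields (`air_date`, `round`, `category`, `clue_text`), the normalization, and the separator are all assumptions, and `make_clue_id` is a hypothetical name, so its output should not be expected to match the IDs shipped in the dataset.

```python
import hashlib


def make_clue_id(air_date, round_name, category, clue_text):
    """Illustrative deterministic ID from a composite key.

    ASSUMPTION: the real pipeline's key fields and normalization
    are not documented; these were chosen only for illustration.
    """
    parts = [
        air_date.strip(),
        round_name.strip().lower(),
        " ".join(category.lower().split()),  # lowercase, collapse whitespace
        " ".join(clue_text.split()),         # collapse whitespace only
    ]
    # A unit-separator character avoids collisions between
    # ("ab", "c") and ("a", "bc") style field boundaries.
    key = "\x1f".join(parts)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


example_id = make_clue_id(
    "2004-12-31", "jeopardy", "WORLD  GEOGRAPHY", "This river flows through Cairo"
)
print(example_id)
```

Hashing a normalized key rather than assigning sequential integers keeps IDs stable when the sources are re-ingested in a different order, which is presumably what makes cross-source reconciliation and `repeat_clue_ids` references reproducible.
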
Provider:
robworks-software