CashlessConsumer/wikipedia-dyk-dataset

Source: Hugging Face. Updated 2026-03-29; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/CashlessConsumer/wikipedia-dyk-dataset
Description:
# Wikipedia DYK (Did You Know) Dataset

<a href="https://huggingface.co/datasets"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue"></a>

A comprehensive database of Wikipedia's "Did You Know" (DYK) entries: interesting facts from Wikipedia's newest content that appeared on the Main Page.

## Quick Start

```python
import duckdb

# Connect to the dataset
conn = duckdb.connect('wikipedia-dyk.duckdb')

# Get a random fact
result = conn.execute("""
    SELECT date, content, primary_article, domain
    FROM dyk_entries
    ORDER BY RANDOM()
    LIMIT 1
""").fetchone()

print(f"{result[0]}: {result[1]}")
conn.close()
```

## Dataset Structure

### Tables

#### `dyk_entries`

| Column | Type | Description |
|--------|------|-------------|
| date | DATE | Date the fact appeared on Wikipedia's Main Page |
| raw_text | TEXT | Original wiki-formatted text of the DYK entry |
| content | TEXT | Clean plain text version of the fact |
| primary_article | VARCHAR | Main Wikipedia article the fact is about |
| article_links | JSON | Array of all linked articles with display text |
| categories | JSON | Inferred categories (e.g., ["Science", "Physics"]) |
| domain | VARCHAR | Broad domain classification |
| has_image | BOOLEAN | Whether the entry had an associated image |
| word_count | INTEGER | Word count of the content |
| source_url | VARCHAR | URL of the source archive page |

### Statistics

- **Total entries**: 118,258 (deduplicated)
- **Date range**: 2004-03-01 to 2026-03-20
- **Domains**: 14 categories
- **Quality**: 100% field coverage, verified

## Building the Database from Scratch

This section explains how to rebuild the entire dataset from the raw HTML source files.
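
Before rebuilding, it may help to pin down the row shape the build is expected to produce. The sketch below mirrors the `dyk_entries` columns listed under Dataset Structure as a plain Python record. It is not part of the dataset's tooling: `DykEntry` and `entry_from_row` are hypothetical names, and the element shape of `article_links` is left open because the README only describes it as an array of linked articles with display text.

```python
import datetime as dt
import json
from dataclasses import dataclass


@dataclass
class DykEntry:
    """Hypothetical in-memory mirror of one `dyk_entries` row."""
    date: dt.date
    raw_text: str
    content: str
    primary_article: str
    article_links: list  # decoded from the JSON column
    categories: list     # decoded from the JSON column
    domain: str
    has_image: bool
    word_count: int
    source_url: str


def _decode_json(value):
    """Accept either a JSON string or an already-decoded list."""
    if isinstance(value, str):
        return json.loads(value)
    return list(value or [])


def entry_from_row(row: dict) -> DykEntry:
    """Build a DykEntry from a mapping of column name to value."""
    day = row["date"]
    if isinstance(day, str):
        # Assumes ISO dates such as "2004-03-01".
        day = dt.date.fromisoformat(day)
    return DykEntry(
        date=day,
        raw_text=row["raw_text"],
        content=row["content"],
        primary_article=row["primary_article"],
        article_links=_decode_json(row["article_links"]),
        categories=_decode_json(row["categories"]),
        domain=row["domain"],
        has_image=bool(row["has_image"]),
        word_count=int(row["word_count"]),
        source_url=row["source_url"],
    )
```

Decoding the two JSON columns up front means downstream code can treat `categories` and `article_links` as ordinary lists regardless of how the row was fetched.
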

### Prerequisites

```bash
pip install duckdb requests beautifulsoup4
```

### HTML Source Files

The `html_sources/` directory contains 267 raw Wikipedia DYK archive pages:

- `cache/*.html` - Monthly archives from Wikipedia
- `source/*.html` - Historical archives and recent additions

Each HTML file contains DYK entries in the format:

```html
<li>... that [interesting fact]?</li>
```

### Build Pipeline

Run the 8-step pipeline:

```bash
# Step 1: Initialize database
python ingest/init_db.py

# Step 2: Ingest from HTML sources
python ingest/ingest.py --source html_sources/

# Step 3: Backfill primary_article from source files
python ingest/backfill_from_source.py

# Step 4: Backfill red links (articles that didn't exist when DYK was published)
python ingest/backfill_redlinks.py

# Step 5: Augment with category/domain metadata
python ingest/augment_categories.py --all

# Step 6: Fix image marker inconsistencies
python ingest/fix_image_markers.py

# Step 7: Remove duplicate entries
python ingest/deduplicate.py

# Step 8: Verify quality
python ingest/verify_dataset.py
```

Or use the master pipeline script:

```bash
python ingest/build_pipeline.py --all          # Full rebuild
python ingest/build_pipeline.py --verify-only  # Verify existing data
```

### Pipeline Details

#### Step 1: Init DB

Creates the `dyk_entries` table with indexes on `date` and `primary_article`.

#### Step 2: Ingest

Parses HTML files to extract:

- Entry text
- Date from section headers
- Wiki links from `<a href="/wiki/...">` tags

#### Step 3: Backfill from Source

For each entry missing `primary_article`:

1. Match by date (year and month from the filename)
2. Normalize content (remove special characters and spaces)
3. Fuzzy match to find the corresponding source entry
4. Extract article links from the matched source entry

#### Step 4: Backfill Red Links

For remaining unmatched entries, look for:

- Standard wiki links: `/wiki/Article_Name`
- URL-encoded links: `/wiki/Article_Name_%28film%29` (that is, `/wiki/Article_Name_(film)`)

#### Step 5: Augment Categories

Keyword-based classification into 40+ categories:

- Primary domains: Geography, History, Science, Arts, Sports, etc.
- Sub-categories: extracted from article names and content

#### Step 6: Fix Image Markers

Ensures the `has_image` boolean is consistent with the `(pictured)` / `(shown)` markers in the text.

#### Step 7: Deduplicate

Removes duplicate entries based on `(date, content)` pairs, keeping the first occurrence.

#### Step 8: Verify

Checks for:

- Null values in required fields
- Invalid JSON in `article_links`
- Duplicate entries
- Image marker consistency

## Publishing to Hugging Face

### 1. Create Hugging Face Dataset

```python
from huggingface_hub import HfApi

api = HfApi()

# Use the same repo_id here and in the upload step below
api.create_repo(
    repo_id="your-org/wikipedia-dyk-dataset",
    repo_type="dataset",
    exist_ok=True,
)
```

### 2. Upload Data

```python
from huggingface_hub import HfApi

api = HfApi()

# Upload the DuckDB file
api.upload_file(
    path_or_fileobj="wikipedia-dyk.duckdb",
    path_in_repo="wikipedia-dyk.duckdb",
    repo_id="your-org/wikipedia-dyk-dataset",
    repo_type="dataset",
)

# Upload HTML sources, keeping them under html_sources/ in the repo
api.upload_folder(
    folder_path="html_sources/",
    path_in_repo="html_sources",
    repo_id="your-org/wikipedia-dyk-dataset",
    repo_type="dataset",
    ignore_patterns=["*.git*"],
)
```

### 3. Create Dataset Card

Create `README.md` with:

- Dataset description
- Citation info
- License (CC BY-SA 4.0 for Wikipedia content)

## License

DYK entries are sourced from Wikipedia and are subject to the CC BY-SA 4.0 license. Linked articles remain the work of their respective contributors.

## Acknowledgments

Data sourced from [Wikipedia:Did you know archive](https://en.wikipedia.org/wiki/Wikipedia:Did_you_know_archive).
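
Steps 3, 6, and 7 under Pipeline Details are described only in prose, so a small illustration may be useful. The sketch below is not the code in `ingest/`; it shows one way the normalize-then-fuzzy-match, image-marker, and deduplication logic could look using only the standard library. All function names are hypothetical, and the use of `difflib.SequenceMatcher` with a 0.9 similarity threshold is an assumption, since the README does not say which fuzzy matcher or cutoff the pipeline uses.

```python
import re
from difflib import SequenceMatcher

# Markers named in Step 6; other marker variants are not handled here.
IMAGE_MARKER = re.compile(r"\((pictured|shown)\)", re.IGNORECASE)


def normalize(text: str) -> str:
    """Lowercase and strip punctuation, underscores, and whitespace (Step 3, item 2)."""
    return re.sub(r"[\W_]+", "", text.lower())


def best_match(content: str, candidates: list, threshold: float = 0.9):
    """Return the candidate most similar to `content`, or None (Step 3, item 3).

    `threshold` is an assumed cutoff, not a value taken from the pipeline.
    """
    target = normalize(content)
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


def expected_has_image(content: str) -> bool:
    """True when the text carries a (pictured) or (shown) marker (Step 6)."""
    return IMAGE_MARKER.search(content) is not None


def deduplicate(entries: list) -> list:
    """Keep the first entry for each (date, content) pair (Step 7)."""
    seen = set()
    unique = []
    for entry in entries:
        key = (entry["date"], entry["content"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique
```

In this sketch, candidates would be restricted to source entries from the same year and month before calling `best_match`, which follows item 1 of Step 3 and keeps the pairwise comparison small.
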
Provider:
CashlessConsumer