CashlessConsumer/wikipedia-dyk-dataset
Source: Hugging Face · Updated 2026-03-29 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/CashlessConsumer/wikipedia-dyk-dataset
Resource description:
# Wikipedia DYK (Did You Know) Dataset
<a href="https://huggingface.co/datasets"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue"></a>
A comprehensive database of Wikipedia's "Did You Know" (DYK) entries: interesting facts drawn from Wikipedia's newest content that appeared on the Main Page.
## Quick Start
```python
import duckdb
# Connect to the dataset
conn = duckdb.connect('wikipedia-dyk.duckdb')
# Get a random fact
result = conn.execute("""
SELECT date, content, primary_article, domain
FROM dyk_entries
ORDER BY RANDOM()
LIMIT 1
""").fetchone()
print(f"{result[0]}: {result[1]}")
conn.close()
```
## Dataset Structure
### Tables
#### `dyk_entries`
| Column | Type | Description |
|--------|------|-------------|
| date | DATE | Date the fact appeared on Wikipedia's Main Page |
| raw_text | TEXT | Original wiki-formatted text of the DYK entry |
| content | TEXT | Clean plain text version of the fact |
| primary_article | VARCHAR | Main Wikipedia article the fact is about |
| article_links | JSON | Array of all linked articles with display text |
| categories | JSON | Inferred categories (e.g., ["Science", "Physics"]) |
| domain | VARCHAR | Broad domain classification |
| has_image | BOOLEAN | Whether the entry had an associated image |
| word_count | INTEGER | Word count of the content |
| source_url | VARCHAR | URL of the source archive page |
### Statistics
- **Total entries**: 118,258 (deduplicated)
- **Date range**: 2004-03-01 to 2026-03-20
- **Domains**: 14 categories
- **Quality**: 100% field coverage, checked by the Step 8 verification script (`ingest/verify_dataset.py`)
## Building the Database from Scratch
This section explains how to rebuild the entire dataset from the raw HTML source files.
### Prerequisites
```bash
pip install duckdb requests beautifulsoup4
```
### HTML Source Files
The `html_sources/` directory contains 267 raw Wikipedia DYK archive pages:
- `cache/*.html` - Monthly archives from Wikipedia
- `source/*.html` - Historical archives and recent additions
Each HTML file contains DYK entries in the format:
```html
<li>... that [interesting fact]?</li>
```
### Build Pipeline
Run the 8-step pipeline:
```bash
# Step 1: Initialize database
python ingest/init_db.py
# Step 2: Ingest from HTML sources
python ingest/ingest.py --source html_sources/
# Step 3: Backfill primary_article from source files
python ingest/backfill_from_source.py
# Step 4: Backfill red links (articles that didn't exist when DYK was published)
python ingest/backfill_redlinks.py
# Step 5: Augment with category/domain metadata
python ingest/augment_categories.py --all
# Step 6: Fix image marker inconsistencies
python ingest/fix_image_markers.py
# Step 7: Remove duplicate entries
python ingest/deduplicate.py
# Step 8: Verify quality
python ingest/verify_dataset.py
```
Or use the master pipeline script:
```bash
python ingest/build_pipeline.py --all # Full rebuild
python ingest/build_pipeline.py --verify-only # Verify existing data
```
### Pipeline Details
#### Step 1: Init DB
Creates the `dyk_entries` table with indexes on `date` and `primary_article`.
#### Step 2: Ingest
Parses HTML files to extract:
- Entry text
- Date from section headers
- Wiki links from `<a href="/wiki/...">` tags
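
The extraction of entry text and wiki links can be sketched as follows. This sketch uses the standard-library `html.parser` instead of BeautifulSoup to stay dependency-free, and the class name `DykListParser` is hypothetical. Date extraction is left out because the header markup is not described here.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class DykListParser(HTMLParser):
    """Collect the text and /wiki/ links of every <li> element."""

    def __init__(self):
        super().__init__()
        self.entries = []     # finished entries: {"text": ..., "links": [...]}
        self._current = None  # entry being built while inside an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = {"text": "", "links": []}
        elif tag == "a" and self._current is not None:
            href = dict(attrs).get("href") or ""
            if href.startswith("/wiki/"):
                # Turn "/wiki/Article_Name" into "Article Name"
                title = unquote(href[len("/wiki/"):]).replace("_", " ")
                self._current["links"].append(title)

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            # Collapse runs of whitespace left over from the markup
            self._current["text"] = " ".join(self._current["text"].split())
            self.entries.append(self._current)
            self._current = None


# Made-up sample in the <li>... that [fact]?</li> format
sample = (
    '<ul><li>... that the <a href="/wiki/Example_Article">example article</a> '
    'is a placeholder?</li></ul>'
)
parser = DykListParser()
parser.feed(sample)
parser.close()
print(parser.entries)
```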
#### Step 3: Backfill from Source
For each entry missing `primary_article`:
1. Match by date (year+month from filename)
2. Normalize content (remove special chars, spaces)
3. Fuzzy match to find corresponding source entry
4. Extract article links from matched source
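
The normalize-then-fuzzy-match steps above can be sketched with the standard library. The use of `difflib.SequenceMatcher`, the 0.8 threshold, and the function names are assumptions; the source does not say which similarity measure `backfill_from_source.py` uses.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def best_match(content, candidates, threshold=0.8):
    """Return the candidate most similar to content, or None below threshold."""
    target = normalize(content)
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, target, normalize(cand["text"])).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None


# Made-up source entries that would share the entry's year and month
same_month_sources = [
    {"text": "... that the Example Bridge (pictured) opened in 1900?",
     "links": ["Example Bridge"]},
    {"text": "... that an unrelated fact exists?",
     "links": ["Unrelated Fact"]},
]

match = best_match("that the Example Bridge opened in 1900", same_month_sources)
# Treating the first link as the primary article is an assumption.
primary_article = match["links"][0] if match else None
print(primary_article)
```

Restricting candidates to the same year and month first keeps the pairwise comparison small, which matters because `SequenceMatcher` is quadratic in the worst case.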
#### Step 4: Backfill Red Links
For remaining unmatched entries, look for:
- Standard wiki links: `/wiki/Article_Name`
- URL-encoded links: `/wiki/Article_Name_%28film%29` (decodes to `Article_Name_(film)`)
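
A small sketch of turning either kind of href into an article title, assuming percent-decoding with `urllib.parse.unquote` and underscore-to-space conversion. The helper name `title_from_href` is hypothetical.

```python
from urllib.parse import unquote


def title_from_href(href):
    """Convert a /wiki/ href into a readable article title, or None."""
    prefix = "/wiki/"
    if not href.startswith(prefix):
        return None
    # Decode %XX escapes, then replace underscores with spaces
    return unquote(href[len(prefix):]).replace("_", " ")


print(title_from_href("/wiki/Article_Name"))             # Article Name
print(title_from_href("/wiki/Article_Name_%28film%29"))  # Article Name (film)
```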
#### Step 5: Augment Categories
Keyword-based classification into 40+ categories:
- Primary domains: Geography, History, Science, Arts, Sports, etc.
- Sub-categories: extracted from article names and content
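
To illustrate keyword-based classification, here is a minimal sketch. The keyword lists, the `"Other"` fallback, and the scoring rule (count of matching keywords) are all hypothetical; the vocabulary used by `augment_categories.py` is not shown in this README.

```python
import re

# Hypothetical keyword lists, for illustration only
DOMAIN_KEYWORDS = {
    "Geography": {"river", "mountain", "island", "village"},
    "Science": {"species", "physicist", "asteroid", "fossil"},
    "Sports": {"footballer", "olympic", "cricketer", "championship"},
}


def classify(content, default="Other"):
    """Pick the domain whose keywords occur most often in the content."""
    words = set(re.findall(r"[a-z]+", content.lower()))
    scores = {domain: len(words & kws) for domain, kws in DOMAIN_KEYWORDS.items()}
    domain = max(scores, key=scores.get)
    # Fall back to the default when no keyword matched at all
    return domain if scores[domain] > 0 else default


print(classify("... that the fossil of a new species was found on an island?"))
```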
#### Step 6: Fix Image Markers
Ensures the `has_image` flag is consistent with the presence of `(pictured)` / `(shown)` in the entry text.
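
One way this consistency fix could work, assuming the text is treated as the source of truth and the flag is recomputed from it. That direction, the exact regular expression, and the function name are assumptions.

```python
import re

# Matches the literal markers named above, case-insensitively
IMAGE_MARKER = re.compile(r"\((?:pictured|shown)\)", re.IGNORECASE)


def fix_image_marker(entry):
    """Return a copy of entry whose has_image flag matches its text."""
    fixed = dict(entry)
    fixed["has_image"] = bool(IMAGE_MARKER.search(entry["content"]))
    return fixed


entry = {
    "content": "... that the Example Bridge (pictured) opened in 1900?",
    "has_image": False,
}
fixed = fix_image_marker(entry)
print(fixed["has_image"])  # True
```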
#### Step 7: Deduplicate
Removes duplicate entries based on `(date, content)` pairs, keeping first occurrence.
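
The keep-first rule can be sketched in plain Python on in-memory rows. The real `deduplicate.py` presumably operates on the DuckDB table, so this only illustrates the logic.

```python
def deduplicate(rows):
    """Drop rows whose (date, content) pair was already seen, keeping the first."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["date"], row["content"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows: the first two share the same (date, content) pair
rows = [
    {"date": "2004-03-01", "content": "fact A", "source_url": "first"},
    {"date": "2004-03-01", "content": "fact A", "source_url": "second"},
    {"date": "2004-03-02", "content": "fact A", "source_url": "third"},
]
unique_rows = deduplicate(rows)
print([r["source_url"] for r in unique_rows])  # ['first', 'third']
```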
#### Step 8: Verify
Checks for:
- Null values in required fields
- Invalid JSON in article_links
- Duplicate entries
- Image marker consistency
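
The four checks can be sketched as a single pass over the rows. The list of required fields and the problem messages are assumptions, and the rows are plain dictionaries rather than query results.

```python
import json
import re

# Assumed set of required fields; the real script may check others
REQUIRED_FIELDS = ("date", "content", "primary_article", "domain")
IMAGE_MARKER = re.compile(r"\((?:pictured|shown)\)", re.IGNORECASE)


def verify(rows):
    """Return a list of (row_index, problem) tuples; empty means clean."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # 1. Null values in required fields
        for field in REQUIRED_FIELDS:
            if row.get(field) is None:
                problems.append((i, f"null {field}"))
        # 2. Invalid JSON in article_links
        try:
            json.loads(row.get("article_links"))
        except (TypeError, ValueError):
            problems.append((i, "invalid article_links JSON"))
        # 3. Duplicate (date, content) pairs
        key = (row.get("date"), row.get("content"))
        if key in seen:
            problems.append((i, "duplicate (date, content)"))
        seen.add(key)
        # 4. Image marker consistency
        has_marker = bool(IMAGE_MARKER.search(row.get("content") or ""))
        if has_marker != bool(row.get("has_image")):
            problems.append((i, "has_image disagrees with text"))
    return problems


good = {
    "date": "2004-03-01",
    "content": "... that the Example Bridge (pictured) opened in 1900?",
    "primary_article": "Example Bridge",
    "domain": "Geography",
    "article_links": '["Example Bridge"]',
    "has_image": True,
}
bad = dict(good, article_links="[not json", has_image=False)
problems = verify([good, bad])
print(problems)
```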
## Publishing to HuggingFace
### 1. Create HuggingFace Dataset
```python
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id="your-org/wikipedia-dyk-dataset", repo_type="dataset", exist_ok=True)
```
### 2. Upload Data
```python
from huggingface_hub import HfApi
api = HfApi()
# Upload the DuckDB file
api.upload_file(
path_or_fileobj="wikipedia-dyk.duckdb",
path_in_repo="wikipedia-dyk.duckdb",
repo_id="your-org/wikipedia-dyk-dataset",
repo_type="dataset",
)
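# Upload the HTML sources, keeping them under html_sources/ in the repo
# (without path_in_repo the files would land in the repo root)
api.upload_folder(
folder_path="html_sources/",
path_in_repo="html_sources",
repo_id="your-org/wikipedia-dyk-dataset",
repo_type="dataset",
ignore_patterns=["*.git*"],
)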
```
### 3. Create Dataset Card
Create `README.md` with:
- Dataset description
- Citation info
- License (CC BY-SA 4.0 for Wikipedia content)
## License
The DYK entries are sourced from Wikipedia and are licensed under CC BY-SA 4.0. The linked articles remain the work of their respective contributors.
## Acknowledgments
Data sourced from [Wikipedia:Did you know archive](https://en.wikipedia.org/wiki/Wikipedia:Did_you_know_archive).
Provider:
CashlessConsumer



