cstr/en-wiktionary-sqlite-all

Hugging Face · Updated 2025-11-24 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/cstr/en-wiktionary-sqlite-all
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- text-retrieval
language:
- en
tags:
- wiktionary
- dictionary
- english
- linguistics
- morphology
- semantics
- normalized
- lossless
size_categories:
- 1M<n<10M
---

# English Wiktionary - TRULY LOSSLESS Normalized SQLite Database

This is a **100% lossless, fully normalized** SQLite database of English Wiktionary, capturing EVERY field from the `cstr/en-wiktionary-extracted-all` dataset.

## 🎯 Key Features

- **✅ 100% Lossless**: ALL fields captured, including:
  - 🔗 **Wikilinks** in definitions (semantic connections)
  - 📝 **Qualifiers** (e.g., "archaic", "US", "informal")
  - 🏷️ **Sense IDs** (unique identifiers)
  - 🌐 **Wikidata IDs** (for semantic web linking)
  - 📚 **Attestations** (historical citations)
  - 🎭 **Head templates** (morphological data)
  - 📖 **Info templates** (structured metadata)
- **⚡ Fast Queries**: Fully indexed schema for sub-20ms queries
- **🔗 Complete Semantic Web**: All relations preserved with sense-level granularity
- **📱 Mobile-ready**: Optimized for sqflite (Flutter) and local DB use cases

## 📊 Database Statistics

- **Entries**: 1,427,190
- **Word Senses**: 1,705,177
- **Definitions (Glosses)**: 1,753,523
- **Wikilinks**: 3,476,476
- **Sense IDs**: 1,705,177
- **Qualifiers**: Embedded in senses
- **Translations**: 3,394,497
- **Word Forms**: 944,222
- **Head Templates**: 1,424,086
- **Pronunciations**: 519,225
- **Examples**: 708,542
- **Attestations**: 12,434
- **Wikidata IDs**: 6,652
- **Synonyms**: 573,360
- **Antonyms**: 36,234
- **Hypernyms**: 23,416
- **Hyponyms**: 4,576

## 🏗️ Database Schema (40+ Tables)

### New Tables (vs Previous Versions)

- **head_templates**: Morphological templates
- **entry_wikipedia**: Wikipedia cross-references
- **sense_links**: Wikilinks in definitions
- **sense_raw_tags**: Unstructured tags
- **sense_wikidata**: Wikidata identifiers
- **sense_wikipedia**: Wikipedia at sense level
- **attestations**: Historical citations
- **info_templates**: Structured metadata

### Core Tables

- **entries**: Core word data with etymology
- **senses**: Definitions with qualifier, senseid, head_nr
- **translations**: Multi-language translations
- **examples**: Usage examples
- **semantic relations**: hypernyms/hyponyms/meronyms/holonyms/coordinate_terms

## 📖 Usage

### Download

```python
from huggingface_hub import hf_hub_download
import sqlite3
import gzip
import shutil

# Download the compressed database into the local Hugging Face cache
db_gz_path = hf_hub_download(
    repo_id="cstr/en-wiktionary-sqlite-all",
    filename="en_wiktionary_normalized_all.db.gz",
    repo_type="dataset"
)

# Decompress next to the downloaded file.
# removesuffix (Python 3.9+) strips only the trailing ".gz", not other occurrences in the path.
db_path = db_gz_path.removesuffix('.gz')
with gzip.open(db_gz_path, 'rb') as f_in:
    with open(db_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Connect
conn = sqlite3.connect(db_path)
```

### Example Queries

```python
# Create a cursor on the connection opened in the Download step
cursor = conn.cursor()

# Get definition with wikilinks for "dog"
cursor.execute('''
    SELECT g.gloss_text, GROUP_CONCAT(l.link_text, ', ') as links
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN glosses g ON s.id = g.sense_id
    LEFT JOIN sense_links l ON s.id = l.sense_id
    WHERE e.word = ? AND e.lang = 'English'
    GROUP BY g.id
''', ('dog',))
print(cursor.fetchall())

# Get words with specific qualifier (e.g., "archaic")
cursor.execute('''
    SELECT e.word, s.qualifier, g.gloss_text
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN glosses g ON s.id = g.sense_id
    WHERE s.qualifier LIKE '%archaic%'
    LIMIT 10
''')
print(cursor.fetchall())

# Find Wikidata ID for a sense
cursor.execute('''
    SELECT e.word, w.wikidata_id
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN sense_wikidata w ON s.id = w.sense_id
    WHERE e.word = ?
''', ('cat',))
print(cursor.fetchall())
```

## 📜 License

CC-BY-SA 4.0 (same as source)

## 🔄 Version

This is a **truly lossless** version capturing all 40+ fields from the source data.
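
The schema section names only some of the 40+ tables, so it can help to inspect the database directly before writing queries. The sketch below is one way to do that with the standard `sqlite3` module; the helper names `list_tables` and `table_columns` are hypothetical and not part of the dataset, and the code assumes nothing about the schema beyond it being an ordinary SQLite file.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all user tables in the database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    # Skip SQLite's internal bookkeeping tables (e.g. sqlite_sequence).
    return [name for (name,) in rows if not name.startswith("sqlite_")]


def table_columns(conn: sqlite3.Connection, table: str) -> list[tuple[str, str]]:
    """Return (column_name, declared_type) pairs for one table."""
    # PRAGMA arguments cannot be bound as parameters, so the name is
    # checked against the real table list before being interpolated.
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table!r}")
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2]) for row in rows]
```

With the connection from the Download step, `list_tables(conn)` would show every table name, and `table_columns(conn, "senses")` would show the columns available for joins such as those in the example queries.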
Provider:
cstr