cstr/en-wiktionary-sqlite-full

Name: cstr/en-wiktionary-sqlite-full
Creator: cstr
Published: 2025-11-22 06:41:45
License: not specified in the listing (the dataset card below states CC-BY-SA 4.0)

Source: Hugging Face. Updated 2025-11-22; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/cstr/en-wiktionary-sqlite-full


Description:

---
license: cc-by-sa-4.0
task_categories:
- text-retrieval
language:
- en
tags:
- wiktionary
- dictionary
- english
- linguistics
- morphology
- semantics
- normalized
size_categories:
- 1M<n<10M
---

# English Wiktionary - Normalized SQLite Database

This is a normalized SQLite database of English Wiktionary that captures every field from the `cstr/en-wiktionary-extracted` dataset. It does **not** contain everything that wiktextract can extract with `--all`; translations, examples, and pronunciations in particular are absent (see the statistics below).

## 🎯 Key Features

- Fields captured include:
  - 🔗 **Wikilinks** in definitions (semantic connections)
  - 📝 **Qualifiers** (e.g., "archaic", "US", "informal")
  - 🏷️ **Sense IDs** (unique identifiers)
  - 🌐 **Wikidata IDs** (for semantic web linking)
  - 📚 **Attestations** (historical citations)
  - 🎭 **Head templates** (morphological data)
  - 📖 **Info templates** (structured metadata)
- **⚡ Fast Queries**: Fully indexed schema for sub-20ms queries
- **🔗 Semantic Web**: Relations preserved with sense-level granularity
- **📱 Mobile-ready**: Optimized for sq(f)lite (Flutter) and local DB use cases

## 📊 Database Statistics

- **Entries**: 1,243,200
- **Word Senses**: 1,361,968
- **Definitions (Glosses)**: 1,381,486
- **Wikilinks**: 2,585,821
- **Sense IDs**: 1,361,968
- **Qualifiers**: Embedded in senses
- **Translations**: 0
- **Word Forms**: 700,191
- **Head Templates**: 1,237,679
- **Pronunciations**: 0
- **Examples**: 0
- **Attestations**: 4,295
- **Wikidata IDs**: 2,309
- **Synonyms**: 214,838
- **Antonyms**: 11,816
- **Hypernyms**: 9,818
- **Hyponyms**: 22,649

## 🏗️ Database Schema (40+ Tables)

### New Tables (vs Previous Versions)

- **head_templates**: Morphological templates
- **entry_wikipedia**: Wikipedia cross-references
- **sense_links**: Wikilinks in definitions
- **sense_raw_tags**: Unstructured tags
- **sense_wikidata**: Wikidata identifiers
- **sense_wikipedia**: Wikipedia at sense level
- **attestations**: Historical citations
- **info_templates**: Structured metadata

### Core Tables

- **entries**: Core word data with etymology
- **senses**: Definitions with qualifier, senseid, head_nr
- **translations**: Multi-language translations (0 rows in this release, per the statistics above)
- **examples**: Usage examples (0 rows in this release, per the statistics above)
- **semantic relations**: hypernyms/hyponyms/meronyms/holonyms/coordinate_terms

## 📖 Usage

### Download

```python
from huggingface_hub import hf_hub_download
import sqlite3
import gzip
import shutil

# Download the compressed database (cached locally by huggingface_hub)
db_gz_path = hf_hub_download(
    repo_id="cstr/en-wiktionary-sqlite-full",
    filename="en_wiktionary_normalized_full.db.gz",
    repo_type="dataset"
)

# Decompress next to the downloaded file, stripping only the trailing ".gz"
db_path = db_gz_path.removesuffix('.gz')  # Python 3.9+
with gzip.open(db_gz_path, 'rb') as f_in:
    with open(db_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Connect
conn = sqlite3.connect(db_path)
```

### Example Queries

```python
# Reuse the connection opened in the download step
cursor = conn.cursor()

# Get definition with wikilinks for "dog"
cursor.execute('''
    SELECT g.gloss_text, GROUP_CONCAT(l.link_text, ', ') as links
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN glosses g ON s.id = g.sense_id
    LEFT JOIN sense_links l ON s.id = l.sense_id
    WHERE e.word = ? AND e.lang = 'English'
    GROUP BY g.id
''', ('dog',))
print(cursor.fetchall())

# Get words with specific qualifier (e.g., "archaic")
cursor.execute('''
    SELECT e.word, s.qualifier, g.gloss_text
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN glosses g ON s.id = g.sense_id
    WHERE s.qualifier LIKE '%archaic%'
    LIMIT 10
''')
print(cursor.fetchall())

# Find Wikidata ID for a sense
cursor.execute('''
    SELECT e.word, w.wikidata_id
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN sense_wikidata w ON s.id = w.sense_id
    WHERE e.word = ?
''', ('cat',))
print(cursor.fetchall())
```

## 📜 License

CC-BY-SA 4.0 (same as source)

## 🔄 Version

This version is lossless with respect to the source data: it captures all 40+ fields from `cstr/en-wiktionary-extracted`.
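
## 🔍 Inspecting a Local Copy

The table list and row counts given in the statistics and schema sections can be checked against a downloaded copy. The following is a minimal sketch, not part of the original dataset card: the helper name `table_row_counts` is hypothetical, and the demonstration runs on a throwaway in-memory database with a made-up miniature schema so that it works without downloading anything. It relies only on the standard `sqlite3` module and SQLite's built-in `sqlite_master` catalog.

```python
import sqlite3


def table_row_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Return {table_name: row_count} for every user table in the database."""
    # sqlite_master is SQLite's built-in catalog; skip internal sqlite_* tables.
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Identifiers cannot be bound as parameters, so quote them explicitly.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Demonstration on an in-memory database (hypothetical miniature schema,
# NOT the real layout of the dataset).
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, word TEXT, lang TEXT);
    CREATE TABLE senses (id INTEGER PRIMARY KEY, entry_id INTEGER, qualifier TEXT);
    INSERT INTO entries (word, lang) VALUES ('dog', 'English'), ('cat', 'English');
    INSERT INTO senses (entry_id, qualifier) VALUES (1, NULL);
""")
demo_counts = table_row_counts(demo)
print(demo_counts)
```

To run it against the real database, pass the `conn` object from the download step instead of `demo`. Counting rows in every table scans the full file, so expect this to take noticeably longer than the indexed lookups shown in the example queries.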

