cstr/en-wiktionary-sqlite-full

Source: Hugging Face. Updated 2025-11-22; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/cstr/en-wiktionary-sqlite-full
Description:
---
license: cc-by-sa-4.0
task_categories:
- text-retrieval
language:
- en
tags:
- wiktionary
- dictionary
- english
- linguistics
- morphology
- semantics
- normalized
size_categories:
- 1M<n<10M
---

# English Wiktionary - Normalized SQLite Database

This is a normalized SQLite database of English Wiktionary that captures every field from the `cstr/en-wiktionary-extracted` dataset. It does **not** include everything that wiktextract can extract with `--all`, such as translations and examples.

## 🎯 Key Features

- Captured fields include:
  - 🔗 **Wikilinks** in definitions (semantic connections)
  - 📝 **Qualifiers** (e.g., "archaic", "US", "informal")
  - 🏷️ **Sense IDs** (unique identifiers)
  - 🌐 **Wikidata IDs** (for semantic web linking)
  - 📚 **Attestations** (historical citations)
  - 🎭 **Head templates** (morphological data)
  - 📖 **Info templates** (structured metadata)
- **⚡ Fast Queries**: Fully indexed schema for sub-20ms queries
- **🔗 Semantic Web**: Relations preserved with sense-level granularity
- **📱 Mobile-ready**: Optimized for sq(f)lite (Flutter) and local DB use cases

## 📊 Database Statistics

- **Entries**: 1,243,200
- **Word Senses**: 1,361,968
- **Definitions (Glosses)**: 1,381,486
- **Wikilinks**: 2,585,821
- **Sense IDs**: 1,361,968
- **Qualifiers**: Embedded in senses
- **Translations**: 0
- **Word Forms**: 700,191
- **Head Templates**: 1,237,679
- **Pronunciations**: 0
- **Examples**: 0
- **Attestations**: 4,295
- **Wikidata IDs**: 2,309
- **Synonyms**: 214,838
- **Antonyms**: 11,816
- **Hypernyms**: 9,818
- **Hyponyms**: 22,649

## 🏗️ Database Schema (40+ Tables)

### New Tables (vs Previous Versions)

- **head_templates**: Morphological templates
- **entry_wikipedia**: Wikipedia cross-references
- **sense_links**: Wikilinks in definitions
- **sense_raw_tags**: Unstructured tags
- **sense_wikidata**: Wikidata identifiers
- **sense_wikipedia**: Wikipedia at sense level
- **attestations**: Historical citations
- **info_templates**: Structured metadata

### Core Tables

- **entries**: Core word data with etymology
- **senses**: Definitions with qualifier, senseid, head_nr
- **translations**: Multi-language translations (0 rows in this build, per the statistics above)
- **examples**: Usage examples (0 rows in this build, per the statistics above)
- **semantic relations**: hypernyms/hyponyms/meronyms/holonyms/coordinate_terms

## 📖 Usage

### Download

```python
from huggingface_hub import hf_hub_download
import sqlite3
import gzip
import shutil

# Download the compressed database from the Hugging Face Hub
db_gz_path = hf_hub_download(
    repo_id="cstr/en-wiktionary-sqlite-full",
    filename="en_wiktionary_normalized_full.db.gz",
    repo_type="dataset"
)

# Decompress next to the downloaded file (same path without the .gz suffix)
db_path = db_gz_path.replace('.gz', '')
with gzip.open(db_gz_path, 'rb') as f_in:
    with open(db_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

# Connect
conn = sqlite3.connect(db_path)
```

### Example Queries

```python
# Create a cursor on the connection opened in the Download step
cursor = conn.cursor()

# Get definitions with wikilinks for "dog"
cursor.execute('''
    SELECT g.gloss_text, GROUP_CONCAT(l.link_text, ', ') as links
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN glosses g ON s.id = g.sense_id
    LEFT JOIN sense_links l ON s.id = l.sense_id
    WHERE e.word = ? AND e.lang = 'English'
    GROUP BY g.id
''', ('dog',))
print(cursor.fetchall())

# Get words with a specific qualifier (e.g., "archaic")
cursor.execute('''
    SELECT e.word, s.qualifier, g.gloss_text
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN glosses g ON s.id = g.sense_id
    WHERE s.qualifier LIKE '%archaic%'
    LIMIT 10
''')
print(cursor.fetchall())

# Find the Wikidata IDs attached to the senses of a word
cursor.execute('''
    SELECT e.word, w.wikidata_id
    FROM entries e
    JOIN senses s ON e.id = s.entry_id
    JOIN sense_wikidata w ON s.id = w.sense_id
    WHERE e.word = ?
''', ('cat',))
print(cursor.fetchall())
```

## 📜 License

CC-BY-SA 4.0 (same as source)

## 🔄 Version

This is a **truly lossless** version with respect to the source data: it captures all 40+ fields from `cstr/en-wiktionary-extracted`.
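
To check the table list, row counts, and index coverage described in the schema and statistics sections, the schema can be read from SQLite's built-in `sqlite_master` catalog. The sketch below is illustrative only: the helper names (`list_tables`, `row_counts`, `list_indexes`) are hypothetical, and the in-memory `demo` database with its `idx_entries_word` index is a tiny stand-in, not the real schema.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all user tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND substr(name, 1, 7) != 'sqlite_' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def row_counts(conn):
    """Map each user table to its row count."""
    counts = {}
    for table in list_tables(conn):
        # Names come from sqlite_master; quote them so unusual names still work.
        quoted = '"' + table.replace('"', '""') + '"'
        counts[table] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


def list_indexes(conn):
    """Return (table, index) pairs for every index recorded in the schema."""
    return conn.execute(
        "SELECT tbl_name, name FROM sqlite_master "
        "WHERE type = 'index' ORDER BY tbl_name, name"
    ).fetchall()


# Stand-in database (assumption: a two-table toy schema, not the real one)
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, word TEXT, lang TEXT);
    CREATE TABLE senses (id INTEGER PRIMARY KEY, entry_id INTEGER, qualifier TEXT);
    CREATE INDEX idx_entries_word ON entries(word);
    INSERT INTO entries VALUES (1, 'dog', 'English');
    INSERT INTO senses VALUES (1, 1, NULL), (2, 1, 'archaic');
""")

print(list_tables(demo))
print(row_counts(demo))
print(list_indexes(demo))
```

Against the downloaded file, the same helpers can be called with the `conn` opened in the Download step, and the resulting counts compared with the figures in the Database Statistics section.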

Provider: cstr