
cstr/de-wiktionary-sqlite-full

Source: Hugging Face · Updated 2025-11-21 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/cstr/de-wiktionary-sqlite-full
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- text-retrieval
language:
- de
tags:
- wiktionary
- dictionary
- german
- linguistics
- morphology
- semantics
- normalized
- lossless
size_categories:
- 1M<n<10M
---

# German Wiktionary - FULL Normalized SQLite Database

This is a **complete, lossless, and fully normalized** SQLite database of German Wiktionary, capturing 100% of the structured data from the `cstr/de-wiktionary-extracted` dataset.

It is designed for production-ready applications, complex linguistic analysis, and mobile apps (Flutter, React Native) that require a comprehensive local dictionary.

## 🎯 Key Features

- **✅ 100% Lossless**: All 30+ top-level and nested fields from the source JSONL are preserved.
- **⚡ Fast Queries**: Fully indexed schema for sub-20ms queries.
- **🔗 Full Semantic Network**: Includes all semantic relations (synonyms, antonyms, **hypernyms, hyponyms, meronyms, holonyms, coordinate_terms**).
- **🗣️ Rich Content**: Includes **expressions, proverbs, and entry notes** in addition to definitions and examples.
- **📱 Mobile-ready**: Optimized for `sqflite` (Flutter) and other local DB use cases.
- **(and all features from the standard DB: forms, translations, sounds, etc.)**

## 📊 Database Statistics

- **Entries**: 970,801
- **Word Senses**: 3,098,364
- **Definitions (Glosses)**: 3,087,300
- **Translations**: 1,131,251
- **Word Forms (Inflections)**: 6,100,090
- **Form Tags (Total)**: 25,966,680
- **Pronunciations (Sounds)**: 2,327,762
- **Usage Examples**: 427,322
- **Synonyms**: 161,563
- **Antonyms**: 76,054
- **Hypernyms**: 133,059
- **Hyponyms**: 217,179
- **Proverbs**: 1,078
- **Expressions**: 13,138
- **Descendants**: 211
- **Entry Notes**: 16,536
- **Unique Tags**: 185
- **Unique Topics**: 58
- **Unique Categories**: 352

## 🏗️ Database Schema (Full)

This schema includes all tables from the standard `de-wiktionary-sqlite-normalized` dataset, plus the following additions:

- **entries**: (New Columns)
  - `title`: The Wiktionary page title.
  - `redirect`: The page this entry redirects to (if any).
- **entry_notes**: (New Table) Free-text notes associated with an entry (e.g., "Es gibt etliche Belege für die Steigerung...").
- **other_pos**: (New Table) Alternative part-of-speech values for this word.
- **entry_raw_tags**: (New Table) Unparsed, raw tags from Wiktionary.
- **descendants**: (New Table) Words in other languages descended from this word.
- **hypernyms**: (New Table) Broader terms; the entry "is a" kind of the hypernym (e.g., "Tier" is a hypernym of "Hund").
- **hyponyms**: (New Table) Narrower terms; the hyponym is a "type of" the entry (e.g., "Hund" is a hyponym of "Tier").
- **holonyms**: (New Table) Wholes that the entry is "part of" (e.g., "Hand" is a holonym of "Finger").
- **meronyms**: (New Table) Parts that the entry "has" (e.g., "Finger" is a meronym of "Hand").
- **coordinate_terms**: (New Table) Sibling terms that share a hypernym (e.g., "Hund" and "Katze" are coordinate terms under "Haustier").
- **expressions**: (New Table) Idiomatic expressions using the word (linked to `sense_id`).
- **proverbs**: (New Table) Proverbs using the word (linked to `sense_id`).

*(For the standard schema, see the `cstr/de-wiktionary-sqlite-normalized` dataset card)*

## 📖 Usage

### Download

```python
import gzip
import shutil
import sqlite3

from huggingface_hub import hf_hub_download

# Download the database file (cached locally after the first call)
downloaded_path = hf_hub_download(
    repo_id="cstr/de-wiktionary-sqlite-full",
    filename="de_wiktionary_normalized_full.db",
    repo_type="dataset",
)

# Decompress only if the downloaded file is gzip-compressed (.gz);
# a plain .db file is used as-is
if downloaded_path.endswith(".gz"):
    db_path = downloaded_path[: -len(".gz")]
    with gzip.open(downloaded_path, "rb") as f_in, open(db_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
else:
    db_path = downloaded_path

# Connect
conn = sqlite3.connect(db_path)
```

### Example Query (New Tables)

```python
# Get all hypernyms (broader terms) for "Hund"
cursor = conn.cursor()
cursor.execute('''
    SELECT h.hypernym_word
    FROM entries e
    JOIN hypernyms h ON e.id = h.entry_id
    WHERE e.word = ?
      AND e.lang = 'Deutsch'
''', ('Hund',))
print("Hypernyms of 'Hund':", [row[0] for row in cursor.fetchall()])
```

## 🔗 Source

Original data: [cstr/de-wiktionary-extracted](https://huggingface.co/datasets/cstr/de-wiktionary-extracted)

## 📜 License

CC-BY-SA 4.0 (same as source)
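
Returning to the usage examples: the card names the new tables but does not list their columns, so it can help to inspect the schema before writing queries against them. The following is a minimal sketch using only SQLite's built-in `sqlite_master` catalog and `PRAGMA table_info`. The helper name `describe_tables` is hypothetical, and the in-memory `demo` database is a stand-in whose two tables contain only the columns referenced by the example query in this card, not the real schema.

```python
import sqlite3


def describe_tables(conn):
    """Return a dict mapping each user table name to its list of column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        # The table names come from sqlite_master itself, not from user input.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema


# Stand-in database (assumption) so the sketch runs without the download.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE entries (id INTEGER PRIMARY KEY, word TEXT, lang TEXT);
    CREATE TABLE hypernyms (entry_id INTEGER, hypernym_word TEXT);
    """
)

for table, columns in describe_tables(demo).items():
    print(table, columns)
```

With the real database, pass the `conn` opened in the download step instead of `demo`; the printed column names then show which fields are available in tables such as `hyponyms` or `proverbs`.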

Provider:
cstr